One of the recommendations to thwart SPOFs is to build redundancies. Multiple Instances a critical component (e.g. power supply, network connection, DNS server) operate in parallel. If either fails, the system continues to operate without loss of performance.
Redundancy also avoids many SPOFs on the software side. We can, for example, cite the opposition between popular microservices on the one hand and monolithic software on the other. A system composed of microservices is decoupled and less complex, which makes it more robust against SPOFs. Since microservices are launched as containers, it makes it easy to build redundancies.
But how exactly does redundancy protect a system? Let’s use the reliability estimate of a system, also known as Lusser’s Law to illustrate. Here is an example of reflection:
Consider that a system has two independent and parallel connections to a source of electricity. Also consider that the probability that the connection fails within a given time is 1%. Therefore, the probability of a complete failure of the electricity connection can be calculated as the product of the two probabilities:
- Instance failure probability:
1% = 1/100 = 1/10^2 = 0.01
- Probability that two instances fail one after the other:
1% * 1% = (1/10^2)^2 = 1/10^4 = 0.0001
As you can see, the probability of a SPOF occurring is not halved when two instances are running, but reduced by two powers of 10. This is a considerable improvement. If three instances are running in parallel, failure of the entire system should be near impossible.
Unfortunately, redundancy is not a panacea. Rather, it can be said to protect a system against SPOFs within certain assumptions. To begin with, the probability that an instance will fail must be independent the probability of failure of the instance or redundant instances. This is not the case when the failure is due to an external event. If a data center catches fire, the redundant components will fail together.
In addition to the redundancy of deployed components, the distribution of certain components is critical to mitigate SPOFs. The geographical distribution of data storage and IT infrastructure protects against natural disasters. Moreover, aiming for a certain heterogeneity or diversity of the critical components of a system often pays off. Diversity reduces the likelihood that redundant instances will fail.
Let’s illustrate theadvantage of diversity using the example of cybersecurity. Imagine a data center with redundant load balancers designed on the exact same model. A security vulnerability in one of the load balancers will also present itself in the redundant instances. In the worst case, an attack will cripple all instances. By using different models, the system as a whole will have a better chance of continuing to operate at a reduced level of performance.