A Thorough Guide to High Availability, HA Cluster Architecture & Fault Tolerance
Highly available computing infrastructure is the norm in the computing industry today. On cloud platforms in particular, it is the key feature that lets the workloads running on them stay highly available.
This write-up is an insight into high availability. It answers the common questions about it: What is it? Why is it so important to businesses? What is a highly available cluster? How do cloud platforms ensure high availability of the services running on them? What are fault tolerance & redundancy? How are they related to high availability?
So, without any further ado.
Let’s get on with it.
1. What is High Availability?
High availability, also known as HA, is the ability of a system to stay online despite failures at the infrastructural level. A highly available system is designed to stay up for a far larger share of the time than an ordinary one. It improves the reliability of the system & keeps downtime to a minimum.
The sole mission of highly available systems is to stay online & stay connected. A very basic example of this is having backup generators to ensure continuous power supply in case of power outages.
In the industry, HA is often expressed as a percentage. For instance, saying a system is 99.99999% available simply means the service will be up for 99.99999% of the total hosting time. You will often see this figure in the SLAs (service level agreements) of cloud platforms.
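To get a feel for what these percentages mean in practice, here is a minimal Python sketch that converts an availability percentage into the downtime it permits per year. The function name and the 365-day year are my own assumptions for illustration, not something taken from a particular SLA.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: a 365-day year, leap years ignored


def allowed_downtime_seconds(availability_percent: float,
                             period_seconds: float = SECONDS_PER_YEAR) -> float:
    """Return the maximum downtime (in seconds) a given availability allows in the period."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return period_seconds * (1 - availability_percent / 100)


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_seconds(pct) / 60
        print(f"{pct}% available -> roughly {minutes:.1f} minutes of downtime per year")
```

By the same arithmetic, the 99.99999% figure mentioned above leaves room for only about three seconds of downtime in a whole year.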
2. How Important Is High Availability to Online Services?
It might not impact businesses that much if social applications go down for a bit & then bounce back, but there are mission-critical systems like aircraft systems, spacecraft, mining machines, hospital servers & stock market systems that just cannot afford to go down at any time. After all, lives depend on them.
The smooth functioning of mission-critical systems relies on continual connectivity with their network/servers. These are the instances when we just cannot do without highly available infrastructure.
Besides, no service likes to go down, critical or not.
To meet high availability requirements, systems are designed to be fault tolerant & their components are made redundant.
What are fault tolerance & redundancy in system design? I’ll talk about all that ahead in the article.
3. Reasons for System Failure
Before delving into HA system design, fault tolerance & redundancy, I’ll first talk about the common reasons why systems fail.
I am sure you are pretty familiar with software crashes. Applications crash all the time, be it on a mobile or a desktop.
Software files get corrupted. The OS crashes; remember the BSOD (blue screen of death) in Windows? Likewise, software running on cloud nodes can crash unpredictably & take the entire node down with it.
Another reason for system failure is hardware crashes: overloaded CPU or RAM, nodes going down, network outages.
Human error is the biggest reason for system failures: flawed configurations & the like.
Besides the unplanned crashes, there are planned downtimes which involve routine maintenance operations, patching of software, hardware upgrades etc.
These are the primary reasons for system failures. Now let’s talk about how HA systems are designed to overcome these scenarios of system downtime.
4. How to Achieve High Availability? HA System Design Approaches
There are several approaches to achieve HA. The most important of them is to make the system fault tolerant.
4.1 What is Fault Tolerance?
Fault tolerance is the ability of a system to stay up despite taking hits, such as nodes going down or flapping on & off. The system is equipped to handle faults. Being fault tolerant is an essential element in designing life-critical systems.
In the case of internal failures, the system may work at a reduced level, but it will not go down entirely.
A very basic example of a fault-tolerant system is a social networking application. In the case of backend node failures, a few services of the app such as image upload, post likes etc. may stop working, but the application as a whole will still be up. This approach is technically known as fail soft.
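Here is a rough Python sketch of the fail soft idea, assuming a home page that needs a feed & a likes counter. All the function names are hypothetical, & the likes backend is hard-coded to fail just to simulate an outage.

```python
def load_feed():
    """Hypothetical core feature: returns the posts for the home page."""
    return ["post-1", "post-2"]


def load_like_counts():
    """Hypothetical secondary feature; here it always fails to simulate a backend outage."""
    raise ConnectionError("likes service is down")


def render_home_page():
    """Build the page, degrading gracefully if the likes backend is unavailable."""
    page = {"feed": load_feed()}
    try:
        page["likes"] = load_like_counts()
    except ConnectionError:
        # Fail soft: drop the likes feature instead of failing the whole page.
        page["likes"] = None
    return page


if __name__ == "__main__":
    print(render_home_page())
```

The page still renders with the feed intact; only the likes part is missing until that backend recovers.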
4.2 Designing Highly Available Fault-Tolerant Services – Architecture
At the application level, to achieve high availability, the massive service is architecturally broken down into smaller, loosely coupled services called microservices.
There are many upsides to splitting a big monolith into several microservices: easier management, easier development, ease of adding new features & easier maintenance. But the biggest upside is high availability.
Every microservice takes the onus of running a different feature of the application such as image upload, comments, instant messaging etc.
So, even if a few services go down, the application as a whole is still up.
At the infrastructural level, there are many ways to make our systems fault-tolerant.
Redundancy – Active Passive HA Mode
Redundancy is duplicating the components/instances & keeping them on standby to take over in case the active instances go down. It’s the fail-safe, backup mechanism.
In the above diagram, you can see the instances active & on standby. The standby instances take over in case any of the active instances goes down.
This approach is also known as Active Passive HA mode. An initial set of nodes is active & a set of redundant nodes is passive, on standby. In case of failures, the passive nodes take over from the active nodes.
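The Active Passive idea can be sketched in a few lines of Python. This is a toy model under my own assumptions: `Node` & `ActivePassivePair` are made-up names, & a node's health is just a boolean flag rather than a real health probe.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    healthy: bool = True  # assumption: health is a simple flag set by some external check


class ActivePassivePair:
    """One active node serves traffic; the standby takes over only on failure."""

    def __init__(self, active: Node, standby: Node):
        self.active = active
        self.standby = standby

    def failover(self) -> None:
        """Promote the standby to active, demoting the failed node to standby."""
        if not self.standby.healthy:
            raise RuntimeError("no healthy node available")
        self.active, self.standby = self.standby, self.active

    def handle(self, request: str) -> str:
        """Serve the request from the active node, failing over first if it is down."""
        if not self.active.healthy:
            self.failover()
        return f"{self.active.name} served {request}"


if __name__ == "__main__":
    pair = ActivePassivePair(Node("primary"), Node("standby"))
    print(pair.handle("request-1"))
    pair.active.healthy = False  # simulate the active node going down
    print(pair.handle("request-2"))
```

Note that the standby does no work until the failover happens, which is exactly what separates this mode from the Active-Active mode.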
There are systems like GPS, aircraft & communication satellites that are built for near-zero downtime. The availability of these systems is ensured by making their components heavily redundant.
Getting Rid of Single Points of Failure
Distributed systems got so popular largely because with them we could get rid of the single points of failure present in a monolithic architecture.
A large number of distributed nodes work in conjunction with each other to maintain a single synchronized application state.
When so many redundant nodes are deployed, no single node is a point of failure for the whole system.
In case a node goes down, redundant nodes take its place. The system as a whole remains unimpacted.
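To see why redundant nodes help so much, here is a small Python sketch of the usual back-of-the-envelope calculation. It rests on two simplifying assumptions of mine: nodes fail independently of each other, & the system is up as long as at least one node is up. Real deployments rarely satisfy the first assumption perfectly.

```python
def combined_availability(node_availability: float, redundant_nodes: int) -> float:
    """Availability of a group of identical nodes where any one node can serve.

    Assumes independent failures: the group is down only when every node is down.
    """
    if not 0 <= node_availability <= 1:
        raise ValueError("node_availability must be a fraction between 0 and 1")
    if redundant_nodes < 1:
        raise ValueError("need at least one node")
    probability_all_down = (1 - node_availability) ** redundant_nodes
    return 1 - probability_all_down


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} node(s) at 99% each -> {combined_availability(0.99, n):.6f}")
```

Under these assumptions, two nodes that are each up 99% of the time give a combined availability of 99.99%.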
Single points of failure at the application level mean bottlenecks. We should detect bottlenecks in performance testing & get rid of them as soon as we can. For more info on this, do go through the scalability 101 write-up.
Monitoring & Automation
Systems should be well monitored in real time to detect any bottlenecks or single points of failure. Automation enables the instances to self-recover without any human intervention. It gives the instances the power of self-healing.
Also, the systems become intelligent enough to add or remove instances on the fly as per the requirements. Read: How does autoscaling work in the cloud?
Since the most common cause of failures is human error, automation helps cut down failures to a large extent.
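A single monitoring pass with self-healing could look roughly like the Python sketch below. Everything here is hypothetical: the health checks are plain callables & the `restart` callback stands in for whatever your platform uses to replace an instance. In a real setup, a pass like this would run on a schedule.

```python
from typing import Callable, Dict, List


def heal_unhealthy(instances: Dict[str, Callable[[], bool]],
                   restart: Callable[[str], None]) -> List[str]:
    """Run one monitoring pass and restart every instance whose health check fails.

    Returns the names of the instances that were restarted.
    """
    restarted = []
    for name, is_healthy in instances.items():
        try:
            ok = is_healthy()
        except Exception:
            # A health check that raises is treated the same as an unhealthy instance.
            ok = False
        if not ok:
            restart(name)
            restarted.append(name)
    return restarted


if __name__ == "__main__":
    checks = {"web-1": lambda: True, "web-2": lambda: False}
    print(heal_unhealthy(checks, restart=lambda name: print(f"restarting {name}")))
```

No human is involved between detecting the failure & triggering the restart, which is the whole point of automation here.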
Replication – Active-Active HA mode
Replication means having a number of similar nodes running the workload together. There are no standby or passive instances. When one or a few nodes go down, the remaining nodes bear the load of the service. Think of this as load balancing.
This approach is also known as the Active-Active High Availability mode in which all the components of the system are active at any point in time.
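Here is a minimal Python sketch of the Active-Active idea, assuming a simple round-robin policy over whichever nodes are currently up. The class & method names are made up for illustration, & round robin is just one of several possible load balancing policies.

```python
from itertools import count


class ActiveActivePool:
    """All nodes serve traffic; requests are spread round-robin over the healthy ones."""

    def __init__(self, nodes):
        self.health = {name: True for name in nodes}
        self._counter = count()

    def mark_down(self, name: str) -> None:
        self.health[name] = False

    def mark_up(self, name: str) -> None:
        self.health[name] = True

    def pick(self) -> str:
        """Return the next healthy node; the survivors absorb the load of failed nodes."""
        alive = [name for name, ok in self.health.items() if ok]
        if not alive:
            raise RuntimeError("no healthy node available")
        return alive[next(self._counter) % len(alive)]


if __name__ == "__main__":
    pool = ActiveActivePool(["node-a", "node-b", "node-c"])
    print([pool.pick() for _ in range(3)])
    pool.mark_down("node-b")  # simulate a node failure
    print([pool.pick() for _ in range(4)])
```

Unlike the Active Passive setup, there is no promotion step: the failed node simply drops out of the rotation.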
Geographical Distribution Of Workload
As a contingency for natural disasters, regional data centre power outages & other large-scale failures, workloads are spread across data centres in different geographical zones across the world.
This avoids a single data centre becoming a single point of failure. Latency is also reduced by quite an extent due to the proximity of data to the user.
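Both benefits, failover across regions & lower latency, can be illustrated with a tiny Python sketch that routes a user to the closest region that is still healthy. The region names & latency numbers are invented for the example; a real setup would typically rely on DNS or a global load balancer for this.

```python
def choose_region(latency_ms: dict, healthy: set) -> str:
    """Pick the healthy region with the lowest measured latency for this user."""
    candidates = {region: ms for region, ms in latency_ms.items() if region in healthy}
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=candidates.get)


if __name__ == "__main__":
    # Hypothetical latencies from one user's location to each region.
    latencies = {"eu-west": 20, "us-east": 90, "ap-south": 150}
    print(choose_region(latencies, healthy={"eu-west", "us-east", "ap-south"}))
    # If the nearest region goes dark, traffic falls back to the next closest one.
    print(choose_region(latencies, healthy={"us-east", "ap-south"}))
```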
All highly available, fault-tolerant design decisions are subject to questions such as: How critical is the system? What are the odds that the components will fail?
Businesses often deploy their workloads on multiple cloud platforms, which further improves availability. If things go south with one cloud provider, they have another to fail over to.
5. What is High Availability Clustering?
Now that we are clear on high availability, let’s zoom out of the architecture a bit. Till now, to ensure HA, we had multiple machines running together. Now we have clusters of HA machines running together to ensure HA at the cluster level.
Multiple HA clusters may run together in one geographical zone, ensuring minimum downtime & continual service. All the clusters maintain a single state with the help of a shared distributed memory.
6. More On the Blog
Well, Guys!! This is pretty much it on high availability. If you liked the article, share it with your folks.
I’ll see you in the next write-up.