High availability (HA) is the ability of a system or system component to remain continuously operational for a desirably long period of time. Availability is often measured against an ideal of "100% operational" or "never failing." In information technology (IT), a widely held but difficult-to-achieve standard of availability for a system or product is known as "five nines" (99.999 percent) availability.
Availability experts emphasize that, for any system to be highly available, its component parts should be well designed and thoroughly tested before they are put into use. Because a computer system or network consists of many parts, all of which usually need to be operational for the whole to function, much high-availability planning centers on backup and failover processing and on data storage and access.
How availability is measured
Typically, an availability percentage is calculated as follows:
Availability = (minutes in a month – minutes of downtime) * 100/minutes in a month
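The formula above can be expressed as a short function. This is an illustrative sketch, not code from any particular monitoring product; the function name and sample figures are made up for the example.

```python
def availability_pct(minutes_in_month: float, downtime_minutes: float) -> float:
    """Availability = (minutes in a month - minutes of downtime) * 100 / minutes in a month."""
    return (minutes_in_month - downtime_minutes) * 100 / minutes_in_month

# A 30-day month has 30 * 24 * 60 = 43,200 minutes.
# 43.2 minutes of downtime in such a month works out to "three nines":
print(round(availability_pct(43_200, 43.2), 3))  # → 99.9
```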
A service provider will typically provide availability metrics in their service level agreements (SLAs). Since system maintenance and planned downtime are a part of life, an HA system or system component is not expected to be available 100% of the time.
If the service level agreement for availability is 99.999%, the end user can expect the service to be unavailable for no more than about 5 minutes and 15.6 seconds per year.
To provide context, if a company adheres to the "three nines" standard (99.9%), that allows for about 8 hours and 45 minutes of system downtime over the course of one year. A "two nines" standard is even more dramatic: 99% availability equals roughly three and a half days (about 87.6 hours) of downtime in a year.
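The downtime figures above can be reproduced by turning each availability target around into an allowed-downtime budget. This is a minimal sketch assuming a non-leap 365-day year; the variable names are illustrative.

```python
# Allowed downtime per (non-leap) year for common availability targets.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

for target in (99.0, 99.9, 99.999):
    downtime_min = MINUTES_PER_YEAR * (100 - target) / 100
    hours, minutes = divmod(downtime_min, 60)
    print(f"{target}% availability -> {int(hours)} h {minutes:.1f} min downtime/year")
```

For 99.9% this yields 8 h 45.6 min per year, and for 99% about 87.6 hours, matching the figures in the text.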
How to achieve high availability
A highly available system should be able to quickly recover from any sort of failure state to minimize interruptions for the end user. Best practices for achieving high availability include:
- Eliminate single points of failure, meaning any component whose failure would take down the system as a whole.
- Ensure that all systems and data are backed up for simple recovery.
- Use load balancing to distribute application and network traffic across servers or other hardware. A popular example of a load balancer is HAProxy.
- Continuously monitor the health of backend servers.
- Distribute resources geographically in case of power outages or natural disasters.
- Implement reliable crossover or failover. For storage, a redundant array of independent disks (RAID) or a storage area network (SAN) is a common approach.
- Set up a system that detects failures as soon as they occur.
- Design system parts for high availability and test their functionality before implementation.
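Several of the practices above (health monitoring, failure detection and failover) can be sketched together in a few lines. This is an illustrative stand-in, not how a production load balancer such as HAProxy is implemented; the `check_health` callable and the hostnames are hypothetical.

```python
from typing import Callable

def pick_backend(backends: list[str], check_health: Callable[[str], bool]) -> str:
    """Probe backends in priority order and route to the first healthy one."""
    for backend in backends:
        if check_health(backend):
            return backend
    raise RuntimeError("no healthy backend available")

# Example: the primary is down, so traffic fails over to the standby.
status = {"primary.example.com": False, "standby.example.com": True}
print(pick_backend(["primary.example.com", "standby.example.com"], status.get))
# → standby.example.com
```

A real health check would issue a network probe (for example, an HTTP request with a timeout) rather than a dictionary lookup, but the failover logic is the same.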
The role of backup and recovery in HA
Backups and failover processes are crucial to accomplishing high availability. This is because many computer systems and networks consist of individual hardware and software components that must all be fully operational for the entire system to be available.
Backup components should be built into the infrastructure of the system. For example, if a server fails, an organization should be able to switch to a backup server. To obtain redundancy in a component, IT organizations can follow an N+1, N+2, 2N or 2N+1 strategy, where N is the number of components needed to carry the full load. These strategies ensure that mission-critical software and hardware have at least one spare component available as a backup.
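The component counts implied by these strategies are easy to tabulate. This is a small illustrative sketch, with N taken as the number of components needed to carry the full load; the function name is made up for the example.

```python
def provisioned_components(n: int, strategy: str) -> int:
    """Total components to provision under a given redundancy strategy."""
    sizes = {"N+1": n + 1, "N+2": n + 2, "2N": 2 * n, "2N+1": 2 * n + 1}
    return sizes[strategy]

# With N = 4 servers needed to carry the load, each strategy provisions:
for s in ("N+1", "N+2", "2N", "2N+1"):
    print(s, provisioned_components(4, s))  # 5, 6, 8 and 9 servers respectively
```

N+1 tolerates a single failure at minimum cost, while 2N and 2N+1 duplicate the entire capacity, trading cost for resilience.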
Maintaining data backups helps preserve high availability in the case of data loss, corruption or storage failure. A data center should be able to recover quickly from data loss for any reason. An IT organization should also put automatic disaster recovery measures in place, such as hosting data backups on redundant servers, for data resilience.
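The idea of hosting backups redundantly can be sketched as copying each backup to several independent locations, so no single storage failure loses the data. This is an illustrative sketch only; the paths and the `replicate_backup` helper are hypothetical, not a specific backup product's API.

```python
import shutil
import tempfile
from pathlib import Path

def replicate_backup(source: Path, replicas: list[Path]) -> None:
    """Copy a backup file to every replica location, creating directories as needed."""
    for replica in replicas:
        replica.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, replica)

# Demo against a temporary directory standing in for two redundant sites.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "db-backup.sql"
    src.write_text("-- backup contents --")
    targets = [Path(tmp) / "site-a" / "db-backup.sql",
               Path(tmp) / "site-b" / "db-backup.sql"]
    replicate_backup(src, targets)
    print(all(t.read_text() == src.read_text() for t in targets))  # → True
```

In practice the replicas would live on physically separate servers or regions, which also supports the geographic distribution recommended above.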