Managing a Changing Information Environment While Achieving High Availability and Fault Tolerance

The swiftly changing needs of an agile enterprise demand equally rapid changes to its information infrastructure.  Small businesses and start-ups especially must meet the technology requirements of new business opportunities more quickly than their larger competitors.  Yet these organizations also face increased risk from infrastructure shifts:  Any service interruption could be disastrous to an underdog's reputation.  A system administrator's purpose is to provide end users with the services they need, while assuring the integrity, confidentiality, and availability of those services.  This essay will address strategies for achieving a high degree of availability in a changing environment while mitigating the associated risks.

The goals of high availability and fault tolerance must be incorporated into the initial design of any system that hopes to achieve them.  Integral to such a design is the selection of appropriate protocols and standards.  DNS and HTTP are two examples of stateless application-layer protocols.  They follow a simple request/response cycle, so a request lost to intermittent connectivity can simply be sent again.  This robustness is contributing to the increasing popularity of web service protocols such as SOAP, JSON-RPC, and XML-RPC, which are typically carried over HTTP.
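The tolerance of a request/response cycle for intermittent connectivity can be illustrated with a minimal retry sketch.  The function name call_with_retry and its parameters are hypothetical, and the sketch assumes the request is idempotent, so that repeating it has the same effect as sending it once.

```python
import time


def call_with_retry(request, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call request() until it succeeds or the attempts run out.

    Assumption: request is stateless and idempotent, so a repeated call
    is harmless.  Network failures are expected to surface as OSError
    (ConnectionError and TimeoutError are both subclasses of it).
    """
    for attempt in range(attempts):
        try:
            return request()
        except OSError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
```

A stateful protocol could not be handled this way, because the client would first have to rebuild whatever state the server had lost along with the connection.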

Protection against hardware failure may be achieved for web applications and other stateless services through conventional clustering solutions.  For LAMP stacks and other open-source solutions, Red Hat's clustering solution and the Linux-HA project are two common implementations.  When combined with load balancing software such as LVS, a Storage Area Network, and a relational database engine that replicates data synchronously across nodes, such as MySQL Cluster, these technologies provide a stable and flexible platform insulated from failures of physical hardware.  Incoming requests may be handled by a front-end load balancing appliance and forwarded to back-end web servers.  If the web application servers share session data, then a high degree of fault tolerance is also provided, as any HTTP request made to a failed node may be retried and served by another machine in the cluster.
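The retry behavior described above can be sketched in a few lines.  This is an illustration only, not how LVS or any particular appliance works:  the class name FailoverBalancer is hypothetical, each back-end is modeled as a plain callable, and a failed node is assumed to raise ConnectionError.

```python
from itertools import cycle


class FailoverBalancer:
    """Round-robin dispatcher that retries a request on the next back-end."""

    def __init__(self, backends):
        self._backends = list(backends)
        self._rotation = cycle(self._backends)

    def dispatch(self, request):
        last_error = None
        # Try each back-end at most once per request.
        for _ in range(len(self._backends)):
            backend = next(self._rotation)
            try:
                return backend(request)
            except ConnectionError as exc:
                last_error = exc  # node is down: fall through to the next one
        raise RuntimeError("no back-end available") from last_error
```

The retry is only safe because the back-ends are assumed to share session data; otherwise the second node would not recognize the client.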

Unfortunately, not all services are carried by such robust stateless protocols.  File sharing and database access are common examples of services that depend on stateful connections.  Many implementations of the CIFS protocol (Windows file sharing) are particularly vulnerable to dropped TCP connections, sometimes requiring a reboot of the client machine to recover from an indeterminate state.  In my experience, the most effective method of protecting these vulnerable connections against faults is virtualization.  Next-generation virtualization solutions such as Citrix XenServer and VMware ESX are able to mirror a running virtual machine across multiple physical machines.  This duplication extends to all levels of the network stack above the physical layer, and insulates the virtual machine's integrity against the failure of either host system.

All of the strategies discussed thus far mitigate the risk of sudden, unexpected failures in application platforms.  However, these platforms must also evolve to meet the needs of the business.  Indeed, many swiftly growing organizations may face more scheduled downtime caused by migrations than unscheduled interruptions due to unplanned failures.  Such downtime may be reduced by building redundancy and abstractions into a system's network stack.  An example migration strategy would be forwarding client requests from a legacy system to its replacement, perhaps using network address translation or an application-level proxy service.  Transparently redirecting connections at a low level avoids the unpredictable delays caused by DNS hostname or BGP route propagation.
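An application-level proxy of the kind mentioned above can be sketched with Python's asyncio library.  This is a minimal illustration under stated assumptions, not a production forwarder:  the function names start_forwarder and _pipe are hypothetical, and the addresses are whatever the legacy and replacement systems happen to use.

```python
import asyncio


async def _pipe(reader, writer):
    """Copy bytes from reader to writer until EOF, then close the writer."""
    try:
        while data := await reader.read(65536):
            writer.write(data)
            await writer.drain()
    finally:
        writer.close()


async def start_forwarder(listen_host, listen_port, target_host, target_port):
    """Listen on the legacy address and relay each connection to the replacement."""

    async def handle(client_reader, client_writer):
        try:
            target_reader, target_writer = await asyncio.open_connection(
                target_host, target_port
            )
        except OSError:
            client_writer.close()  # replacement unreachable: drop the client
            return
        # Relay both directions at once; errors on either side end the relay.
        await asyncio.gather(
            _pipe(client_reader, target_writer),
            _pipe(target_reader, client_writer),
            return_exceptions=True,
        )

    return await asyncio.start_server(handle, listen_host, listen_port)
```

Clients keep connecting to the address they already know, so the cutover takes effect as soon as the forwarder starts listening, with no propagation delay.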

Every service architecture should be further reinforced by other, internal services.  Any information relied upon by the service must be archived frequently.  These backups must reside on separate physical media to negate the risk of hardware failure, and on separate logical volumes to mitigate the risks of accidental deletion or malicious access.  Also, service availability must be monitored for interruptions.  Network monitoring solutions alert administrators to unexpected changes in system behavior, and provide meaningful performance metrics.  Larger organizations should also develop disaster recovery policies to ensure business continuity during severe service disruptions.  These strategies must be tested frequently, and updated to reflect any infrastructure changes.  Any service architecture which lacks backup, monitoring, and disaster recovery solutions is not worthy of the title “enterprise class”.
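The monitoring requirement above can be sketched as a small polling loop.  The class name ServiceMonitor, the probe and alert callables, and the threshold of three consecutive failures are all assumptions made for illustration; a real deployment would more likely rely on an established monitoring package.

```python
class ServiceMonitor:
    """Alert once after several consecutive failed probes, and once on recovery."""

    def __init__(self, probe, alert, failure_threshold=3):
        self.probe = probe      # callable returning True when the service is healthy
        self.alert = alert      # callable that receives a short status message
        self.failure_threshold = failure_threshold
        self.failures = 0
        self.alerted = False

    def poll(self):
        try:
            healthy = bool(self.probe())
        except Exception:
            healthy = False     # a probe that raises counts as a failure
        if healthy:
            if self.alerted:
                self.alert("recovered")
            self.failures = 0
            self.alerted = False
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold and not self.alerted:
                self.alert("down")
                self.alerted = True
        return healthy
```

Requiring several consecutive failures before alerting is a design choice that trades a slower alarm for fewer false alarms from a single dropped probe.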

Many systems, like power grids, bridges, and other physical infrastructure, are engineered to last forever.  Information technology systems stand in stark contrast, rarely surviving longer than a decade before being rendered obsolete by new techniques and changing needs.  Yet migrations away from existing systems need not be traumatic, so long as those original systems were designed with flexible layers of abstraction.  These abstractions also provide opportunities for redundancy, enabling the high availability and fault tolerance crucial to an agile enterprise.