High Availability Computer Systems

Jim Gray
Digital Equipment Corporation
455 Market St., 7th Floor
San Francisco, CA 94105

Daniel P. Siewiorek
Department of Electrical Engineering
Carnegie Mellon University
Pittsburgh, PA 15213
Abstract: The key concepts and techniques used to build high availability computer systems are (1) modularity, (2) fail-fast modules, (3) independent failure modes, (4) redundancy, and (5) repair. These ideas apply to hardware, to design, and to software. They also apply to tolerating operations faults and environmental faults. This article explains these ideas and assesses high-availability system trends.

Overview

It is paradoxical that the larger a system is, the more critical its availability becomes, and the more difficult it is to make it highly available. It is possible to build small ultra-available modules, but building large systems involving thousands of modules and millions of lines of code is still an art. These large systems are a core technology of modern society, yet their availability is still poorly understood. This article sketches the techniques used to build highly available computer systems. It points out that three decades ago, hardware components were the major source of faults and outages. Today, hardware faults are a minor source of system outages when compared to operations, environment, and software faults. Techniques and designs that tolerate this broader class of faults are in their infancy.

A Historical Perspective

Computers built in the late 1950s offered a twelve-hour mean time to failure. A maintenance staff of a dozen full-time customer engineers could repair the machine in about eight hours. This failure-repair cycle provided 60% availability: on average, twelve hours of uptime in each twenty-hour fail-and-repair cycle. The vacuum tube and relay components of these computers were the major source of failures; they had lifetimes of a few months. Therefore, the machines rarely operated for more than a day without interruption [1].

Many fault detection and fault masking techniques used today were first used on these early computers. Diagnostics tested the machine. Self-checking computational techniques detected faults while the computation progressed. The program occasionally saved (checkpointed) its state on stable media.
After a failure, the program read the most recent checkpoint and continued the computation from that point. This checkpoint/restart technique allowed long-running computations to be performed by machines that failed every few hours.
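To make the idea concrete, here is a minimal sketch of checkpoint/restart in C. It is not the scheme used by any particular early machine; the state structure, the checkpoint file name, and the work loop are hypothetical stand-ins for a long-running computation. The program periodically writes its state to a file on stable media and, after a failure, resumes from the most recent saved state.

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical state for a long-running computation: a step counter
   and a partial result. */
struct state {
    long   next_step;
    double partial_result;
};

/* Save the state on stable media (here, the file "checkpoint.dat"). */
static void checkpoint(const struct state *s)
{
    FILE *f = fopen("checkpoint.tmp", "wb");
    if (f == NULL || fwrite(s, sizeof *s, 1, f) != 1) {
        perror("checkpoint");
        exit(1);
    }
    fclose(f);
    /* Replace the old checkpoint atomically, so a crash during the
       write cannot destroy the previous good copy. */
    rename("checkpoint.tmp", "checkpoint.dat");
}

/* After a failure, read the most recent checkpoint; if none exists,
   start from the beginning. */
static void restart(struct state *s)
{
    FILE *f = fopen("checkpoint.dat", "rb");
    if (f != NULL && fread(s, sizeof *s, 1, f) == 1) {
        fclose(f);
        return;                      /* resume from the saved point */
    }
    if (f != NULL)
        fclose(f);
    s->next_step = 0;
    s->partial_result = 0.0;
}

int main(void)
{
    struct state s;
    restart(&s);
    for (; s.next_step < 1000000; s.next_step++) {
        if (s.next_step % 10000 == 0)
            checkpoint(&s);          /* save progress every 10,000 steps */
        s.partial_result += 1.0 / (s.next_step + 1);   /* the "real" work */
    }
    printf("result = %f\n", s.partial_result);
    return 0;
}

A production version would also force the checkpoint to stable media (for example with fsync) before renaming it, and would save every piece of state needed to reproduce the computation exactly; the sketch only shows the save/restore cycle.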
Improvements in device technology have raised computer system availability. By 1980, typical well-run computer systems offered 99% availability [2]. This sounds good, but 99% availability is about 100 minutes of downtime per week. Such outages may be acceptable for commercial back-office computer systems that process work in asynchronous batches for later reporting. Mission-critical and online applications cannot tolerate 100 minutes of downtime per week. They require high-availability systems: ones that deliver 99.999% availability, which allows at most five minutes of service interruption per year.

Process control, production control, and transaction processing applications are the principal consumers of this new class of high-availability systems. Telephone networks, airports, hospitals, factories, and stock exchanges cannot afford to stop because of a computer outage. In these applications, outages translate