Fault tolerance

Fault tolerance is the ability of a system to maintain proper operation despite failures or faults in one or more of its components. In a fault-tolerant system, any decrease in operating quality is proportional to the severity of the failure, whereas in a naively designed system even a small failure can lead to total breakdown. Fault tolerance is particularly sought after in high-availability, mission-critical, and life-critical systems. The ability to maintain functionality when portions of a system break down is referred to as graceful degradation.[1]

A fault-tolerant design enables a system to continue its intended operation, possibly at a reduced level, rather than failing completely when some part of the system fails.[2] The term is most commonly used to describe computer systems designed to continue operation in the event of a partial failure, with perhaps a reduction in throughput or an increase in response time. That is, the system as a whole is not stopped due to problems either in the hardware or the software. Non-computing examples include a motor vehicle designed to remain drivable if one of the tires is punctured, or a structure that retains its integrity despite damage caused by fatigue, corrosion, manufacturing flaws, or impact.
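
As a rough illustration of operation "at a reduced level", the following Python sketch serves a lower-quality answer when one component fails instead of failing the whole request. The function names (recommend, working_personalizer, broken_personalizer) and the recommendation scenario are hypothetical, chosen only for this example; they are not taken from any particular system.

```python
from typing import Callable, Dict, List


def recommend(user_id: int,
              personalized: Callable[[int], List[str]],
              popular_items: List[str]) -> Dict[str, object]:
    """Return recommendations, degrading to a generic list on failure.

    Hypothetical example: `personalized` stands in for a component
    that may fail; `popular_items` is the reduced-quality fallback.
    """
    try:
        return {"items": personalized(user_id), "degraded": False}
    except Exception:
        # The component failed: serve a less specific but still useful
        # answer rather than failing the request as a whole.
        return {"items": list(popular_items), "degraded": True}


def working_personalizer(user_id: int) -> List[str]:
    """Stand-in for a healthy component."""
    return [f"item-for-{user_id}"]


def broken_personalizer(user_id: int) -> List[str]:
    """Stand-in for a failed component."""
    raise ConnectionError("personalization backend unreachable")
```

Calling recommend with broken_personalizer still returns a result, flagged as degraded, so the caller can decide how to present the reduced service. Catching every exception is a simplification; a real design would likely distinguish faults it can tolerate from those it cannot.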

For an individual system, fault tolerance can be achieved by anticipating exceptional conditions and building the system to cope with them, and by aiming for self-stabilization so that the system converges towards an error-free state. However, if the consequences of a system failure are catastrophic, or the cost of making the system sufficiently reliable is very high, a better solution may be to use some form of duplication. In any case, if the consequences of a system failure are catastrophic, the system must be able to use reversion to fall back to a safe mode. Reversion is similar to roll-back recovery, but it can be a human action if humans are in the loop.
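
One common form of duplication is to run several replicas of a component and accept the majority answer, reverting to a safe value when no majority exists. The Python sketch below illustrates that idea under simplifying assumptions: replicas are modelled as zero-argument callables, a crashed replica raises an exception, and the names vote and safe_value are hypothetical, introduced only for this example.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def vote(replicas: Sequence[Callable[[], Hashable]],
         safe_value: Hashable) -> Hashable:
    """Return the value produced by a strict majority of all replicas.

    If no value reaches a strict majority (because replicas crashed or
    disagreed), revert to `safe_value`, the assumed safe-mode output.
    """
    results = []
    for replica in replicas:
        try:
            results.append(replica())
        except Exception:
            # A crashed replica contributes no vote.
            continue

    if results:
        value, count = Counter(results).most_common(1)[0]
        # Require more than half of all replicas, not just of survivors.
        if count > len(replicas) // 2:
            return value
    return safe_value
```

With three replicas, this tolerates one crashed or faulty replica: the other two still form a majority. If two replicas fail, the function returns the safe value rather than trusting a single unconfirmed result.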

  1. González, Oscar; et al. (1997). "Adaptive Fault Tolerance and Graceful Degradation". University of Massachusetts - Amherst.
  2. Johnson, B. W. (1984). "Fault-Tolerant Microprocessor-Based Systems". IEEE Micro, vol. 4, no. 6, pp. 6–21.
