Fault Tolerance for Real-Time Systems: Analysis and Optimization of Roll-back Recovery with Checkpointing

This is a doctoral thesis from Lund University

Abstract: Popular Abstract in English Today, we are surrounded by many electronic devices that offer previously unseen performance. These devices constitute a large part of everyday consumer electronics, such as laptops, tablets, and smart phones, but are also used in a wide variety of domains such as the automotive industry, avionics, and medicine. The constant demand for high performance has resulted in a rapid development of semiconductor technologies. Technology scaling has pushed the boundaries, enabling the fabrication of miniature devices. With such miniature devices, it is possible to integrate an entire system on a single chip, commonly referred to as a System-on-Chip. For example, in a recent smart phone, such a chip occupies less than one square centimeter, and within this area it contains over one billion transistors (the fundamental building blocks of modern electronic devices). This example shows that the gains of technology scaling are enormous. However, they come at a cost: devices manufactured in the latest technologies may be affected by errors that cause malfunction. Lately, soft errors have been named one of the most serious threats to computer systems designed in the latest semiconductor technologies. Soft errors occur as a result of a particular type of faults, known as transient faults. Transient faults have a limited lifetime: they occur, remain present for a short time, and then disappear. Despite their short duration, however, these faults often result in soft errors that may lead to a system failure. Therefore, soft errors have a significant impact on the reliability of computer systems manufactured in the latest semiconductor technologies. 
While in the past soft errors were a threat only for devices operating in harsh environments, such as nuclear plants, where high levels of radiation exist, or avionics, where the cosmic radiation is stronger at high altitudes, nowadays soft errors are a threat for all devices irrespective of the operational environment. The reason for this is that the major source of soft errors is the radiation of alpha particles emitted from the package of the device itself. In the digital world, where everything revolves around bits (a binary digit ``1'' or ``0''), a transient fault can be interpreted as an outside force that flips a bit from ``1'' to ``0'', or vice versa, and the flipped bit is the manifestation of a soft error. If a soft error occurs in your smart phone, tablet, or laptop, you can easily handle it by restarting the device. However, what happens if such an error occurs in a vital component of an airplane or a nuclear reactor? Can we simply restart? The field of research that tries to answer these questions is called Fault Tolerance. As the name suggests, fault tolerance enables correct operation of a device even in the presence of faults (errors). As a research topic, fault tolerance was established along with the rise of the very first devices used in safety-critical applications such as avionics. Lately, the popularity of fault tolerance has increased, especially as the manufacturing process has moved down to deep sub-micron semiconductor technologies, where the size of the transistors has shrunk substantially and their operation has become more susceptible to soft errors. To enable correct operation in the presence of errors, fault tolerance provides techniques that are capable of error detection, i.e. detecting the presence of errors, and error recovery, i.e. recovering the system from errors. Usually, this is achieved by introducing hardware and time redundancy. 
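The bit-flip view of a transient fault is easy to make concrete. The sketch below (the word width, bit position, and function name are illustrative assumptions, not anything from the thesis) models a soft error as an XOR with a single-bit mask:

```python
def flip_bit(word: int, position: int) -> int:
    """Model a soft error: invert the bit at the given position."""
    return word ^ (1 << position)

value = 0b10100001           # original (correct) data word
faulty = flip_bit(value, 3)  # a transient fault strikes bit 3

print(f"{value:08b} -> {faulty:08b}")  # 10100001 -> 10101001
```

Note that applying the same flip twice restores the original value, since XOR is its own inverse.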
Hardware redundancy techniques cope with errors by designing devices or computer systems so that multiple copies (replicas) of the physical building blocks are used. Every block performs a given operation and produces an output based on its inputs. If two identical copies process the same inputs, they are expected to produce the same outputs. While these techniques ensure correct operation in the presence of errors, their main drawback is that they are rather expensive: the cost of a device that contains multiple replicas is higher. In contrast to the expensive hardware redundancy techniques, time redundancy techniques cope with errors by repeating the same operation on the given hardware resources. The correct output is obtained by performing the same operation at least twice. This increases the time required to obtain the final outcome, and therefore results in a much higher time overhead. Roll-back Recovery with Checkpointing (RRC) is a well-known fault tolerance technique that efficiently copes with soft errors. Unlike traditional time redundancy techniques, where upon error detection the program is restarted from the beginning, RRC stores checkpoints (intermediate states of the execution of the program), and when errors are detected, it forces the program to roll back to the latest stored checkpoint. The advantage of this technique over other fault tolerance techniques is that it does not require a substantial amount of hardware redundancy. However, the major drawback of RRC is that it introduces a time overhead that depends on the number of checkpoints used, and this overhead may have a negative impact on the computer system where RRC is employed. In general, computer systems are classified into real-time systems (RTSs) and non-RTSs, depending on whether they are required to meet time constraints. 
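The roll-back mechanism can be sketched in a few lines of simulation code. This is a hedged illustration, not the thesis' actual model: the equal-segment structure, the per-segment error probability, and the use of segment executions as a proxy for time are all assumptions made for the example:

```python
import random

def run_with_rrc(n_segments: int, p_error: float, seed: int = 0) -> int:
    """Execute a job split into n_segments checkpointed segments.

    After each segment an error check runs; on a detected error the
    program rolls back and re-executes only that segment from the
    latest checkpoint.  Returns the total number of segment
    executions, a proxy for execution time.
    """
    rng = random.Random(seed)
    executions = 0
    segment = 0
    while segment < n_segments:
        executions += 1
        if rng.random() < p_error:   # soft error detected in this segment
            continue                 # roll back: re-run the same segment
        segment += 1                 # checkpoint taken, move on
    return executions
```

With `p_error = 0` the job takes exactly `n_segments` executions; with faults, only the affected segment is repeated, rather than the whole program as in plain re-execution.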
For RTSs, correct operation is defined as producing the correct output while satisfying a given time constraint (deadline). Depending on the consequences when deadlines are violated, RTSs are divided into soft and hard RTSs. For soft RTSs, the consequences are not very severe. One example of a soft RTS is a mobile phone, where an occasional deadline violation results in a dropped call. On the other hand, violating the deadlines in a hard RTS usually results in catastrophic consequences. An example of a hard RTS is the braking control in a vehicle. RTSs are also affected by soft errors, and therefore there is a need to employ fault tolerance in RTSs as well. However, special consideration is needed when employing fault tolerance in RTSs, because the time overhead that fault tolerance usually introduces may result in a missed deadline. To mitigate this effect, it is important to optimize the usage of fault tolerance in RTSs. The optimization objectives differ between soft and hard RTSs. For soft RTSs, where an occasional deadline violation results in some performance degradation, it is more important to minimize the average execution time (the average time needed for the operation to complete), while for hard RTSs, where it is crucial to meet the deadlines, it is more important to maximize the probability that the deadlines are met. During the early design stage of an RTS, the designer receives a specification of the RTS that is to be implemented. During this stage, the designer needs to explore different fault tolerance techniques and choose the one that satisfies the given specification requirements. To assist the designer in the decision-making process, in this thesis we provide an optimization framework for RRC when used in RTSs. 
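Minimizing the average execution time exposes a classic trade-off: more checkpoints mean less work lost per roll-back but more checkpointing overhead. A toy model illustrates this; the equal segments, fixed checkpoint cost, and Poisson fault arrivals are illustrative assumptions of this sketch, not the thesis' actual formulation:

```python
import math

def aet(n: int, T: float, c: float, lam: float) -> float:
    """Toy average-execution-time model for RRC (illustrative assumptions).

    A job of total length T is split into n equal segments, each
    checkpoint costs c time units, faults arrive as a Poisson process
    with rate lam, and a detected error re-executes only the current
    segment.  The expected number of attempts per segment is then
    1 / P(segment is fault-free).
    """
    seg = T / n + c                  # segment length plus checkpoint overhead
    p_ok = math.exp(-lam * seg)      # probability a segment is fault-free
    return n * seg / p_ok            # n segments, geometric number of attempts

# Sweeping n shows the trade-off: too few checkpoints waste work on
# roll-backs, too many pay excessive checkpointing overhead.
best_n = min(range(1, 21), key=lambda n: aet(n, T=100.0, c=2.0, lam=0.02))
```

In this toy model the minimum lies strictly between the extremes, which is exactly the kind of balance point an optimization framework is meant to locate.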
By using this framework, the designer of an RTS can first decide whether RRC is a suitable fault tolerance technique for the RTS that is to be implemented, and then, if RRC is applicable, learn how many checkpoints need to be used and how these checkpoints should be distributed. The proposed optimization framework considers multiple optimization objectives that are important for RTSs. In particular, for soft RTSs the framework considers optimization of RRC with respect to the average execution time (AET). For hard RTSs, the framework considers optimization of RRC with the goal of maximizing the Level of Confidence (LoC), i.e. the probability that the deadlines are met. Since a specification of an RTS that is to be implemented may include reliability requirements, in this thesis we introduce the concept of Guaranteed Completion Time, i.e. a completion time that satisfies a given reliability (LoC) constraint. The Guaranteed Completion Time varies with the number of checkpoints used in RRC. Therefore, the optimization framework also considers optimization of RRC with respect to the Guaranteed Completion Time.
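The Level of Confidence can likewise be illustrated with a hedged toy model. Assuming n equal checkpointed segments, independent attempts that each succeed with probability q, and roll-backs that re-execute exactly one segment (all assumptions of this sketch, not the thesis' model), the number of extra attempts follows a negative binomial distribution, and the LoC for a deadline D is the probability that all re-executions still fit before D:

```python
import math

def level_of_confidence(n: int, T: float, c: float, q: float, D: float) -> float:
    """Toy LoC model: probability that an RRC-protected job meets deadline D.

    Each segment takes T/n + c time (work plus checkpoint), each attempt
    succeeds independently with probability q, and k extra attempts delay
    completion to (n + k) * (T/n + c).  The number of extra attempts k
    before the n-th success is negative-binomially distributed.
    """
    seg = T / n + c
    k_max = int(D // seg) - n          # re-executions that still fit before D
    if k_max < 0:
        return 0.0
    return sum(math.comb(n + k - 1, k) * q**n * (1 - q)**k
               for k in range(k_max + 1))
```

Under these assumptions, the Guaranteed Completion Time for a target LoC can then be found by searching for the smallest deadline D at which this probability reaches the required level.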