Optimizing Fault Tolerance for Real-Time Systems

This is a thesis from Linköping: Linköping University Electronic Press

Abstract: For the vast majority of computer systems, correct operation is defined as producing the correct result within a time constraint (deadline). We refer to such computer systems as real-time systems (RTSs). RTSs manufactured in recent semiconductor technologies are increasingly susceptible to soft errors, which necessitates the use of fault tolerance to detect and recover from errors that may occur. However, fault tolerance usually introduces a time overhead, which may cause an RTS to violate its time constraints. Depending on the consequences of violating the deadlines, RTSs are divided into hard RTSs, where the consequences are severe, and soft RTSs, where they are not. Traditionally, worst-case execution time (WCET) analyses are used for hard RTSs to ensure that the deadlines are not violated, and average execution time (AET) analyses are used for soft RTSs. However, at design time the designer of an RTS faces the challenging task of deciding whether the system should be a hard or a soft RTS. In such a case, focusing only on WCET analyses may result in an over-designed system, while focusing only on AET analyses may result in a system that allows occasional deadline violations. To overcome this problem, we introduce Level of Confidence (LoC) as a metric to evaluate to what extent a deadline is met in the presence of soft errors. The advantage is that the same metric can be used for both soft and hard RTSs, so a system designer can precisely specify to what extent a deadline is to be met. In this thesis, we address the optimization of Roll-back Recovery with Checkpointing (RRC), a good representative of fault-tolerance techniques because it enables detection of and recovery from soft errors at the cost of a time overhead that affects the execution time of tasks. The time overhead depends on the number of checkpoints that are used.
Therefore, we provide mathematical expressions for finding the optimal number of checkpoints, that is, the number that leads to 1) minimal AET and 2) maximal LoC. To obtain these expressions, we assume that the error probability is given. In practice, however, the error probability is not known in advance and can even vary during runtime. We therefore propose two error probability estimation techniques, Periodic Probability Estimation and Aperiodic Probability Estimation, which estimate the error probability during runtime and adjust the RRC scheme with the goal of reducing the AET. By conducting experiments, we show that both techniques provide near-optimal performance of RRC.
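
To make the trade-off behind the number of checkpoints concrete, the following is a minimal sketch under a deliberately simplified model. It is not the set of expressions derived in the thesis. The assumptions are: a job with fault-free processing time `T` is split into `n` equal segments, each segment is followed by a checkpoint that costs `tau`, a segment attempt succeeds with probability `(1 - p_error) ** (1 / n)` (where `p_error` is the probability that at least one error occurs during `T` time units of execution), and a failed segment is simply re-executed. All function and parameter names are hypothetical.

```python
from math import comb, floor


def segment_success(p_error, n):
    """Probability that one segment attempt is error-free (assumed model)."""
    return (1.0 - p_error) ** (1.0 / n)


def average_execution_time(T, tau, p_error, n):
    """AET with n checkpoints: each segment needs 1/q attempts on average."""
    q = segment_success(p_error, n)
    return (T + n * tau) / q


def level_of_confidence(T, tau, p_error, n, deadline):
    """Probability that the job completes no later than the deadline.

    With k re-executions the job takes (n + k) * (T / n + tau) time units,
    and k follows a negative binomial distribution (k failures before the
    n-th success).
    """
    q = segment_success(p_error, n)
    slot = T / n + tau
    # Largest number of re-executions that still fits before the deadline;
    # the small epsilon guards against floating-point round-off.
    max_reexec = floor(deadline / slot + 1e-9) - n
    if max_reexec < 0:
        return 0.0
    return sum(
        comb(n + k - 1, k) * q**n * (1.0 - q) ** k
        for k in range(max_reexec + 1)
    )


def best_checkpoint_count(objective, n_max, maximize=False):
    """Exhaustively search n = 1..n_max for the best objective value."""
    candidates = range(1, n_max + 1)
    return max(candidates, key=objective) if maximize else min(candidates, key=objective)


if __name__ == "__main__":
    T, tau, p_error, deadline = 100.0, 1.0, 0.5, 150.0
    n_aet = best_checkpoint_count(
        lambda n: average_execution_time(T, tau, p_error, n), 50
    )
    n_loc = best_checkpoint_count(
        lambda n: level_of_confidence(T, tau, p_error, n, deadline), 50, maximize=True
    )
    print("n minimizing AET:", n_aet)
    print("n maximizing LoC:", n_loc)
```

Even in this toy model the two objectives need not agree on the same number of checkpoints: too few checkpoints make each re-execution expensive, while too many inflate the checkpointing overhead. An exhaustive search is used here only for clarity. The thesis instead provides closed-form mathematical expressions for the optimum.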
