Abstract

Software Fault Tolerance of Concurrent Programs Using Controlled Re-execution

Ashis Tarafdar and Vijay K. Garg.

Abstract

Concurrent programs often encounter failures, such as races, owing to the presence of synchronization faults (bugs). The growing prevalence of concurrent programs and the need to maintain availability in spite of failures make the problem of tolerating synchronization faults increasingly important. One existing technique to tolerate synchronization faults is to roll back the program to a previous state and re-execute. Since synchronization failures are usually transient in nature, they may not occur in the re-execution. Instead of relying on chance, our approach is to control the re-execution in an attempt to avoid a recurrence of the synchronization failure. The control is achieved by tracing information during an execution and using this information to add synchronizations during the re-execution.

This approach gives rise to a general problem, called the predicate control problem, which takes a computation and a property specified on the computation, and outputs a controlled computation that maintains the property. We solve the predicate control problem for the mutual exclusion property, which is important in synchronization fault tolerance. Our solutions include simple mutual exclusion, as well as read-write mutual exclusion, independent mutual exclusion, and independent read-write mutual exclusion. We demonstrate how the techniques can be applied to tolerate synchronization faults. Furthermore, the techniques have other application domains such as concurrent debuggers and process control systems.