Publications in Fault-Tolerance

All the work in predicate detection has applications to software fault-tolerance. This page describes work that is not in the area of predicate detection.

Controlled Reexecution

Controlled reexecution is a technique to bypass synchronization faults automatically. The program is executed in a lookahead mode. The resulting execution is analyzed for synchronization bugs. If it is found that some invariant is violated, the execution is analyzed to see if the invariant can be maintained by additional sychronization. The problem of determining whether a proper synchronization exists to maintain an arbitrary invariant is NP-complete. However, there are efficient solutions for many subclasses. Disjunctive predicates and mutual exclusion predicates are explored in

A. Tarafdar, V. K. Garg,Software fault-tolerance of concurrent programs using controlled reexecution,DISC'99 Bratislava, Slovakia, September 27-29, 1999, pp. 210 - 224.Abstract ...........Postscript

Optimal synchronization for disjunctive predicates and region predicates is explored in

Neeraj Mittal and Vijay K. Garg, Finding missing synchronization in a distributed computation using controlled re-execution, Distributed Computing, 17(2):107-130, 2004.
[ bib | http ]

Message Logging

Message logging is used to recover from faults in a distributed program. There are three main approaches to message logging --- optimistic, pessimistic and causal. The optimistic approach was pioneered by Strom and Yemini. Their approach, however, may result in exponential number of rollbacks on a single failure. We have developed a new algorithm for recovering asynchronously from failures in a distributed computation. The algorithm is completely asynchronous and can handle multiple failures and network partitioning. Further, it does not assume any message ordering, causes the minimum amount of rollback and restores the maximum recoverable state with low overhead. This is the most efficient algorithm known so far for asynchronous recovery in distributed systems with the optimistic approach.

O. Damani, V. K. Garg,``How to Recover Efficiently and Asynchronously when Optimism Fails,''Proc. IEEE International Conference on Distributed Computing Systems,Hong Kong, May 1996, pp. 108 - 115. Abstract ........... Tech Report pdf

A more general version called K-optimistic logging is described in

O. Damani, Yi-Min Wang and V. K. Garg, K-Optimistic Message Logging, Journal on Parallel and Distributed Computing .Volume 63, Issue 12, December 2003, Pages 1193-1218.link to ScienceDirect

When processes are multi-threaded, the optimistic message logging can be extended as shown in

Om P. Damani and Ashis Tarafdar and Vijay K. Garg,,Optimistic Recovery in Multi-threaded Distributed Systems,Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems (SRDS),Lausanne, Switzerland, October 1999, pages 234 - 243.Abstract ...........Postscript

The following paper discusses recovery algorithms for causal message logging.

J. R. Mitchell and V.K. Garg,``A Non-Blocking Recovery Algorithm for Causal Message Logging,''Symposium on Reliable Distributed Systems (SRDS), October 1998.Postscript

Self-stabilizing Systems

Algorithms to maintain spanning trees in a completely connected networks is discussed in

Vijay K. Garg, Anurag Agarwal, Distributed Maintenance of a Spanning Tree Using Labeled Tree Encoding, Euro'Par 2005, pp. 606 - 616. pdf .... slides ...Technical Report .