Publications in Fault-Tolerance
All the work in predicate detection has applications to software fault-tolerance.
This page describes work that is not in the area of predicate detection.
Controlled Reexecution
Controlled reexecution is a technique to bypass synchronization faults automatically.
The program is executed in a lookahead mode. The resulting execution is analyzed
for synchronization bugs. If it is found that some invariant is violated, the execution
is analyzed to see if the invariant can be maintained by additional sychronization.
The problem of determining whether a proper synchronization exists to maintain
an arbitrary invariant is NP-complete. However, there are efficient solutions
for many subclasses. Disjunctive predicates and mutual exclusion predicates
are explored in
A. Tarafdar, V. K. Garg,Software fault-tolerance of concurrent programs using controlled reexecution,DISC'99 Bratislava, Slovakia, September 27-29, 1999, pp. 210 - 224.Abstract ...........Postscript
Optimal synchronization for disjunctive predicates and region predicates
is explored in
Neeraj Mittal and Vijay K. Garg,
Finding missing synchronization in a distributed computation using
controlled re-execution,
Distributed Computing, 17(2):107-130, 2004.
[ bib |
http ]
Message Logging
Message logging is used to recover from faults in a distributed program.
There are three main approaches to message logging --- optimistic, pessimistic
and causal. The optimistic approach was pioneered by Strom and Yemini.
Their approach, however, may result in exponential number of rollbacks on a single failure.
We have developed a new algorithm for recovering asynchronously from failures in a
distributed computation.
The algorithm is completely asynchronous and can
handle multiple failures and network partitioning.
Further, it does not assume any message ordering, causes the minimum amount of
rollback and restores the maximum recoverable state with low overhead.
This is the most efficient
algorithm known so far for asynchronous recovery in distributed systems with
the optimistic approach.
O. Damani, V. K. Garg,``How to Recover Efficiently and Asynchronously when Optimism Fails,''Proc. IEEE International Conference on Distributed Computing Systems,Hong Kong, May 1996, pp. 108 - 115. Abstract ........... Tech Report pdf
A more general version called K-optimistic logging is described in
O. Damani, Yi-Min Wang and V. K. Garg, K-Optimistic Message Logging, Journal on Parallel and Distributed Computing .Volume 63, Issue 12, December 2003, Pages 1193-1218.link to ScienceDirect
When processes are multi-threaded, the optimistic message logging can be extended as
shown in
Om P. Damani and Ashis Tarafdar and Vijay K. Garg,,Optimistic Recovery in Multi-threaded Distributed Systems,Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems (SRDS),Lausanne, Switzerland, October 1999, pages 234 - 243.Abstract ...........Postscript
The following paper discusses recovery algorithms for causal message logging.
J. R. Mitchell and V.K. Garg,``A Non-Blocking Recovery Algorithm for Causal Message Logging,''Symposium on Reliable Distributed Systems (SRDS), October 1998.Postscript
Self-stabilizing Systems
Algorithms to maintain spanning trees in a completely connected networks is discussed
in
Vijay K. Garg, Anurag Agarwal, Distributed Maintenance of a Spanning Tree Using Labeled Tree Encoding, Euro'Par 2005, pp. 606 - 616. pdf .... slides ...Technical Report .