Next: Supervisory Control of Discrete
Up: research
Previous: Observation of Distributed Programs
Some specific accomplishments include:
- Scalable Algorithms for Global Snapshots in Distributed Systems:
Existing algorithms for global snapshots in distributed systems
are not scalable when the underlying topology is complete.
In a network with N processors, these algorithms
require O(N) space and O(N) messages per processor.
As a result, they are inefficient in large systems
in which the logical topology of the communication layer,
such as MPI, is complete.
We have developed three algorithms for global snapshots:
a grid-based, a tree-based, and a centralized algorithm.
The grid-based algorithm uses O(N) space but only O(√N)
messages per processor. The tree-based algorithm requires
only O(1) space and O(N w) messages per processor
where w is the average number of messages in transit
per processor. The centralized algorithm requires only O(1) space
and O(w) messages per processor. We also have a matching lower bound
for this problem.
We have implemented our algorithms on top of the MPI library on
the Blue Gene/L supercomputer. Our experiments confirm that the proposed algorithms
significantly reduce the message and space complexity of
a global snapshot.
(ACM ICS 2006)
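The linear per-processor cost that makes the classical approach unscalable on a complete topology can be seen in a toy, single-machine simulation of a standard marker-based (Chandy-Lamport-style) snapshot. This is a didactic sketch, not one of the three algorithms above, and all names in it are invented; channel-state recording is omitted so that only control messages are counted:

```python
from collections import deque

MARKER = "MARKER"

def chandy_lamport_markers(n):
    """Simulate a marker-based snapshot on a complete topology of n
    processes with FIFO channels.  Every process floods a marker on each
    of its n-1 outgoing channels the first time it sees one, so the cost
    is O(N) markers per processor."""
    recorded = [False] * n      # has process i recorded its local state?
    markers_sent = [0] * n
    net = deque()               # global FIFO of (src, dst, payload)

    def record_and_flood(p):
        recorded[p] = True
        for q in range(n):
            if q != p:
                net.append((p, q, MARKER))
                markers_sent[p] += 1

    record_and_flood(0)         # process 0 initiates the snapshot
    while net:
        src, dst, payload = net.popleft()
        if payload == MARKER and not recorded[dst]:
            record_and_flood(dst)
    return markers_sent
```

With n = 8 every process sends exactly 7 markers, i.e. Θ(N²) control messages overall on a complete topology, which is the cost the algorithms above reduce.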
- Asynchronous Recovery in Distributed Programs:
Developed
a new algorithm for recovering asynchronously from failures in a
distributed computation.
The algorithm is completely asynchronous and can
handle multiple failures and network partitioning.
Further, it does not assume any message ordering, causes the minimum amount of
rollback, and restores the maximum recoverable state with low overhead.
This is the most efficient
algorithm known so far for asynchronous recovery in distributed systems
using the optimistic approach.
(IEEE ICDCS 1996, 1997; JPDC 2003)
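The flavour of optimistic rollback can be sketched as a small fixpoint over dependency information. The following toy (invented names; computed centrally and synchronously, unlike the asynchronous protocol above) marks as orphans all state intervals that transitively depend on intervals lost in a failure, and returns the maximum recoverable state:

```python
def max_recoverable(intervals, deps, lost):
    """intervals: {pid: number of state intervals executed, numbered 1..k}
    deps: {(pid, seq): set of (pid, seq) it received messages from}
    lost: set of intervals destroyed by the failure.
    An interval is an orphan if it transitively depends on a lost
    interval (including via its local predecessor); a process rolls back
    to just before its first orphan.  Returns {pid: highest surviving
    interval} -- the maximum recoverable state."""
    orphan = set(lost)
    changed = True
    while changed:
        changed = False
        for p, k in intervals.items():
            for s in range(1, k + 1):
                iv = (p, s)
                if iv in orphan:
                    continue
                pred = {(p, s - 1)} if s > 1 else set()
                if (deps.get(iv, set()) | pred) & orphan:
                    orphan.add(iv)
                    changed = True
    return {p: max((s for s in range(1, k + 1) if (p, s) not in orphan),
                   default=0)
            for p, k in intervals.items()}
```

For example, if process 1's second interval received a message from process 0's third interval, and process 0 fails losing that third interval, both are rolled back while everything earlier survives.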
- Self-stabilizing Systems:
Developed a new algorithm to maintain a spanning tree in a network based on
Neville's code. This algorithm requires fewer messages on dense networks.
(EuroPar 2005)
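Codes of this kind map a labelled tree to a short sequence of node labels; Neville's codes and the better-known Prüfer sequence are classical examples. As a flavour of such encodings, here is the Prüfer code (shown only for illustration; it is a related encoding, not the Neville code used by the algorithm above):

```python
def prufer_encode(adj):
    """Encode a labelled tree on nodes 0..n-1, given as a dict of
    adjacency sets, as its Prufer sequence of length n-2: repeatedly
    delete the smallest-numbered leaf and record its neighbour."""
    adj = {u: set(vs) for u, vs in adj.items()}   # work on a copy
    seq = []
    for _ in range(len(adj) - 2):
        leaf = min(u for u, vs in adj.items() if len(vs) == 1)
        (nbr,) = adj.pop(leaf)                    # a leaf's only neighbour
        seq.append(nbr)
        adj[nbr].discard(leaf)
    return seq
```

A path 0-1-2-3 encodes to [1, 2], and a star centred at 0 with leaves 1, 2, 3 encodes to [0, 0]; the sequence determines the tree uniquely.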
- Control of Global Predicates:
One existing technique to tolerate
synchronization faults is to roll back the program to a previous state and re-execute. Since
synchronization failures are usually transient in nature, they may not occur in the
re-execution. Instead of relying on chance, our approach is to control the re-execution in
an attempt to avoid a recurrence of the synchronization failure. The control is achieved by
tracing information during an execution and using this information to add
synchronizations during the re-execution.
This approach gives rise to the predicate control problem, which
takes a computation and a property specified on the computation, and outputs a controlled
computation that maintains the property. We solve the predicate control problem for the
mutual exclusion property, which is important in synchronization fault tolerance. Our
solutions include simple mutual exclusion, as well as read-write mutual exclusion,
independent mutual exclusion, and independent read-write mutual exclusion. We
demonstrate how the techniques can be applied to tolerate synchronization faults.
Furthermore, the techniques have other application domains such as concurrent debuggers
and process control systems.
(DISC 1999; JPDC 2003)
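The core of predicate control for simple mutual exclusion can be viewed as an ordering problem: the critical sections traced from the computation must be serialized consistently with the happened-before order, and the added synchronizations realize that serialization. A minimal sketch using vector clocks (invented names; an illustration of the idea, not the published algorithm):

```python
def serialize_critical_sections(cs):
    """cs is a list of critical sections, each a pair
    (entry_vc, exit_vc) of vector clocks from the traced computation.
    A must precede B in any controlled re-execution whenever
    entry(A) happened-before exit(B): putting B first would add the edge
    exit(B) -> entry(A) and close a causal cycle.  If that precedence
    relation itself has a cycle, no controlling synchronization exists.
    Returns a feasible order of indices, or None."""
    def hb(u, v):                       # vector-clock happened-before
        return all(a <= b for a, b in zip(u, v)) and u != v

    n = len(cs)
    succ = {i: set() for i in range(n)}
    for a in range(n):
        for b in range(n):
            if a != b and hb(cs[a][0], cs[b][1]):
                succ[a].add(b)          # a must precede b
    indeg = [0] * n
    for a in range(n):
        for b in succ[a]:
            indeg[b] += 1
    ready = [i for i in range(n) if indeg[i] == 0]
    order = []
    while ready:                        # Kahn's topological sort
        a = ready.pop()
        order.append(a)
        for b in succ[a]:
            indeg[b] -= 1
            if indeg[b] == 0:
                ready.append(b)
    return order if len(order) == n else None
```

Two causally ordered sections come out in their forced order; two sections with crossing dependencies (each entering before the other exits, causally) yield None, since no re-execution can keep them mutually exclusive.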
- Optimal Synchronization:
We define a class of global
predicates, called region predicates, that can be controlled efficiently in a distributed
computation. We prove that the synchronization generated by our algorithm is optimal.
Further, we introduce the notion of an admissible sequence of events and prove that it is
equivalent to the notion of predicate control.
We then give an efficient algorithm for the class of
disjunctive predicates based on the notion of an admissible sequence.
(PODC 2000; DC 2004)
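For intuition, controllability of a disjunctive predicate (at least one local predicate true in every global state) can be decided by brute-force search over global states; the contribution above is deciding it efficiently via admissible sequences. The following didactic check, with invented names, is exponential and serves only to make the problem statement concrete:

```python
from functools import lru_cache

def controllable(vals):
    """vals[p] is the sequence of truth values of process p's local
    predicate at its successive local states.  Returns True iff the
    processes' steps can be interleaved (i.e. extra synchronization
    added) so that the disjunction of the local predicates holds in
    every global state along the way.  Brute force with memoization."""
    vals = tuple(tuple(v) for v in vals)
    n = len(vals)
    goal = tuple(len(v) - 1 for v in vals)

    @lru_cache(maxsize=None)
    def ok(pos):
        if not any(vals[p][pos[p]] for p in range(n)):
            return False                # disjunction violated here
        if pos == goal:
            return True
        return any(ok(pos[:p] + (pos[p] + 1,) + pos[p + 1:])
                   for p in range(n) if pos[p] + 1 < len(vals[p]))

    return ok(tuple(0 for _ in range(n)))
```

If every process's predicate is eventually false in its final state, no interleaving helps and the check returns False; otherwise a schedule that passes the "some predicate is true" baton between processes is found.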
Vijay K. Garg
2006-08-10