Optimistic Recovery in Multi-threaded Distributed Systems.
Om P. Damani, Ashis Tarafdar and Vijay K. Garg.
Abstract
We address the problem of recovering multi-threaded distributed systems from process crash failures. Although recovery has been widely studied in the context of traditional non-threaded distributed systems, extending those solutions to the multi-threaded scenario presents new problems. We identify and address these problems for optimistic logging protocols.
There are two natural extensions of optimistic logging protocols to the multi-threaded scenario. The first extension is process-centric and logs the points of internal non-determinism caused by threads in a process. The second extension is thread-centric and treats each thread as a separate process. The process-centric approach suffers from false causality, while the thread-centric approach suffers from high causality tracking overhead. By observing that the granularity of failures can be different from the granularity of rollbacks, we eliminate this trade-off with a new balanced approach that achieves low causality tracking overhead without false causality.
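The trade-off between the two extensions can be illustrated with a small sketch. It assumes that causality is tracked with dependency vectors holding one entry per tracked unit; the abstract does not specify the mechanism, so this is an assumption. The names `DependencyTracker`, `run_scenario`, and `THREAD_TO_PROCESS` are hypothetical. The sketch shows only the two natural extensions, not the balanced approach.

```python
def merge(a, b):
    """Component-wise maximum of two dependency vectors."""
    return [max(x, y) for x, y in zip(a, b)]


class DependencyTracker:
    """Hypothetical tracker: one dependency vector per tracked unit."""

    def __init__(self, num_units):
        self.vectors = [[0] * num_units for _ in range(num_units)]

    def send(self, unit):
        # Advance the sender's own entry and piggyback a copy of its vector.
        self.vectors[unit][unit] += 1
        return list(self.vectors[unit])

    def receive(self, unit, stamp):
        # The receiver inherits every dependency carried by the message.
        self.vectors[unit] = merge(self.vectors[unit], stamp)
        self.vectors[unit][unit] += 1


# Assumed layout: threads 0 and 1 share process 0; thread 2 runs in
# process 1; thread 3 runs in process 2.
THREAD_TO_PROCESS = [0, 0, 1, 2]


def run_scenario(unit_of, num_units):
    """Thread 2 sends to thread 0; then thread 1, which never interacted
    with thread 0, sends a message. Returns that message's vector."""
    tracker = DependencyTracker(num_units)
    m1 = tracker.send(unit_of(2))
    tracker.receive(unit_of(0), m1)
    return tracker.send(unit_of(1))


# Process-centric: the tracked unit is the process (3 entries).
process_stamp = run_scenario(lambda t: THREAD_TO_PROCESS[t], 3)
# Thread-centric: the tracked unit is the thread (4 entries).
thread_stamp = run_scenario(lambda t: t, 4)

# False causality: thread 1's message appears to depend on process 1.
print(process_stamp[1] > 0)
# No false causality, but the vector grows with the number of threads.
print(thread_stamp[2] == 0, len(thread_stamp) > len(process_stamp))
```

In the process-centric run, the message from thread 1 carries a dependency on process 1 that thread 1 never acquired, so a rollback of process 1 would needlessly affect its receiver. In the thread-centric run that dependency is absent, at the cost of a vector with one entry per thread.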