PDSL

Parallel and Distributed Systems Lab
The University of Texas at Austin

Current Research

The internet connects millions of computers together. Applications that run on multiple computers connected by the internet are called distributed systems. Currently, the research efforts at PDSL are focussed in the following directions:

Accurate Byzantine Agreement with Feedback
The Byzantine Agreement (BA) problem requires non-faulty processes to agree on a common value. In many applications, it is important that the processes agree on the correct value. In this project, we present a problem called Accurate Byzantine Agreement with Feedback (ABAF) in which all processes receive common feedback from the environment indicating if the value they agreed upon was correct or not. We present an algorithm that solves the ABAF problem based on a standard solution to the BA problem and a multiplicative method.

Publications: PODC 2011.
Students: John Bridgman, Bharath Balasubramanian.

Weighted Byzantine Agreement
In this project we explore a weighted version of the Byzantine Agreement Problem and its solution under various conditions. In this version, each machine is assigned a weight depending on the application. Instead of assuming that at most f out of N machines fail, the algorithm assumes that the total weight of the machines that fail is at most f/N. By using weights, the system can reach consensus in the presence of Byzantine failures, even when more than N/3 processes fail, so long as the total weight of the failed processes is less than 1/3.

Publications: IPDPS 2011.
Students: John Bridgman.

Fault Tolerance in Data Structures
The paper describes a technique to correct faults in large data structures hosted on distributed servers, based on the concept of fused backups. The prevalent solution to this problem is replication. To correct f crash faults (dead/unresponsive data structures) among n distinct data structures, replication requires f replicas of each data structure, resulting in nf additional backups. We present a solution, that uses a combination of error/erasure correcting codes and selective replication to correct f faults using just f additional backups.

Publications: ICDCS 2011, DISC 2010, DISC 2007.
Students: Bharath Balasubramanian, Vinit Ogale.

Fault Tolerance in Finite State Machines
Given a set of n different deterministic finite state machines (DFSMs) modeling a distributed system, we examine the problem of tolerating f crash or Byzantine faults in such a system. The traditional approach to this problem involves replication and requires nf backup DFSMs for crash faults and 2nf backup DFSMs for Byzantine faults. We present a generic approach called (f,m)-fusion that permits lesser number of backups than the replication. We show that the state space required by our approach is far lesser than that of replication.

Publications: IPDPS 2009, ICDCN 2008.
Students: Bharath Balasubramanian, Vinit Ogale.

Combining Replication with Error Correcting Codes

We are working on a method to implement fault-tolerant services in distributed systems based on the idea of fused state machines. The theory of fused state machines uses a combination of coding theory and replication to ensure efficiency as well as savings in storage and mes- sages during normal operations. Fused state machines may incur higher overhead during recovery from crash or Byzantine faults, but that may be acceptable if the probability of fault is low. For crash faults, we give an algorithm that requires the optimal f backup state machines for tolerating f faults in the sys- tem of n machines. For Byzantine faults, we propose an algorithm that requires only nf + f additional state machines, as opposed to 2nf state machines.

Software Fault-tolerance of Distributed Programs

How to ensure that applcations run proplerly even when one or more computers malfunction? We are currently working on a NSF funded project in this area. We have developed efficient techniques for tracking dependency in distributed systems, detecting stable and unstable predicates, controlling distributed computations, etc.

Software Infrastructure for the Internet Applications

How to let common users write Internet applications? How to harness computing power of multiple computers? We are currently working on a project funded by Texas Higher Education Coordinating Board for developing a distributed computing platform for applications in Chemistry (analyzing catalysts). This project is joint with Dr. Henkelman in the Department of Chemistry.

Model Checking of Distributed Programs

How can one verify the correctness of distributed programs. We have developed a tool called TC-SPIN that verifies correctness of a distributed program without explicit global state enumeration. We have also developed a runtime verification tool called POTA that verifies a single execution of a distributed program. We are currently working on a project funded by Semiconductor Research Consortium (SRC) for verification of concurrent hardware.

Distributed Debugging

How to identify faults in distributed programs? We have developed algorithms that allow efficient obervation and control of distributed programs. This project has been funded by NSF.