The Internet connects millions of computers. Applications that run on multiple computers connected by the Internet are called distributed systems. Currently, my students and I at PDSL are focused on research in the following areas:
Combining Replication with Error Correcting Codes
We are working on a method for implementing fault-tolerant services in distributed systems based on the idea of fused state machines. The theory of fused state machines combines coding theory with replication to keep normal operation efficient while saving storage and messages. Fused state machines may incur higher overhead during recovery from crash or Byzantine faults, but that may be acceptable if the probability of a fault is low. For crash faults, we give an algorithm that requires the optimal f backup state machines to tolerate f faults in a system of n machines. For Byzantine faults, we propose an algorithm that requires only nf + f additional state machines, as opposed to the 2nf state machines needed with replication.
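To make the idea concrete, here is a minimal sketch of fusion for a single crash fault (f = 1), assuming the primary state machines are simple integer counters and the fused backup stores only their sum. The names Counter, FusedBackup, and backups_needed are hypothetical and chosen for illustration; this is not the algorithm from our papers, only a toy instance of the same idea. The helper at the end simply evaluates the backup counts quoted above.

```python
from typing import Dict, List


class Counter:
    """A trivial state machine whose entire state is one integer."""

    def __init__(self) -> None:
        self.value = 0

    def apply(self, delta: int) -> None:
        self.value += delta


class FusedBackup:
    """Hypothetical fused backup: keeps one integer, the sum of all primaries."""

    def __init__(self) -> None:
        self.total = 0

    def apply(self, delta: int) -> None:
        # Every update applied to any primary is also applied here.
        self.total += delta

    def recover(self, surviving_values: List[int]) -> int:
        # The crashed counter's value is the sum minus the surviving values.
        return self.total - sum(surviving_values)


def backups_needed(n: int, f: int) -> Dict[str, int]:
    """Backup counts for n primaries and f faults, as quoted in the text."""
    return {
        "crash_replication": n * f,      # f extra copies of each primary
        "crash_fused": f,
        "byzantine_replication": 2 * n * f,
        "byzantine_fused": n * f + f,
    }


# Three primaries share one fused backup instead of three replicas.
primaries = [Counter() for _ in range(3)]
backup = FusedBackup()
for index, delta in [(0, 5), (1, 2), (2, 7), (0, -1)]:
    primaries[index].apply(delta)
    backup.apply(delta)

# Suppose primary 1 crashes; rebuild its state from the survivors.
survivors = [primaries[0].value, primaries[2].value]
recovered = backup.recover(survivors)
```

Recovery here needs the state of every surviving primary, which illustrates the trade-off described above: less storage during normal operation, more work when a fault actually occurs.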
Software Fault-tolerance of Distributed Programs
How can we ensure that applications run properly even when one or more computers malfunction? We are currently working on an NSF-funded project in this area. We have developed efficient techniques for tracking dependencies in distributed systems, detecting stable and unstable predicates, controlling distributed computations, and related problems.
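As an illustration of what tracking dependencies means, the sketch below uses vector clocks, one standard textbook technique for this problem. It is offered as background under that assumption and is not necessarily the technique developed in the project; all function names are hypothetical.

```python
from typing import List


def new_clock(num_processes: int) -> List[int]:
    """Return an all-zero vector clock for a system of num_processes."""
    return [0] * num_processes


def local_event(clock: List[int], pid: int) -> List[int]:
    """Advance process pid's own component for a local or send event."""
    updated = list(clock)
    updated[pid] += 1
    return updated


def receive_event(clock: List[int], message_clock: List[int], pid: int) -> List[int]:
    """Merge the sender's clock into ours, then count the receive itself."""
    merged = [max(own, other) for own, other in zip(clock, message_clock)]
    merged[pid] += 1
    return merged


def happened_before(a: List[int], b: List[int]) -> bool:
    """True if the event stamped a causally precedes the event stamped b."""
    return a != b and all(x <= y for x, y in zip(a, b))


def concurrent(a: List[int], b: List[int]) -> bool:
    """True if neither event causally precedes the other."""
    return a != b and not happened_before(a, b) and not happened_before(b, a)


# Two processes: each performs one independent event, then P0 messages P1.
send_stamp = local_event(new_clock(2), 0)      # P0 sends a message
p1_local = local_event(new_clock(2), 1)        # P1 does unrelated work
p1_receive = receive_event(p1_local, send_stamp, 1)
```

In this run the send on P0 and the earlier local event on P1 are concurrent, while the send causally precedes the receive.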
Software Infrastructure for Internet Applications
How can ordinary users write Internet applications? How can the computing power of multiple computers be harnessed? We are currently working on a project funded by the Texas Higher Education Coordinating Board to develop a distributed computing platform for applications in chemistry (analyzing catalysts). This project is joint work with Dr. Henkelman in the Department of Chemistry.
Model Checking of Distributed Programs
How can one verify the correctness of distributed programs? We have developed a tool called TC-SPIN that verifies the correctness of a distributed program without explicit global state enumeration. We have also developed a runtime verification tool called POTA that verifies a single execution of a distributed program. We are currently working on a project funded by the Semiconductor Research Consortium (SRC) on the verification of concurrent hardware.
Distributed Debugging
How can faults in distributed programs be identified? We have developed algorithms that allow efficient observation and control of distributed programs. This project has been funded by NSF.