A Fusion-based Approach for Handling Multiple Faults in Distributed Systems

Abstract

Given a set of n data structures, we present the concept of an (f,m)-fusion that can correct f crash faults or detect f Byzantine faults among them, using m additional backup structures. The traditional solution to this problem has been replication, which requires f copies of each data structure, resulting in a total of n.f backup structures. In this paper, we a present a solution based on Reed-Solomon erasure codes that requires just f additional backup structures. We provide efficient algorithms to backup up linked lists and list-based queues and present results comparing their performance with a simple replication-based solution. The results indicate that the fusion-based solutions are space efficient, though recovery is cheaper in replication. In systems with infrequent failures, this may be an acceptable compromise. Further, we explore the problem of correcting faults among the data structures, when sufficient servers are not available to host the backups. We prove that ceil [n/(n+k-f)].f backups are necessary and sufficient to correct f crash faults or detect f Byzantine faults among the data structures, when there are k additional servers available to host the backup.