Incorporating Fault Tolerance with Replication on Very Large Scale Grids (original) (raw)

Providing fault tolerance for message passing parallel application on a distributed environment is a rule rather than an exception. A node failure can cause the whole computation to stop and has to be restarted from the beginning if no fault tolerance is available. However, introducing fault tolerance has some overhead on speedup that can be achieved. In this paper, we introduce a new technique called replication with cross-over packets for reliability and to increase fault tolerance over Very Large Scale Grids (VLSG). This technique has two pronged effect of avoiding single point of failure and single link of failure. We incorporate this new technique into the L-BSP model and show the possible speedup of parallel process. We also derive the achievable speedup for some fundamental parallel algorithms using this technique.