Byzantine fault-tolerant MapReduce: Faults are not just crashes

On the Performance of Byzantine Fault-Tolerant MapReduce

IEEE Transactions on Dependable and Secure Computing, 2013

MapReduce is often used for critical data processing, e.g., in the context of scientific or financial simulation. However, there is evidence in the literature that arbitrary (or Byzantine) faults occur and may corrupt the results of MapReduce without being detected. We present a Byzantine fault-tolerant MapReduce framework that can run in two modes: non-speculative and speculative. We thoroughly evaluate the performance of these two versions of the framework experimentally, showing that they use around twice the resources of Hadoop MapReduce, instead of the three times more required by alternative solutions. We believe this cost is acceptable for many critical applications.
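
The abstract does not detail the two modes, so the sketch below is only one plausible reading of them, with all names invented here: the non-speculative mode validates each task's output by comparing digests from f+1 replicas before the next phase consumes it, while the speculative mode hands the first output downstream immediately and validates it in the background.

```python
import hashlib
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def digest(output: bytes) -> str:
    # Hash a task's output so replica results can be compared cheaply.
    return hashlib.sha256(output).hexdigest()

def run_task(task, executor):
    # Submit one replica of a deterministic, zero-argument task;
    # the future resolves to the digest of its output.
    return executor.submit(lambda: digest(task()))

def non_speculative(task, executor, f=1):
    """Run f+1 replicas and wait until some digest has f+1 votes,
    launching extra replicas only when the initial ones disagree."""
    futures = [run_task(task, executor) for _ in range(f + 1)]
    votes = Counter()
    while True:
        for fut in futures:
            votes[fut.result()] += 1
        winner, count = votes.most_common(1)[0]
        if count >= f + 1:
            return winner                       # validated before the next phase starts
        futures = [run_task(task, executor)]    # disagreement: pay for one more replica

def speculative(task, executor, f=1):
    """Pass the first replica's output to the next phase immediately and
    validate it in the background; re-execution happens only on mismatch."""
    first = run_task(task, executor).result()
    check = run_task(task, executor)            # background validation replica
    def validated():
        return check.result() == first          # caller redoes downstream work if False
    return first, validated
```

With f = 1 the common path executes each task twice, which is consistent with the roughly twofold resource usage reported above, compared to the threefold cost of the alternative solutions mentioned.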

On the Feasibility of Byzantine Fault-Tolerant MapReduce in Clouds-of-Clouds

2012

MapReduce is a framework for processing large data sets largely used in cloud computing. MapReduce implementations like Hadoop can tolerate crashes and file corruptions, but there is evidence that general arbitrary faults do occur and can affect the correctness of job executions. Furthermore, many individual cloud outages have been reported, raising concerns about depending on a single cloud.

Fault Tolerance in MapReduce: A Survey

Computer Communications and Networks, 2016

Data-intensive computing has become one of the most popular forms of parallel computing, owing to the explosion of digital data we are living through. This data expansion has mainly come from three sources: (i) scientific experiments in fields such as astronomy, particle physics, or genomics; (ii) data from sensors; and (iii) content published by citizens in channels such as social networks. Data-intensive computing systems, such as Hadoop MapReduce, have as their main goal the processing of an enormous amount of data in a short time, by moving the computation to where the data resides. In failure-free scenarios, these frameworks usually achieve good results. Given that failures are common at large scale, these frameworks include fault-tolerance and dependability techniques as built-in features. In particular, MapReduce frameworks tolerate machine failures (crash failures) by re-executing all the tasks of the failed machine, which is possible by virtue of data replication. Furthermore, in order to mask temporary failures caused by network or machine overload (timing failures), where some tasks perform relatively slower than others, Hadoop relaunches copies of these tasks on other machines.
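
The two built-in mechanisms named here can be sketched in a few lines of Python. This is a simplification, not Hadoop's scheduler; all names and the straggler threshold are illustrative.

```python
import time
from dataclasses import dataclass, field

@dataclass
class Task:
    task_id: str
    node: str
    start: float = field(default_factory=time.time)
    done: bool = False

def reassign_failed_node(tasks, failed_node, live_nodes):
    """Crash failures: every unfinished task that ran on the failed machine
    is re-executed elsewhere, which is possible because the distributed file
    system keeps the input blocks replicated on other nodes."""
    for t in tasks:
        if t.node == failed_node and not t.done:
            t.node = live_nodes[hash(t.task_id) % len(live_nodes)]
            t.start = time.time()   # restarted from scratch on the new node

def speculate_stragglers(tasks, now=None, slowdown=1.5):
    """Timing failures: return tasks running much longer than their peers,
    for which a backup copy would be launched on another machine (a
    simplified version of speculative execution)."""
    now = now or time.time()
    running = [t for t in tasks if not t.done]
    if len(running) < 2:
        return []
    avg = sum(now - t.start for t in running) / len(running)
    return [t for t in running if now - t.start > slowdown * avg]
```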

Medusa: An Efficient Cloud Fault-Tolerant MapReduce

2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), 2016

Applications such as web search and social networking have been moving from centralized to decentralized cloud architectures to improve their scalability. MapReduce, a programming framework for processing large amounts of data using thousands of machines in a single cloud, also needs to be scaled out to multiple clouds to adapt to this evolution. The challenge of building a multi-cloud distributed architecture is substantial. Moreover, the need to deal with the new types of faults introduced by such a setting, such as the outage of a whole datacenter or an arbitrary fault caused by a malicious cloud insider, increases the endeavor considerably. In this paper we propose Medusa, a platform that allows MapReduce computations to scale out to multiple clouds and tolerate several types of faults. Our solution fulfills four objectives. First, it is transparent to the user, who writes her typical MapReduce application without modification. Second, it does not require any modification to the widely used Hadoop framework. Third, the proposed system goes well beyond the fault tolerance offered by MapReduce to tolerate arbitrary faults, cloud outages, and even malicious faults caused by corrupt cloud insiders. Fourth, it achieves this increased level of fault tolerance at reasonable cost. We performed an extensive experimental evaluation in the ExoGENI testbed, demonstrating that our solution significantly reduces execution time when compared to traditional methods that achieve the same level of resilience.
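
The scheduling algorithm itself is not given in the abstract; the sketch below only illustrates the general idea it describes: running the same deterministic job in more than one cloud, cross-checking output digests, and falling back to a spare cloud on disagreement or outage. The helper run_job_in_cloud and all parameters are hypothetical.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def run_job_in_cloud(cloud: str, job) -> str:
    # Hypothetical driver: submit the unmodified Hadoop job to one cloud
    # and return a digest of its output.
    output = job(cloud)
    return hashlib.sha256(output).hexdigest()

def cross_cloud_job(job, clouds, f=1):
    """Run the same deterministic job in f+1 clouds and accept the output
    only when f+1 digests match; disagreement or an outage triggers a
    re-run in one of the spare clouds."""
    primary, spare = list(clouds[:f + 1]), list(clouds[f + 1:])
    with ThreadPoolExecutor() as pool:
        digests = {}
        futures = {pool.submit(run_job_in_cloud, c, job): c for c in primary}
        for fut, cloud in futures.items():
            try:
                digests[cloud] = fut.result(timeout=3600)
            except Exception:
                digests[cloud] = None            # cloud outage or arbitrary failure
        while True:
            votes = {}
            for d in digests.values():
                if d is not None:
                    votes[d] = votes.get(d, 0) + 1
            agreed = [d for d, v in votes.items() if v >= f + 1]
            if agreed:
                return agreed[0]
            if not spare:
                raise RuntimeError("not enough agreeing clouds")
            extra = spare.pop(0)
            try:
                digests[extra] = pool.submit(run_job_in_cloud, extra, job).result()
            except Exception:
                digests[extra] = None
```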

IJERT-Design and Optimization of Secure Byzantine Fault-Tolerant MapReduce on Large Cluster

International Journal of Engineering Research and Technology (IJERT), 2014

https://www.ijert.org/design-and-optimization-of-secure-byzantine-fault-tolerant-mapreduce-on-large-cluster
https://www.ijert.org/research/design-and-optimization-of-secure-byzantine-fault-tolerant-mapreduce-on-large-cluster-IJERTV3IS20391.pdf

Most Byzantine fault-tolerant state machine replication (BFT) algorithms have a primary replica that is in charge of ordering the clients' requests. Recently it was shown that this dependence allows a faulty primary to degrade the performance of the system to a small fraction of what the environment allows. In this paper we present a Kerberos-based model with tokens for data blocks and processing nodes. We are also especially interested in the performance of a Byzantine fault-tolerant MapReduce framework that can run in two modes: (a) non-speculative and (b) speculative. We designed the framework so that it uses around twice the resources instead of the three times required by alternative solutions. This novel mode of operation deals with those attacks at much lower cost.
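
The abstract mentions a Kerberos-based model with tokens for data blocks and processing nodes but does not detail the protocol; the sketch below is only a generic illustration of time-limited, signed block-access tokens. All names and the HMAC scheme are assumptions, not the paper's design.

```python
import hashlib
import hmac
import time

SECRET = b"shared-key-between-issuer-and-storage-nodes"   # illustrative only

def issue_block_token(block_id: str, node_id: str, ttl: int = 600) -> str:
    """Issue a token binding a data block to a processing node for a limited
    time, signed with a key the storage nodes share with the issuer."""
    expiry = int(time.time()) + ttl
    payload = f"{block_id}:{node_id}:{expiry}"
    mac = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}:{mac}"

def verify_block_token(token: str, block_id: str, node_id: str) -> bool:
    """A node presenting the token gets the block only if the signature is
    valid, the block/node binding matches, and the token has not expired."""
    try:
        blk, node, expiry, mac = token.rsplit(":", 3)
    except ValueError:
        return False
    payload = f"{blk}:{node}:{expiry}"
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return (hmac.compare_digest(mac, expected)
            and blk == block_id and node == node_id
            and int(expiry) > time.time())
```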

Chrysaor: Fine-Grained, Fault-Tolerant Cloud-of-Clouds MapReduce

2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), 2017

MapReduce is a framework for processing large data sets widely used in the context of cloud computing. MapReduce implementations like Hadoop can tolerate crashes and file corruptions, but not arbitrary faults. Unfortunately, there is evidence that arbitrary faults do occur and can affect the correctness of MapReduce job executions. Furthermore, many outages of major cloud offerings have been reported, raising concerns about the dependence on a single cloud. In this paper we propose a novel execution system that allows MapReduce computations to scale out to a cloud-of-clouds and tolerate arbitrary faults, malicious faults, and cloud outages. Our system, Chrysaor, is based on a fine-grained replication scheme that tolerates faults at the task level. Our solution has three important properties: it tolerates the above-mentioned classes of faults at reasonable cost; it requires minimal modifications to the users' applications; and it does not involve changes to the Hadoop source code. We performed an extensive evaluation of our system in Amazon EC2, showing that our fine-grained solution is efficient in terms of computation by recovering only faulty tasks. This is achieved without incurring a significant penalty in the baseline case (i.e., without faults) for most workloads.
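
As a rough illustration of task-level granularity (not Chrysaor's actual scheme, which the abstract does not detail), the sketch below compares per-task output digests from two executions of a job and re-runs only the tasks that disagree, assuming at most one faulty execution per task.

```python
import hashlib

def task_digest(output: bytes) -> str:
    # Hash one task's output so executions can be compared without shipping data.
    return hashlib.sha256(output).hexdigest()

def tasks_to_reexecute(run_a: dict, run_b: dict) -> list:
    """Compare per-task digests (task_id -> digest) from two executions of
    the same job and return only the tasks whose outputs disagree. A
    job-level scheme would re-run everything; fine-grained recovery re-runs
    just these tasks."""
    return [tid for tid in run_a if run_b.get(tid) != run_a[tid]]

def resolve(tid, run_a, run_b, third_execution):
    """Tie-breaker under the assumption of at most one faulty execution per
    task: the digest confirmed by a third run is accepted as the output."""
    d3 = task_digest(third_execution(tid))
    if d3 in (run_a[tid], run_b[tid]):
        return d3
    raise RuntimeError(f"no two executions of task {tid} agree")
```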

Method for testing the fault tolerance of MapReduce frameworks

Computer Networks, 2015

A MapReduce framework abstracts distributed system issues, integrating a distributed file system with an application's needs. However, the lack of determinism in distributed system components and of reliability in the network may cause application errors that are difficult to identify, find, and correct. This paper presents a method to create a set of fault cases, derived from a Petri net (PN), and a framework to automate the execution of these fault cases in a distributed system. The framework controls each MapReduce component and injects faults according to the component's state. Experimental results showed that the fault cases are representative for testing Hadoop, a MapReduce implementation. We tested three versions of Hadoop and identified bugs and elementary behavioral differences between the versions. The method provides network reliability enhancements as a byproduct, because it identifies errors caused by a service or system bug instead of simply assigning them to the network.
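
The Petri net model itself is out of scope here, but a minimal sketch of the second ingredient, state-driven fault injection, might look as follows; the FaultCase fields and the kill/delay hooks are hypothetical stand-ins for the paper's framework.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FaultCase:
    component: str      # e.g. "tasktracker-3" (hypothetical identifier)
    state: str          # component state in which the fault should fire
    fault: str          # e.g. "crash" or "hang"

class FaultInjector:
    """Fires a fault when a monitored component reaches the state named in a
    fault case; the cases themselves would be derived from the Petri net
    model of the protocol."""
    def __init__(self, cases):
        self.pending = set(cases)
        self.fired = []

    def on_state_change(self, component, state, kill, delay):
        # kill and delay are callbacks provided by the test harness.
        for case in list(self.pending):
            if case.component == component and case.state == state:
                self.pending.remove(case)
                self.fired.append(case)
                if case.fault == "crash":
                    kill(component)
                elif case.fault == "hang":
                    delay(component, 30)
```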

Experimental Analysis in Hadoop MapReduce: A Closer Look at Fault Detection and Recovery Techniques

Sensors

Hadoop MapReduce reactively detects and recovers faults after they occur, based on static heartbeat detection and re-execution from scratch. However, these techniques lead to excessive response time penalties and inefficient resource consumption during detection and recovery. Existing fault-tolerance solutions intend to mitigate these limitations without considering critical conditions such as fail-slow faults, the impact of faults at various infrastructure levels, and the relationship between the detection and recovery stages. This paper analyses the response time under two main conditions, fail-stop and fail-slow, when they manifest at the node, service, and task levels at runtime. In addition, we focus on the relationship between the time for detecting and recovering faults. The experimental analysis is conducted on a real Hadoop cluster comprising the MapReduce, YARN and HDFS frameworks. Our analysis shows that the recovery of a single fault leads to an average of 67.6% re...
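
A minimal sketch of the static heartbeat detection the paper analyses is shown below; the interval and timeout constants are illustrative, and the class is not Hadoop's implementation. It makes the two limitations visible: a fail-stop node is only declared failed after the full timeout has elapsed, and a fail-slow node that keeps sending heartbeats is never flagged.

```python
import time

HEARTBEAT_INTERVAL = 3      # seconds between heartbeats (illustrative)
EXPIRY_TIMEOUT = 600        # static expiry window (illustrative)

class HeartbeatMonitor:
    """Static detection: a node is declared failed only after no heartbeat
    has arrived for a fixed timeout; recovery (re-execution from scratch)
    starts only after that, which is where the response time penalty comes from."""
    def __init__(self):
        self.last_seen = {}

    def heartbeat(self, node_id):
        # Called each time a node reports in.
        self.last_seen[node_id] = time.time()

    def failed_nodes(self, now=None):
        # Nodes silent for longer than the static timeout are declared failed;
        # a slow but still-heartbeating node never appears in this list.
        now = now or time.time()
        return [n for n, t in self.last_seen.items()
                if now - t > EXPIRY_TIMEOUT]
```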

Ganesha: black-box fault diagnosis for MapReduce systems

Hot Metrics, 2008

Performance problems in distributed systems can be hard to diagnose and to localize to a specific node or a set of nodes. There are many challenges in problem localization (i.e., tracing the problem back to the culprit node) and root-cause analysis (i.e., tracing the problem further to the underlying code-level fault or bug, e.g., memory leak, deadlock). As we show, performance problems can originate at one node in the system and then start to manifest at other nodes as well, due to the inherent communication across components; this can make it hard to discover the original culprit node.
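
The abstract above covers only the motivation, so the following sketch is not Ganesha's algorithm; it merely illustrates the general black-box idea of peer comparison: nodes running the same MapReduce workload should behave similarly, so a node whose metrics deviate strongly from its peers is a candidate culprit.

```python
import statistics

def peer_outliers(metrics, threshold=3.0):
    """Flag nodes whose value of one black-box metric (e.g. CPU or disk
    activity) deviates strongly from the median of their peers.
    metrics maps node id -> observed value."""
    values = list(metrics.values())
    if len(values) < 3:
        return []
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values) or 1e-9
    return [node for node, v in metrics.items()
            if abs(v - med) / mad > threshold]

# Example: node4 is flagged as the likely culprit.
print(peer_outliers({"node1": 0.42, "node2": 0.45, "node3": 0.44, "node4": 3.9}))
```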
