AutoDiagn: An Automated Real-time Diagnosis Framework for Big Data Systems

HybridTune: Spatio-Temporal Performance Data Correlation for Performance Diagnosis of Big Data Systems

Journal of Computer Science and Technology, 2019

With the tremendous growth of interest in Big Data, improving the performance of Big Data systems is becoming more and more important. Among the many steps involved, the first is to analyze and diagnose the performance bottlenecks of these systems. Currently, there are two major solutions: the purely data-driven diagnosis approach, which can be very time-consuming, and the rule-based analysis method, which usually requires prior knowledge. For Big Data applications such as Spark workloads, we observe that tasks in the same stage normally execute the same or similar code on each data partition. On the basis of this stage similarity and the distributed characteristics of Big Data systems, we analyze the behavior of Big Data applications in terms of both system-level and micro-architectural metrics for each stage. Furthermore, for different performance problems, we propose a hybrid approach that combines prior rules and machine learning algorithms to detect performance anomalies, such as straggler tasks, task assignment imbalance, data skew, abnormal nodes and outlier metrics. Following this methodology, we design and implement a lightweight, extensible tool named HybridTune, and measure its overhead and anomaly detection effectiveness using the BigDataBench benchmarks. Our experiments show that the overhead of HybridTune is only 5%, and the accuracy of the outlier detection algorithm reaches up to 93%. Finally, we report several use cases diagnosing Spark and Hadoop workloads with BigDataBench, which demonstrate the potential uses of HybridTune.
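
As a rough illustration of the rule-based side of such an approach, the sketch below flags straggler tasks and skewed partitions by exploiting stage similarity: tasks in the same stage should have comparable durations and input sizes. The record fields and thresholds are assumptions for illustration, not HybridTune's actual rules.

```python
from collections import defaultdict
from statistics import median

# Hypothetical task records: the field names and thresholds below are
# illustrative assumptions, not HybridTune's actual configuration.
tasks = [
    {"stage": 1, "task": "t1", "host": "n1", "duration_s": 12.0, "input_mb": 128},
    {"stage": 1, "task": "t2", "host": "n2", "duration_s": 11.5, "input_mb": 130},
    {"stage": 1, "task": "t3", "host": "n3", "duration_s": 36.0, "input_mb": 512},
]

STRAGGLER_FACTOR = 2.0   # assumed rule: duration > 2x the stage median
SKEW_FACTOR = 2.0        # assumed rule: input size > 2x the stage median

def detect_stage_anomalies(tasks):
    """Apply simple per-stage rules that rely on stage similarity:
    tasks in the same stage run the same code, so their durations and
    input sizes should be comparable."""
    by_stage = defaultdict(list)
    for t in tasks:
        by_stage[t["stage"]].append(t)

    findings = []
    for stage, ts in by_stage.items():
        med_dur = median(t["duration_s"] for t in ts)
        med_in = median(t["input_mb"] for t in ts)
        for t in ts:
            if t["duration_s"] > STRAGGLER_FACTOR * med_dur:
                findings.append((stage, t["task"], "straggler"))
            if t["input_mb"] > SKEW_FACTOR * med_in:
                findings.append((stage, t["task"], "data skew"))
    return findings

print(detect_stage_anomalies(tasks))
# e.g. [(1, 't3', 'straggler'), (1, 't3', 'data skew')]
```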

HybridTune: Spatio-temporal Data and Model Driven Performance Diagnosis for Big Data Systems

arXiv (Cornell University), 2017

With the tremendous growth of interest in Big Data systems, analyzing and facilitating their performance improvement have become increasingly important. Although there has been much research on improving the performance of Big Data systems, efficiently analyzing and diagnosing performance bottlenecks across these massively distributed systems remains a major challenge. In this paper, we propose a spatio-temporal correlation analysis approach based on the stage and distribution characteristics of Big Data applications, which associates multi-level performance data at a fine granularity. On the basis of the correlated data, we define a set of prior rules and, for different performance bottlenecks such as workload imbalance, data skew, abnormal nodes and outlier metrics, select features and vectorize the corresponding datasets. We then apply data- and model-driven algorithms to detect and diagnose these bottlenecks. In addition, we design and develop a lightweight, extensible tool, HybridTune, and validate its diagnosis effectiveness with BigDataBench in several benchmark experiments, in which it outperforms state-of-the-art methods. Our experiments show that the accuracy of our abnormal/outlier detection reaches about 80%. Finally, we report several Spark and Hadoop use cases that demonstrate how HybridTune helps users carry out performance analysis and diagnosis efficiently on Spark and Hadoop applications; our experience shows that HybridTune can help users find performance bottlenecks and provides optimization recommendations.
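
The spatio-temporal correlation step described above can be pictured as joining node-level metric samples to task execution windows by host and timestamp. The sketch below shows this idea only; the record layouts are assumed and do not reflect HybridTune's internal data model.

```python
from dataclasses import dataclass

# Minimal sketch of spatio-temporal correlation: attach each node-level
# metric sample to the stage/task that was running on that node at that
# time. The record layouts are assumptions for illustration.

@dataclass
class MetricSample:
    host: str
    ts: float          # seconds since epoch
    cpu_util: float

@dataclass
class TaskInterval:
    stage: int
    task: str
    host: str
    start: float
    end: float

def correlate(samples, intervals):
    """Return (stage, task, sample) triples for samples that fall inside
    a task's execution window on the same host (the 'spatial' key is the
    host, the 'temporal' key is the timestamp)."""
    out = []
    for s in samples:
        for iv in intervals:
            if iv.host == s.host and iv.start <= s.ts <= iv.end:
                out.append((iv.stage, iv.task, s))
    return out

samples = [MetricSample("n1", 100.0, 0.35), MetricSample("n1", 105.0, 0.90)]
intervals = [TaskInterval(1, "t1", "n1", 99.0, 103.0),
             TaskInterval(2, "t7", "n1", 104.0, 110.0)]
for stage, task, s in correlate(samples, intervals):
    print(stage, task, s.ts, s.cpu_util)
```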

Ganesha: black-box fault diagnosis for MapReduce systems

Hot Metrics, 2008

Performance problems in distributed systems can be hard to diagnose and to localize to a specific node or set of nodes. There are many challenges in problem localization (i.e., tracing the problem back to the culprit node) and root-cause analysis (i.e., tracing the problem further to the underlying code-level fault or bug, e.g., a memory leak or deadlock). As we show, performance problems can originate at one node in the system and then start to manifest at other nodes as well, due to the inherent communication across components; this can make it hard to discover the original culprit node.

Kahuna: Problem diagnosis for MapReduce-based cloud computing environments

… (NOMS), 2010 IEEE, 2010

We present Kahuna, an approach that aims to diagnose performance problems in MapReduce systems. Central to Kahuna's approach is our insight on peer-similarity: nodes behave alike in the absence of performance problems, and a node that behaves differently is the likely culprit of a performance problem. We present applications of Kahuna's insight in techniques, and their associated algorithms, that statistically compare black-box (OS-level performance metrics) and white-box (Hadoop-log statistics) data across the different nodes of a MapReduce cluster in order to identify the faulty node(s). We also present empirical evidence of our peer-similarity observations from the 4000-processor Yahoo! M45 Hadoop cluster. In addition, we demonstrate Kahuna's effectiveness through experimental evaluation of two algorithms for a number of reported performance problems, on four different workloads in a 100-node Hadoop cluster running on Amazon's EC2 infrastructure.
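
A minimal sketch of the peer-similarity idea follows, assuming each node is summarized by a vector of black-box metrics over a time window; the metrics, distance measure and ranking are illustrative assumptions rather than Kahuna's published algorithms.

```python
import math

# Illustrative peer-similarity check in the spirit of Kahuna: summarize each
# node by a vector of black-box metrics over a window, then flag the node
# whose summary deviates most from its peers. The metrics, distance measure,
# and ranking are assumptions, not Kahuna's published algorithm.

node_metrics = {
    "n1": [0.62, 120.0, 5.1],   # e.g. cpu_util, disk_kBps, net_MBps
    "n2": [0.60, 118.0, 5.3],
    "n3": [0.95, 460.0, 0.7],   # the odd one out
    "n4": [0.58, 125.0, 5.0],
}

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_by_peer_distance(node_metrics):
    """Average distance of each node's metric vector to all its peers;
    the largest value points at the likely culprit node."""
    scores = {}
    for n, v in node_metrics.items():
        peers = [w for m, w in node_metrics.items() if m != n]
        scores[n] = sum(euclidean(v, w) for w in peers) / len(peers)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(rank_by_peer_distance(node_metrics))  # n3 should rank first
```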

ASDF: Automated online fingerpointing for Hadoop

Parallel Data …, 2008

Localizing performance problems (or fingerpointing) is essential for distributed systems such as Hadoop that support long-running, parallelized, data-intensive computations over a large cluster of nodes. Manual fingerpointing does not scale in such environments because of the number of nodes and the number of performance metrics to be analyzed on each node. ASDF is an automated, online fingerpointing framework that transparently extracts and parses different time-varying data sources (e.g., sysstat, Hadoop logs) on each node, and implements multiple techniques (e.g., log analysis, correlation, clustering) to analyze these data sources jointly or in isolation. We demonstrate ASDF's online fingerpointing for documented performance problems in Hadoop, under different workloads; our results indicate that ASDF incurs an average monitoring overhead of 0.38% of CPU time, and exhibits average online fingerpointing latencies of less than 1 minute with false-positive rates of less than 1%.

ASDF: An Automated, Online Framework for Diagnosing Performance Problems

Lecture Notes in Computer Science, 2010

Performance problems account for a significant percentage of documented failures in large-scale distributed systems, such as Hadoop. Localizing the source of these performance problems can be frustrating due to the overwhelming amount of monitoring information available. We automate problem localization using ASDF, an online diagnostic framework that transparently monitors and analyzes different time-varying data sources (e.g., OS performance counters, Hadoop logs) and narrows down performance problems to a specific node or a set of nodes. ASDF's flexible architecture allows system administrators to easily customize data sources and analysis modules for their unique operating environments. We demonstrate the effectiveness of ASDF's diagnostics on documented performance problems in Hadoop; our results indicate that ASDF incurs an average monitoring overhead of 0.38% of CPU time and achieves a balanced accuracy of 80% at localizing problems to the culprit node.

Problem-diagnosis techniques tend to gather data about the system and/or the application to develop a priori templates of normal, problem-free system behavior; the techniques then detect performance problems by looking for anomalies in runtime data, as compared to the templates. Typically, these analysis techniques are run offline and post-process the data gathered from the system. The data used to develop the models and to perform the diagnosis can be collected in different ways. A white-box diagnostic approach extracts application-level data directly and requires instrumenting the application and possibly understanding the application's internal structure or semantics. A black-box diagnostic approach aims to infer application behavior by extracting data transparently from the operating system or network without needing to instrument the application or to understand its internal structure or semantics. Obviously, it might not be scalable (in effort, time and cost) or even possible to employ a white-box approach in production environments that contain many third-party services, applications and users. A black-box approach also has its drawbacks: while such an approach can infer application behavior to some extent, it might not always be able to pinpoint the root cause of a performance problem. Typically, a black-box approach is more effective at problem localization, while a white-box approach extracts more information to ascertain the underlying root cause of a problem. Hybrid, or grey-box, diagnostic approaches leverage the strengths of both white-box and black-box approaches.

There are two distinct problems that we pursued. First, we sought to support problem localization (what we call fingerpointing) online, in an automated manner, even as the system under diagnosis is running. Second, we sought to address the problem of automated fingerpointing for Hadoop [1], an open-source implementation of the MapReduce programming paradigm [2] that supports long-running, parallelized, data-intensive computations over a large cluster of nodes. This chapter describes ASDF, a flexible, online framework for fingerpointing that addresses the two problems outlined above. ASDF has API support to plug in different time-varying data sources, and to plug in various analysis modules to process this data. Both the data collection and the data analyses can proceed concurrently, while the system under diagnosis is executing.

The data sources can be gathered in either a black-box or white-box manner, and can be diverse, coming from application logs, system-call traces, system logs, performance counters, etc. The analysis modules can be equally diverse, involving time-series analysis, machine learning, etc. We demonstrate how ASDF automatically fingerpoints some of the performance problems in Hadoop that are documented in Apache's JIRA issue tracker [3]. Manual fingerpointing does not scale in Hadoop environments because of the number of nodes and the number of performance metrics to be analyzed on each node. Our current implementation of ASDF for Hadoop automatically extracts time-varying white-box and black-box data sources on every node in a Hadoop cluster. ASDF then feeds these data sources into different analysis modules (which respectively perform clustering, peer-comparison or Hadoop-log analysis) to identify the culprit node(s) in real time. A unique aspect of our Hadoop-centric fingerpointing is our ability to infer Hadoop states (as we chose to define them) by parsing the logs that are natively auto-generated by Hadoop. We then leverage the information about the states and the time-varying state-transition sequence to localize performance problems.
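
To make the plug-in architecture concrete, here is a minimal sketch of the structure described above: data-source plug-ins emit time-varying samples, analysis-module plug-ins consume them, and a thin driver wires them together while the monitored system keeps running. The class names and the fake data source are assumptions for illustration and do not reflect ASDF's actual API.

```python
import random
import time

# Minimal sketch of a pluggable online fingerpointing loop: data-source
# plug-ins produce samples, analysis-module plug-ins consume them.
# The names and the fake /proc-style source are illustrative assumptions.

class DataSource:
    name = "base"
    def sample(self):
        raise NotImplementedError

class FakeCpuSource(DataSource):
    """Stand-in for a black-box source such as sysstat output."""
    name = "cpu"
    def sample(self):
        return {"host": "n1", "cpu_util": random.uniform(0.2, 0.9)}

class AnalysisModule:
    def consume(self, source_name, record):
        raise NotImplementedError

class ThresholdAlarm(AnalysisModule):
    """Trivial analysis module: alarm when a metric crosses a fixed bound."""
    def __init__(self, metric, bound):
        self.metric, self.bound = metric, bound
    def consume(self, source_name, record):
        value = record.get(self.metric)
        if value is not None and value > self.bound:
            print(f"[alarm] {source_name}: {self.metric}={value:.2f} on {record['host']}")

def run(sources, modules, iterations=5, period_s=0.1):
    """Online loop: poll every source, fan each record out to every module."""
    for _ in range(iterations):
        for src in sources:
            record = src.sample()
            for mod in modules:
                mod.consume(src.name, record)
        time.sleep(period_s)

run([FakeCpuSource()], [ThresholdAlarm("cpu_util", 0.8)])
```

In a real deployment the analysis modules would be the clustering, peer-comparison and log-analysis components mentioned above, but the fan-out structure stays the same.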

SmartMonit: Real-time Big Data Monitoring System

2019

Modern big data processing systems are becoming very complex in terms of large scale, high concurrency and multiple tenants. Thus, many failures and performance degradations only happen at run time and are very difficult to capture. Moreover, some issues may only be triggered when certain components are executed. To analyze the root cause of these types of issues, we have to capture the dependencies of each component in real time. In this paper, we propose SmartMonit, a real-time big data monitoring system that collects infrastructure information such as the process status of each task. At the same time, we develop a real-time stream processing framework to analyze the coordination among the tasks and the infrastructure. This coordination information is essential for troubleshooting the causes of failures and performance degradation, especially those propagated from other components.
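
A small sketch of the kind of correlation this abstract describes: folding a stream of per-task process-status events into a live view of task health, from which candidate root causes can be read off. The event fields and status values are assumptions, not SmartMonit's actual schema.

```python
# Rough sketch, under assumed event fields, of correlating task status
# events collected from the infrastructure into a live per-task view.

events = [
    {"ts": 1, "task": "map_3", "host": "n2", "pid": 4711, "status": "running"},
    {"ts": 2, "task": "map_5", "host": "n4", "pid": 5123, "status": "running"},
    {"ts": 3, "task": "map_3", "host": "n2", "pid": 4711, "status": "exited"},
]

def fold_stream(events):
    """Keep the latest status per task; anything not 'running' is a candidate
    root cause when downstream tasks later stall."""
    state = {}
    for e in sorted(events, key=lambda e: e["ts"]):
        state[e["task"]] = (e["host"], e["pid"], e["status"])
    return state

state = fold_stream(events)
suspects = {t: s for t, s in state.items() if s[2] != "running"}
print(suspects)   # {'map_3': ('n2', 4711, 'exited')}
```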

Ganesha: Black-Box Fault Diagnosis for MapReduce Systems (CMU-PDL-08-112)

2008

Ganesha aims to diagnose faults transparently in MapReduce systems, by analyzing OS-level metrics alone. Ganesha's approach is based on peer-symmetry under fault-free conditions, and can diagnose faults that manifest asymmetrically at nodes within a MapReduce system. While our training is performed on smaller Hadoop clusters and for specific workloads, our approach allows us to diagnose faults in larger Hadoop clusters and for unencountered workloads. We also candidly highlight faults that escape Ganesha's black-box diagnosis.
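
A hedged sketch of peer-symmetry over node "states": each node's metric samples are reduced to a coarse state profile, and the node whose profile deviates most from its peers is flagged. Ganesha learns its states from fault-free training data via clustering; here the clustering step is replaced by fixed bins purely for illustration, and the metrics, bins and distance are assumed.

```python
from collections import Counter

# Hedged sketch: approximate node "states" by coarse binning of (cpu, io_wait)
# samples, then flag the node whose state histogram is most asymmetric
# relative to its peers. Bins, metrics and distance are assumptions.

def to_state(sample):
    """Map a (cpu_util, io_wait) sample to a coarse state label."""
    cpu, iow = sample
    return ("hi-cpu" if cpu > 0.7 else "lo-cpu",
            "hi-io" if iow > 0.3 else "lo-io")

def state_histogram(samples):
    c = Counter(to_state(s) for s in samples)
    total = sum(c.values())
    return {k: v / total for k, v in c.items()}

def asymmetry(hist, peer_hists):
    """L1 distance between a node's state histogram and the average peer one."""
    keys = set(hist) | {k for h in peer_hists for k in h}
    return sum(abs(hist.get(k, 0.0) -
                   sum(h.get(k, 0.0) for h in peer_hists) / len(peer_hists))
               for k in keys)

node_samples = {
    "n1": [(0.5, 0.1), (0.6, 0.2), (0.55, 0.1)],
    "n2": [(0.5, 0.1), (0.58, 0.15), (0.52, 0.1)],
    "n3": [(0.95, 0.8), (0.9, 0.7), (0.92, 0.75)],   # asymmetric node
}
hists = {n: state_histogram(s) for n, s in node_samples.items()}
scores = {n: asymmetry(h, [hists[m] for m in hists if m != n])
          for n, h in hists.items()}
print(max(scores, key=scores.get))   # expected: n3
```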

Performance Diagnosis Using Bigtable Monitoring for Cloud Computing System

International Journal of Advanced Research in Computer Engineering & Technology (IJARCET), 2014

Replicated services are deployed on individual physical nodes in the cloud and can be composed into various types of services serving large numbers of user requests. Existing diagnosis techniques for such distributed systems cannot handle this setting effectively, and fine-grained performance anomaly diagnosis is still lacking; diagnosis in production cloud systems should also require as little human intervention as possible. CloudDiag systematically collects point-to-point tracing data from each physical node in the cloud and then employs a customized MapReduce algorithm to proactively analyze it: it assembles the tracing data of each user request and classifies the tracing data into different categories according to the call trees of the requests.
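
The categorization step can be illustrated as follows: bucket request traces by their call-tree signature and rank the categories by how much their end-to-end latencies vary. The published CloudDiag applies a more involved per-category analysis; this sketch, with assumed trace fields, only shows the grouping idea.

```python
from statistics import mean, pstdev

# Illustrative sketch of grouping request traces by call-tree signature and
# flagging high-variance categories. Trace fields are assumptions.

traces = [
    {"req": "r1", "call_tree": ("frontend", "auth", "storage"), "latency_ms": 42},
    {"req": "r2", "call_tree": ("frontend", "auth", "storage"), "latency_ms": 45},
    {"req": "r3", "call_tree": ("frontend", "auth", "storage"), "latency_ms": 260},
    {"req": "r4", "call_tree": ("frontend", "cache"), "latency_ms": 8},
    {"req": "r5", "call_tree": ("frontend", "cache"), "latency_ms": 9},
]

def rank_categories(traces):
    cats = {}
    for t in traces:
        cats.setdefault(t["call_tree"], []).append(t["latency_ms"])
    # Coefficient of variation per category: high values suggest a category
    # worth drilling into for anomalous requests.
    scored = [(pstdev(v) / mean(v), k, v) for k, v in cats.items() if len(v) > 1]
    return sorted(scored, reverse=True)

for cv, tree, lats in rank_categories(traces):
    print(f"{cv:.2f}  {' -> '.join(tree)}  {lats}")
```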