CLUEBOX: A Performance Log Analyzer for Automated Troubleshooting

ASDF: An Automated, Online Framework for Diagnosing Performance Problems

Lecture Notes in Computer Science, 2010

Performance problems account for a significant percentage of documented failures in large-scale distributed systems such as Hadoop. Localizing the source of these performance problems can be frustrating due to the overwhelming amount of monitoring information available. We automate problem localization using ASDF, an online diagnostic framework that transparently monitors and analyzes different time-varying data sources (e.g., OS performance counters, Hadoop logs) and narrows down performance problems to a specific node or a set of nodes. ASDF's flexible architecture allows system administrators to easily customize data sources and analysis modules for their unique operating environments. We demonstrate the effectiveness of ASDF's diagnostics on documented performance problems in Hadoop; our results indicate that ASDF incurs an average monitoring overhead of 0.38% of CPU time and achieves a balanced accuracy of 80% at localizing problems to the culprit node.
Problem-diagnosis techniques tend to gather data about the system and/or the application to develop a priori templates of normal, problem-free system behavior; the techniques then detect performance problems by looking for anomalies in runtime data as compared to these templates. Typically, such analysis techniques are run offline and post-process the data gathered from the system. The data used to develop the models and to perform the diagnosis can be collected in different ways. A white-box diagnostic approach extracts application-level data directly and requires instrumenting the application and possibly understanding its internal structure or semantics. A black-box diagnostic approach aims to infer application behavior by extracting data transparently from the operating system or network, without needing to instrument the application or to understand its internal structure or semantics. Obviously, it might not be scalable (in effort, time, and cost), or even possible, to employ a white-box approach in production environments that contain many third-party services, applications, and users. A black-box approach also has its drawbacks: while such an approach can infer application behavior to some extent, it might not always be able to pinpoint the root cause of a performance problem. Typically, a black-box approach is more effective at problem localization, while a white-box approach extracts more information to ascertain the underlying root cause of a problem. Hybrid, or grey-box, diagnostic approaches leverage the strengths of both white-box and black-box approaches.
There are two distinct problems that we pursued. First, we sought to support problem localization (what we call fingerpointing) online, in an automated manner, even as the system under diagnosis is running. Second, we sought to address the problem of automated fingerpointing for Hadoop [1], an open-source implementation of the MapReduce programming paradigm [2] that supports long-running, parallelized, data-intensive computations over a large cluster of nodes. This chapter describes ASDF, a flexible, online framework for fingerpointing that addresses the two problems outlined above. ASDF has API support to plug in different time-varying data sources, and to plug in various analysis modules to process this data. Both the data collection and the data analyses can proceed concurrently, while the system under diagnosis is executing.
The data sources can be gathered in either a black-box or white-box manner, and can be diverse, coming from application logs, system-call traces, system logs, performance counters, etc. The analysis modules can be equally diverse, involving time-series analysis, machine learning, etc. We demonstrate how ASDF automatically fingerpoints some of the performance problems in Hadoop that are documented in Apache's JIRA issue tracker [3]. Manual fingerpointing does not scale in Hadoop environments because of the number of nodes and the number of performance metrics to be analyzed on each node. Our current implementation of ASDF for Hadoop automatically extracts time-varying white-box and black-box data sources on every node in a Hadoop cluster. ASDF then feeds these data sources into different analysis modules (which respectively perform clustering, peer-comparison, or Hadoop-log analysis) to identify the culprit node(s) in real time. A unique aspect of our Hadoop-centric fingerpointing is our ability to infer Hadoop states (as we chose to define them) by parsing the logs that are natively auto-generated by Hadoop. We then leverage information about these states and the time-varying state-transition sequence to localize performance problems.
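
The entry does not reproduce ASDF's analysis code. As a rough illustration of the peer-comparison idea it describes (the node names, metric values, and threshold below are made up, and this is a sketch of the general technique rather than ASDF's actual module), a per-metric peer comparison across nodes might look like:

```python
# Illustrative sketch (not ASDF's code): flag the node whose black-box
# metric deviates most from its peers, using a median-based comparison
# over a window of samples.
from statistics import median

def flag_culprit(node_metrics, threshold=3.0):
    """node_metrics: {node_id: [samples of one metric over a window]}.
    Returns the node whose window mean deviates from the peer median
    by more than `threshold` * MAD, or None if all nodes agree."""
    means = {node: sum(v) / len(v) for node, v in node_metrics.items()}
    peer_median = median(means.values())
    # Median absolute deviation across peers (robust to a single bad node).
    mad = median(abs(m - peer_median) for m in means.values()) or 1e-9
    suspects = {n: abs(m - peer_median) / mad for n, m in means.items()}
    worst, score = max(suspects.items(), key=lambda kv: kv[1])
    return worst if score > threshold else None

# Example: one node shows inflated disk-wait times (values are invented).
window = {
    "slave1": [2.1, 2.3, 2.0, 2.2],
    "slave2": [2.2, 2.1, 2.4, 2.0],
    "slave3": [9.8, 10.2, 9.5, 10.0],
    "slave4": [2.0, 2.2, 2.1, 2.3],
}
print(flag_culprit(window))  # -> "slave3"
```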

DIADS: a problem diagnosis tool for databases and storage area networks

Proceedings of the …, 2009

Many enterprise environments have databases running on network-attached storage infrastructure (referred to as Storage Area Networks or SANs). Both the database and the SAN are complex subsystems that are managed by separate teams of administrators. As often as not, database administrators have limited understanding of SAN configuration and behavior, and limited visibility into the SAN's run-time performance; and vice versa for the SAN administrators. Diagnosing the cause of performance problems is a challenging exercise in these environments. We propose to remedy the situation through a novel tool, called Diads, for database and SAN problem diagnosis. This demonstration proposal summarizes the technical innovations in Diads: (i) a powerful abstraction called Annotated Plan Graphs (APGs) that ties together the execution path of queries in the database and the SAN using low-overhead monitoring data, and (ii) a diagnosis workflow that combines domain-specific knowledge with machine-learning techniques. The scenarios presented in the demonstration are also described.
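
The Annotated Plan Graph abstraction is only described at a high level in this summary. A minimal sketch of the idea, with hypothetical node names, layers, and statistics (this is an illustration, not Diads's actual data model or workflow), could look like:

```python
# Hypothetical sketch of an APG-style structure: query-plan operators
# linked to the SAN components they depend on, each node annotated with
# low-overhead monitoring statistics.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str           # e.g. "TableScan(orders)" or "vol-17" (made-up names)
    layer: str          # "plan", "database", or "san"
    stats: dict = field(default_factory=dict)   # monitoring annotations

@dataclass
class APG:
    nodes: dict = field(default_factory=dict)   # name -> Node
    edges: list = field(default_factory=list)   # (from_name, to_name)

    def add(self, node):
        self.nodes[node.name] = node

    def link(self, src, dst):
        self.edges.append((src, dst))

    def suspects(self, latency_key="avg_latency_ms", limit=50.0):
        """Naive diagnosis step: SAN components on the query's execution
        path whose annotated latency exceeds a limit."""
        return [n.name for n in self.nodes.values()
                if n.layer == "san" and n.stats.get(latency_key, 0) > limit]

# Example (names and numbers are illustrative only):
g = APG()
g.add(Node("TableScan(orders)", "plan", {"rows": 1_200_000}))
g.add(Node("vol-17", "san", {"avg_latency_ms": 85.0}))
g.link("TableScan(orders)", "vol-17")
print(g.suspects())  # -> ["vol-17"]
```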

Using Comprehensive Analysis for Performance Debugging in Distributed Storage Systems

24th IEEE Conference on Mass Storage Systems and Technologies (MSST 2007), 2007

Achieving performance, reliability, and scalability presents a unique set of challenges for large distributed storage. To identify problem areas, there must be a way for developers to have a comprehensive view of the entire storage system. That is, users must be able to understand both node-specific behavior and complex relationships between nodes. We present a distributed file system profiling method that supports such analysis. Our approach is based on combining node-specific metrics into a single cohesive system image. This affords users two views of the storage system: a micro, per-node view, as well as a macro, multi-node view, allowing both node-specific and complex inter-nodal problems to be debugged. We visualize the storage system by displaying nodes and intuitively animating their metrics and behavior, allowing easy analysis of complex problems.
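
As a toy illustration of combining node-specific metrics into a single system image with both micro (per-node) and macro (cluster-wide) views, the metric names and values below are invented and the aggregation is deliberately simplistic:

```python
# Combine per-node metric dictionaries into cluster-wide averages (macro
# view), while the original per-node dictionaries remain the micro view.
def macro_view(per_node_metrics):
    """per_node_metrics: {node: {metric_name: value}} -> cluster averages."""
    totals, counts = {}, {}
    for metrics in per_node_metrics.values():
        for name, value in metrics.items():
            totals[name] = totals.get(name, 0.0) + value
            counts[name] = counts.get(name, 0) + 1
    return {name: totals[name] / counts[name] for name in totals}

cluster = {
    "osd1": {"disk_util": 0.42, "net_tx_MBps": 31.0},
    "osd2": {"disk_util": 0.95, "net_tx_MBps": 30.5},  # possible hot spot
}
print(macro_view(cluster))            # macro: cluster-wide averages
print(cluster["osd2"]["disk_util"])   # micro: drill into a single node
```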

I/O System Performance Debugging Using Model-driven Anomaly Characterization

2005

It is challenging to identify performance problems and pinpoint their root causes in complex systems, especially when the system supports wide ranges of workloads and when performance problems only materialize under particular workload conditions. This paper proposes a model-driven anomaly characterization approach and uses it to discover operating system performance bugs when supporting disk I/O-intensive online servers. We construct a whole-system I/O throughput model as the reference of expected performance and we use statistical clustering and characterization of performance anomalies to guide debugging. Unlike previous performance debugging methods offering detailed statistics at specific execution settings, our approach focuses on comprehensive anomaly characterization over wide ranges of workload conditions and system configurations.
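
The paper's whole-system I/O throughput model is not reproduced here; the sketch below only illustrates the general pattern of comparing measurements against a reference model across many workload settings and flagging large deviations for characterization (the toy model, thresholds, and numbers are all invented):

```python
# Toy illustration (not the paper's model): compare measured I/O throughput
# against a reference model across workload settings and flag settings whose
# relative shortfall exceeds a bound; the flagged settings are then grouped
# and characterized to guide debugging.
def find_anomalies(settings, model, measured, rel_err=0.25):
    """settings: list of dicts describing workload configurations;
    model(setting) -> expected MB/s; measured: observed MB/s per setting."""
    anomalies = []
    for setting, observed in zip(settings, measured):
        expected = model(setting)
        if expected > 0 and (expected - observed) / expected > rel_err:
            anomalies.append(setting)
    return anomalies

# Hypothetical reference model: sequential bandwidth scaled down by seek
# overhead as the number of concurrent streams grows.
def simple_model(s):
    return 80.0 / (1 + 0.1 * s["streams"])

workloads = [{"streams": n} for n in (1, 2, 4, 8, 16)]
observed  = [72.0, 66.0, 30.0, 18.0, 12.0]   # made-up measurements
print(find_anomalies(workloads, simple_model, observed))
# Anomalies cluster at >= 4 concurrent streams, hinting where to debug.
```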

Black-box problem diagnosis in parallel file systems

Proceedings of the 8th …, 2010

We focus on automatically diagnosing different performance problems in parallel file systems by identifying, gathering and analyzing OS-level, black-box performance metrics on every node in the cluster. Our peer-comparison diagnosis approach compares the statistical attributes of these metrics across I/O servers to identify the faulty node. We develop a root-cause analysis procedure that further analyzes the affected metrics to pinpoint the faulty resource (storage or network), and demonstrate that this approach works commonly across stripe-based parallel file systems. We demonstrate our approach for realistic storage and network problems injected into three different file-system benchmarks (dd, IOzone, and PostMark), in both PVFS and Lustre clusters.
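
As an illustration of the root-cause step described above (not the authors' implementation; the metric names and the 0.5 divergence threshold are assumptions), once a peer comparison has flagged a suspect I/O server, one could vote on whether its divergent metrics are storage- or network-related:

```python
# Sketch: label the faulty resource on an already-identified suspect server
# by checking which class of black-box metrics diverges from its peers.
from statistics import mean, median

STORAGE_METRICS = {"await_ms", "disk_util"}   # assumed metric names
NETWORK_METRICS = {"rx_MBps", "tx_MBps"}

def faulty_resource(per_server, suspect):
    """per_server: {server: {metric: [samples]}}. Returns 'storage',
    'network', or 'unknown' for the suspect server."""
    votes = {"storage": 0, "network": 0}
    for metric in per_server[suspect]:
        peer_vals = [mean(m[metric]) for srv, m in per_server.items()
                     if srv != suspect and metric in m]
        if not peer_vals:
            continue
        peer = median(peer_vals)
        dev = abs(mean(per_server[suspect][metric]) - peer) / (peer or 1e-9)
        if dev > 0.5:                      # metric clearly diverges from peers
            if metric in STORAGE_METRICS:
                votes["storage"] += 1
            elif metric in NETWORK_METRICS:
                votes["network"] += 1
    best = max(votes, key=votes.get)
    return best if votes[best] else "unknown"

# Example with invented samples: ios2 has a slow disk but healthy network.
servers = {
    "ios1": {"await_ms": [5, 6, 5],    "rx_MBps": [90, 92, 91]},
    "ios2": {"await_ms": [48, 52, 50], "rx_MBps": [89, 90, 91]},
    "ios3": {"await_ms": [6, 5, 6],    "rx_MBps": [92, 90, 89]},
}
print(faulty_resource(servers, "ios2"))   # -> "storage"
```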

Log2: A Cost-Aware Logging Mechanism for Performance Diagnosis

2015

Logging has been a common practice for monitoring and diagnosing performance issues. However, logging comes at a cost, especially for large-scale online service systems. First, the overhead incurred by intensive logging is non-negligible. Second, it is costly to diagnose a performance issue if there is a tremendous amount of redundant logs. Therefore, we believe that it is important to limit the overhead incurred by logging, without sacrificing the logging effectiveness. In this paper, we propose Log2, a cost-aware logging mechanism. Given a "budget" (defined as the maximum volume of logs allowed to be output in a time interval), Log2 makes the "whether to log" decision through a two-phase filtering mechanism. In the first phase, a large number of irrelevant logs are discarded efficiently. In the second phase, useful logs are cached and output while complying with the logging budget. In this way, Log2 keeps the useful logs and discards the less useful ones. We have i...
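
Log2's actual filtering logic is not shown in this abstract; the sketch below only illustrates the general two-phase, budgeted idea it describes (the class name, relevance predicate, scoring function, and thresholds are invented):

```python
# Toy sketch of a cost-aware, two-phase logging filter in the spirit of the
# described mechanism (not Log2's implementation): phase 1 cheaply drops
# irrelevant entries; phase 2 buffers candidates and emits at most `budget`
# entries per interval, preferring higher-scoring ones.
import time

class BudgetedLogger:
    def __init__(self, budget, interval_s=60, is_relevant=None, score=None):
        self.budget = budget                  # max entries emitted per interval
        self.interval_s = interval_s
        self.is_relevant = is_relevant or (lambda e: True)  # phase-1 predicate
        self.score = score or (lambda e: 0)                 # phase-2 usefulness
        self.buffer = []
        self.window_start = time.monotonic()

    def log(self, entry):
        if not self.is_relevant(entry):       # phase 1: cheap discard
            return
        self.buffer.append(entry)             # phase 2: cache for later
        if time.monotonic() - self.window_start >= self.interval_s:
            self.flush()

    def flush(self):
        keep = sorted(self.buffer, key=self.score, reverse=True)[: self.budget]
        for entry in keep:
            print(entry)                       # stand-in for the real log sink
        self.buffer.clear()
        self.window_start = time.monotonic()

# Example: only WARN/ERROR lines are relevant; slower operations score higher.
logger = BudgetedLogger(
    budget=2, interval_s=60,
    is_relevant=lambda e: e["level"] in ("WARN", "ERROR"),
    score=lambda e: e["latency_ms"],
)
for e in [{"level": "INFO",  "latency_ms": 3},
          {"level": "WARN",  "latency_ms": 120},
          {"level": "ERROR", "latency_ms": 800},
          {"level": "WARN",  "latency_ms": 40}]:
    logger.log(e)
logger.flush()   # emits only the two highest-latency WARN/ERROR entries
```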

An automated, cross-layer instrumentation framework for diagnosing performance problems in distributed applications

Proceedings of the ACM Symposium on Cloud Computing, 2019

Diagnosing performance problems in distributed applications is extremely challenging. A significant reason is that it is hard to know where to place instrumentation a priori to help diagnose problems that may occur in the future. We present the vision of an automated instrumentation framework, Pythia, that runs alongside deployed distributed applications. In response to a newly-observed performance problem, Pythia searches the space of possible instrumentation choices to enable the instrumentation needed to help diagnose it. Our vision for Pythia builds on workflow-centric tracing, which records the order and timing of how requests are processed within and among a distributed application's nodes (i.e., records their workflows). It uses the key insight that localizing the sources of high performance variation within the workflows of requests that are expected to perform similarly gives insight into where additional instrumentation is needed.
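
As a toy rendering of that insight (not Pythia's implementation; the workflow signatures, latencies, and ranking metric are fabricated), one could group requests that are expected to perform similarly and rank the groups by latency variation to decide where to enable further instrumentation:

```python
# Sketch: rank workflow groups by coefficient of variation of latency; the
# most variable groups are candidates for additional instrumentation.
from collections import defaultdict
from statistics import mean, pstdev

def rank_instrumentation_candidates(requests, top_k=3):
    """requests: list of (workflow_signature, latency_ms). Returns the
    workflow groups with the highest relative latency variation."""
    groups = defaultdict(list)
    for signature, latency in requests:
        groups[signature].append(latency)
    scored = []
    for signature, latencies in groups.items():
        if len(latencies) < 2:
            continue
        cv = pstdev(latencies) / (mean(latencies) or 1e-9)
        scored.append((cv, signature))
    return [sig for cv, sig in sorted(scored, reverse=True)[:top_k]]

# Example with made-up workflow signatures and latencies:
reqs = [("get->cache->reply", 5), ("get->cache->reply", 6),
        ("get->db->reply", 40), ("get->db->reply", 400),   # high variation
        ("put->db->reply", 30), ("put->db->reply", 33)]
print(rank_instrumentation_candidates(reqs))
# "get->db->reply" ranks first: enable finer-grained tracing on that path.
```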

Performance troubleshooting in data centers

ACM SIGOPS Operating Systems Review, 2013


NetLogger: A Toolkit for Distributed System Performance Analysis

2000

Developers and users of high-performance distributed systems often observe performance problems such as unexpectedly low throughput or high latency. Determining the source of the performance problems requires detailed end-to-end instrumentation of all components, including the applications, operating systems, hosts, and networks. In this paper we describe a methodology that enables the real-time diagnosis of performance problems in complex high-performance distributed systems. The methodology includes tools for generating timestamped event logs that can be used to provide detailed end-to-end application- and system-level monitoring, and tools for visualizing the log data and real-time state of the distributed system. This methodology, called NetLogger, has proven invaluable for diagnosing problems in networks and in distributed systems code. This approach is novel in that it combines network, host, and application-level monitoring, providing a complete view of the entire system. NetLogger is designed to be extremely lightweight, and includes a mechanism for reliably collecting monitoring events from multiple distributed locations. This technical report summarizes the most important points of several previous papers on NetLogger, and is meant to be used as a general overview.
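
The sketch below shows the flavor of the timestamped, end-to-end event records such a methodology relies on; the field names, JSON encoding, and helper function are illustrative only, not NetLogger's actual format or API:

```python
# Generic sketch of a timestamped event record: bracket each significant
# operation with start/end events so per-stage latencies can later be
# reconstructed end to end across hosts.
import json
import socket
import time

def log_event(event, **fields):
    record = {
        "ts": time.time(),              # timestamp for cross-host correlation
        "host": socket.gethostname(),
        "event": event,                 # e.g. "app.read.start" / "app.read.end"
        **fields,
    }
    print(json.dumps(record))           # stand-in for shipping to a collector
    return record

log_event("app.read.start", block=42)
# ... perform the read ...
log_event("app.read.end", block=42, bytes=65536)
```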