Anomaly Detection Using Program Control Flow Graph Mining From Execution Logs (original) (raw)

Mining Console Logs for Large-Scale System Problem Detection

Proceedings of the Third Conference on Tackling Computer Systems Problems With Machine Learning Techniques, 2008

The console logs generated by an application contain messages that the application developers believed would be useful in debugging or monitoring the application. Despite the ubiquity and large size of these logs, they are rarely exploited in a systematic way for monitoring and debugging because they are not readily machine-parsable. In this paper, we propose a novel method for mining this rich source of information. First, we combine log parsing and text mining with source code analysis to extract structure from the console logs. Second, we extract features from the structured information in order to detect anomalous patterns in the logs using Principal Component Analysis (PCA). Finally, we use a decision tree to distill the results of PCA-based anomaly detection to a format readily understandable by domain experts (e.g. system operators) who need not be familiar with the anomaly detection algorithms. As a case study, we distill over one million lines of console logs from the Hadoop file system to a simple decision tree that a domain expert can readily understand; the process requires no operator intervention and we detect a large portion of runtime anomalies that are commonly overlooked.

Anomaly Detection via Mining Numerical Workflow Relations from Logs

2020

Complex software intensive systems, especially distributed systems, generate logs for troubleshooting. The logs are text messages recording system events, which can help engineers determine the system's runtime status. This paper proposes a novel approach named ADR (stands for Anomaly Detection by workflow Relations) that employs matrix nullspace to mine numerical relations from log data. The mined relations can be used for both offline and online anomaly detection and facilitate fault diagnosis. We have evaluated ADR on log data collected from two distributed systems, HDFS (Hadoop Distributed File System) and BGL (IBM Blue Gene/L supercomputers system). ADR successfully mined 87 and 669 numerical relations from the logs and used them to detect anomalies with high precision and recall. For online anomaly detection, ADR employs PSO (Particle Swarm Optimization) to find the optimal sliding windows' size and achieves fast anomaly detection.The experimental results confirm that ...

LogGD: Detecting Anomalies from System Logs with Graph Neural Networks

2022 IEEE 22nd International Conference on Software Quality, Reliability and Security (QRS), 2022

Log analysis is one of the main techniques engineers use to troubleshoot faults of large-scale software systems. During the past decades, many log analysis approaches have been proposed to detect system anomalies reflected by logs. They usually take log event counts or sequential log events as inputs and utilize machine learning algorithms including deep learning models to detect system anomalies. These anomalies are often identified as violations of quantitative relational patterns or sequential patterns of log events in log sequences. However, existing methods fail to leverage the spatial structural relationships among log events, resulting in potential false alarms and unstable performance. In this study, we propose a novel graph-based log anomaly detection method, LogGD, to effectively address the issue by transforming log sequences into graphs. We exploit the powerful capability of Graph Transformer Neural Network, which combines graph structure and node semantics for logbased anomaly detection. We evaluate the proposed method on four widely-used public log datasets. Experimental results show that LogGD can outperform state-of-the-art quantitative-based and sequence-based methods and achieve stable performance under different window size settings. The results confirm that LogGD is effective in log-based anomaly detection.

Log summarization and anomaly detection for troubleshooting distributed systems

2007 8th IEEE/ACM International Conference on Grid Computing, 2007

Today's system monitoring tools are capable of detecting system failures such as host failures, OS errors, and network partitions in near-real time. Unfortunately, the same cannot yet be said of the end-to-end distributed software stack. Any given action, for example, reliably transferring a directory of files, can involve a wide range of complex and interrelated actions across multiple pieces of software: checking user certificates and permissions, getting details for all files, performing third-party transfers, understanding re-try policy decisions, etc. We present an infrastructure for troubleshooting complex middleware, a general purpose technique for configurable log summarization, and an anomaly detection technique that works in near-real time on running Grid middleware. We present results gathered using this infrastructure from instrumented Grid middleware and applications running on the Emulab testbed. From these results, we analyze the effectiveness of several algorithms at accurately detecting a variety of performance anomalies.

Tractable flow analysis for anomaly detection in distributed programs

1993

Each process in a distributed program or design can be modelled as a process flow graph, where nodes represent program statements and directed edges represent control flows. This paper describes a flow analysis method to detect unreachable statements by examining the control flows and communication patterns in a collection of process flow graphs. The method can analyse programs with loops, non-deterministic structures and synchronous communication using an algorithm with a quadratic complexity in terms of program size. The method follows an approach described by Reif and Smolka [9] but delivers a more accurate result in assessing the reachability of statements. The higher accuracy is achieved using three techniques: statement dependency, history sets and statement re-reachability. The method is illustrated by a pump control application for a mining environment. A prototype has been implemented and its performance is presented.

Utilizing Persistence for Post Facto Suppression of Invalid Anomalies Using System Logs

2022 IEEE/ACM 44th International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER)

The robustness and availability of cloud services are becoming increasingly important as more applications migrate to the cloud. The operations landscape today is more complex, than ever. Site reliability engineers (SREs) are expected to handle more incidents than ever before with shorter service-level agreements (SLAs). By exploiting log, tracing, metric, and network data, Artificial Intelligence for IT Operations (AIOps) enables detection of faults and anomalous issues of services. A wide variety of anomaly detection techniques have been incorporated in various AIOps platforms (e.g. PCA and autoencoder), but they all suffer from false positives. In this paper, we propose an unsupervised approach for persistent anomaly detection on top of the traditional anomaly detection approaches, with the goal of reducing false positives and providing more trustworthy alerting signals. We test our method on both simulated and real-world datasets. Our technique reduces false positive anomalies by at least 28%, resulting in more reliable and trustworthy notifications. CCS CONCEPTS • Computing methodologies → Anomaly detection; • Software and its engineering → Maintaining software.

Real-time anomaly detection in logs using rule mining and complex event processing at scale

2019

Log data, produced from every computer system and program, are widely used as source of valuable information to monitor and understand their behavior and their health. However, as large-scale systems generate a massive amount of log data every minute, it is impossible to detect the cause of system failure by examining manually this huge size of data. Thus, there is a need for an automated tool for finding system's failure with little or none human effort. Nowadays lots of methods exist that try to detect anomalies on system's logs by analyzing and applying various algorithms such as machine learning algorithms. However, experts argue that a system error can not be found by looking into a single event, but in multiple log event data are necessary to understand the root cause of a problem. In this thesis work, we aim to detect patterns in sequential distributed system's logs that can capture effectively the abnormal behavior. Specifically as a first step, we will apply rule mining techniques to extract rules that represent an anomalous behavior, which potentially in the future may lead to a failure of a system. Except for that step, we implemented a real-time anomaly detection framework to detect problems before they actually occur. Processing log data as streams is the only way to achieve a real-time detection concept. In that direction we will process streaming log data using a complex event processing technique. Specifically, we would like to combine rule mining algorithms with complex event processing engine to raise alerts on abnormal log data based on automatically generated patterns. The evaluation of the work is conducted on Hadoop's logs, a widely used system in the industry. The outcome of this thesis project gives really promising results, reaching a Recall of 98% in detecting anomalies. Finally, a scalable anomaly detection framework was build by integrating different systems into the cloud. The motivation behind this is the direct application of our framework to a real-life use case.

LogMaster: Mining Event Correlations in Logs of Large-Scale Cluster Systems

2012 IEEE 31st Symposium on Reliable Distributed Systems, 2012

This paper presents a methodology and a system, named LogMaster, for mining correlations of events that have multiple attributions, i.e., node ID, application ID, event type, and event severity, in logs of large-scale cluster systems. Different from traditional transactional data, e.g., supermarket purchases, system logs have their unique characteristic, and hence we propose several innovative approaches to mine their correlations. We present a simple metrics to measure correlations of events that may happen interleavedly. On the basis of the measurement of correlations, we propose two approaches to mine event correlations; meanwhile, we propose an innovative abstractionevent correlation graphs (ECGs) to represent event correlations, and present an ECGs-based algorithm for predicting events. For two system logs of a production Hadoop-based cloud computing system at Research Institution of China Mobile and a production HPC cluster system at Los Alamos National Lab (LANL), we evaluate our approaches in three scenarios: (a) predicting all events on the basis of both failure and non-failure events; (b) predicting only failure events on the basis of both failure and non-failure events; (c) predicting failure events after removing non-failure events.