Real-time anomaly detection in logs using rule mining and complex event processing at scale
Related papers
ADELE: Anomaly Detection from Event Log Empiricism
2018
A large population of users is affected by the sudden slowdown or shutdown of an enterprise application. System administrators and analysts spend a considerable amount of time dealing with functional and performance bugs. These problems are particularly hard to detect and diagnose in most computer systems, since there is a huge amount of system-generated supportability data (counters, logs, etc.) that needs to be analyzed. Most often, there isn't a very clear or obvious root cause. Timely identification of significant changes in application behavior is very important to prevent negative impact on the service. In this paper, we present ADELE, an empirical, data-driven methodology for early detection of anomalies in data storage systems. The key feature of our solution is the diligent selection of features from system logs and the development of effective machine learning techniques for anomaly prediction. ADELE learns from the system's own history to establish the baseline of normal behavior an...
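The learn-a-baseline-from-history idea described above can be illustrated with a minimal sketch. This is a hypothetical example, not ADELE's actual feature set or model: it fits per-feature mean and standard deviation from historical windows and flags features in a new window that deviate beyond a z-score threshold.

```python
# Hypothetical sketch: learn a per-feature baseline from a system's own
# history, then flag features of a new window that deviate strongly.
# Feature names, values, and the z-score threshold are illustrative.
import statistics

def fit_baseline(history):
    """history: list of {feature_name: value} dicts, one per time window."""
    baseline = {}
    for feat in history[0]:
        values = [w[feat] for w in history]
        baseline[feat] = (statistics.mean(values), statistics.pstdev(values))
    return baseline

def anomalous_features(window, baseline, z_threshold=3.0):
    """Return feature names whose value deviates beyond z_threshold."""
    flagged = []
    for feat, value in window.items():
        mean, std = baseline[feat]
        if std > 0 and abs(value - mean) / std > z_threshold:
            flagged.append(feat)
    return flagged

history = [{"error_count": 5, "latency_ms": 100},
           {"error_count": 6, "latency_ms": 110},
           {"error_count": 4, "latency_ms": 95},
           {"error_count": 5, "latency_ms": 105}]
base = fit_baseline(history)
print(anomalous_features({"error_count": 50, "latency_ms": 102}, base))
```

A real system would use richer features and rolling retraining, but the core step — comparing current behavior against a baseline estimated from the system's own past — is the same.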
Anomaly Detection in the Cloud: Detecting Security Incidents via Machine Learning
Communications in Computer and Information Science, 2013
Cloud computing is now on the verge of being embraced as a serious usage model. However, while outsourcing services and workflows into the cloud provides indisputable benefits in terms of cost flexibility and scalability, there is little advance in security (which can influence reliability), transparency, and incident handling. The problem of applying the existing security tools in the cloud is twofold. First, these tools do not consider the specific attacks and challenges of cloud environments, e.g., cross-VM side-channel attacks. Second, these tools focus on attacks and threats at only one layer of abstraction, e.g., the network, the service, or the workflow layers. Thus, the semantic gap between events and alerts at different layers is still an open issue. The aim of this paper is to present ongoing work towards a Monitoring-as-a-Service anomaly detection framework in a hybrid or public cloud. The goal of our framework is twofold. First, it closes the gap between incidents at different layers of cloud-sourced workflows; namely, we focus on both the workflow and the infrastructure layers. Second, our framework tackles challenges stemming from cloud usage, like multi-tenancy. Our framework uses complex event processing rules and machine learning to populate user-specified metrics that can be used to assess the security status of the monitored system.
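A complex event processing rule of the kind this abstract mentions can be sketched in a few lines. The rule language of the actual framework is not given here; the following is a generic, assumed event shape (dicts with "time", "layer", and "type" fields) and a simple sliding-window threshold rule.

```python
# Illustrative CEP-style rule: fire when `threshold` events of a given type
# arrive within `window` seconds. Event fields and thresholds are assumptions,
# not the paper's actual rule syntax.
from collections import deque

class WindowRule:
    def __init__(self, event_type, threshold, window):
        self.event_type = event_type
        self.threshold = threshold
        self.window = window
        self.times = deque()  # timestamps of matching events, in order

    def feed(self, event):
        """Process one event; return True if the rule fires."""
        if event["type"] != self.event_type:
            return False
        t = event["time"]
        self.times.append(t)
        # drop events that fell out of the sliding window
        while self.times and t - self.times[0] > self.window:
            self.times.popleft()
        return len(self.times) >= self.threshold

rule = WindowRule("failed_login", threshold=3, window=60)
alerts = [rule.feed({"time": t, "layer": "service", "type": "failed_login"})
          for t in (0, 20, 40, 200)]
print(alerts)
```

Three failed logins within 60 seconds trigger the rule; the fourth event, arriving much later, does not, because the window has emptied.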
Hybrid anomaly detection and prioritization for network logs at cloud scale
Proceedings of the Seventeenth European Conference on Computer Systems, 2022
Monitoring the health of large-scale systems requires significant manual effort, usually through the continuous curation of alerting rules based on keywords, thresholds and regular expressions, which might generate a flood of mostly irrelevant alerts and obscure the actual information operators would like to see. Existing approaches try to improve the observability of systems by intelligently detecting anomalous situations. Such solutions surface anomalies that are statistically significant, but may not represent events that reliability engineers consider relevant. We propose ADEPTUS, a practical approach for detection of relevant health issues in an established system. ADEPTUS combines statistics and unsupervised learning to detect anomalies with supervised learning and heuristics to determine which of the detected anomalies are likely to be relevant to the Site Reliability Engineers (SREs). ADEPTUS overcomes the labor-intensive prerequisite of obtaining anomaly labels for supervised learning by automatically extracting information from historic alerts and incident tickets. We leverage ADEPTUS for observability in the network infrastructure of IBM Cloud. We perform an extensive real-world evaluation on 10 months of logs generated by tens of thousands of network devices across 11 data centers and demonstrate that ADEPTUS achieves higher alerting accuracy than the rule-based log alerting solution, curated by domain experts, used by SREs daily.
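The two-stage structure described above — unsupervised statistical detection followed by a supervised relevance decision — can be sketched as follows. The relevance stage here is a trivial lookup table standing in for the trained model the paper derives from historic alerts and tickets; series values, template names, and thresholds are all illustrative.

```python
# Hedged two-stage sketch: statistical anomaly detection, then a relevance
# filter. The "model" dict is a toy stand-in for a supervised classifier.
import statistics

def statistical_anomalies(series, z_threshold=3.0):
    """Stage 1: indices whose value deviates strongly from the series mean."""
    mean, std = statistics.mean(series), statistics.pstdev(series)
    if std == 0:
        return []
    return [i for i, v in enumerate(series) if abs(v - mean) / std > z_threshold]

def relevant_anomalies(series, templates, relevance_model, z_threshold=3.0):
    """Stage 2: keep only anomalies whose log template is marked actionable."""
    return [i for i in statistical_anomalies(series, z_threshold)
            if relevance_model.get(templates[i], False)]

# Toy relevance model (the paper learns this from alerts/incident tickets).
model = {"link_down": True, "config_reload": False}
series = [2, 3, 2, 2, 3, 2, 40, 3, 50, 2]
templates = ["noise"] * 10
templates[6], templates[8] = "config_reload", "link_down"
print(relevant_anomalies(series, templates, model, z_threshold=1.5))
```

Both spikes (indices 6 and 8) are statistically anomalous, but only the one whose template the relevance stage considers actionable survives — the point of the second stage.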
Anomaly Detection for Data Streams in Large-Scale Distributed Heterogeneous Computing Environments
12th International Conference on Cyber Warfare and Security, ICCWS 2017., 2017
Counteracting cyber threats to ensure a secure cyberspace faces great challenges as cyber-attacks are increasingly stealthy and sophisticated, and the protected cyber domains exhibit rapidly growing complexity and scale. It is important to design big data-driven cyber security solutions that effectively and efficiently derive actionable intelligence from available heterogeneous sources of information using principled data analytic methods to defend against cyber threats. In this work, we present a scalable distributed framework to collect and process extreme-scale networking and computing system traffic and status data from multiple sources that collectively represent the system under study, and develop and apply real-time adaptive data analytics for anomaly detection to monitor, understand, maintain, and improve cybersecurity. The data analytics will integrate multiple sophisticated machine learning algorithms with human-in-the-loop feedback for iterative ensemble learning. Given the volume, speed, and complex nature of the data gathered, plus the need for real-time data analytics, a scalable data processing framework needs to handle big data with low latency. Our proposed big-data analytics will be implemented using an Apache Spark computing cluster. The analytics developed will offer significant improvements over existing methods of anomaly detection in real time. Our preliminary evaluation studies have shown that the developed techniques achieve improved capabilities for defending against cyber threats.
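One simple reading of "integrating multiple algorithms for ensemble learning" is a majority vote over several independent detectors. The sketch below is a hedged illustration of that idea only; the paper's actual algorithms and Spark implementation are not detailed in the abstract, and all detectors, data, and thresholds here are invented for the example.

```python
# Hypothetical ensemble: three simple detectors vote on each point; a point
# is flagged when at least `quorum` detectors agree. All parameters are
# illustrative, not the paper's.
import statistics

def zscore_flag(series, i, z=1.5):
    mean, std = statistics.mean(series), statistics.pstdev(series)
    return std > 0 and abs(series[i] - mean) / std > z

def range_flag(series, i, lo=0, hi=100):
    return not (lo <= series[i] <= hi)

def jump_flag(series, i, factor=5.0):
    return i > 0 and series[i - 1] > 0 and series[i] / series[i - 1] > factor

def ensemble_flag(series, i, detectors, quorum=2):
    return sum(d(series, i) for d in detectors) >= quorum

traffic = [10, 12, 11, 300, 12]
flags = [ensemble_flag(traffic, i, [zscore_flag, range_flag, jump_flag])
         for i in range(len(traffic))]
print(flags)
```

A human-in-the-loop step would then adjust detector weights or thresholds based on analyst feedback; that iterative part is omitted here.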
Anomaly Detection in a Large-Scale Cloud Platform
2021 IEEE/ACM 43rd International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), 2021
Cloud computing is ubiquitous: more and more companies are moving their workloads into the Cloud. However, this rise in popularity challenges Cloud service providers, as they need to monitor the quality of their ever-growing offerings effectively. To address the challenge, we designed and implemented an automated monitoring system for the IBM Cloud Platform. This monitoring system utilizes deep learning neural networks to detect anomalies in near-real-time in multiple Platform components simultaneously. After running the system for a year, we observed that the proposed solution frees the DevOps team from manually monitoring thousands of Cloud components. Moreover, it increases customer satisfaction by reducing the risk of Cloud outages. In this paper, we share our solution's architecture, implementation notes, and best practices that emerged while evolving the monitoring system. They can be leveraged by other researchers and practitioners to build anomaly detectors for complex systems.
SAQL: A Stream-based Query System for Real-Time Abnormal System Behavior Detection
ArXiv, 2018
Recently, advanced cyber attacks, which consist of a sequence of steps involving many vulnerabilities and hosts, have compromised the security of many well-protected businesses. This has led to solutions that ubiquitously monitor system activities on each host (big data) as a series of events and search for anomalies (abnormal behaviors) to triage risky events. Since fighting these attacks is a time-critical mission to prevent further damage, these solutions face challenges in incorporating expert knowledge to perform timely anomaly detection over large-scale provenance data. To address these challenges, we propose a novel stream-based query system that takes as input a real-time event feed aggregated from multiple hosts in an enterprise, and provides an anomaly query engine that queries the event feed to identify abnormal behaviors based on the specified anomalies. To facilitate the task of expressing anomalies based on expert knowledge, our system provides a doma...
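The core operation of such a query engine — filter a real-time event feed, aggregate per window, emit windows that violate a condition — can be sketched generically. This is not SAQL syntax (which is not shown in the truncated abstract), just an assumed event shape of (timestamp, host, event name) tuples.

```python
# Generic sketch of a stream query: per-host counts of matching events in
# fixed time windows, emitting (window_start, host, count) when a condition
# holds. Event shape and thresholds are assumptions for illustration.
from collections import Counter

def window_query(events, predicate, window_size, condition):
    """events: iterable of (timestamp, host, name) tuples in time order."""
    hits = []
    window = Counter()
    window_start = None
    for ts, host, name in events:
        if window_start is None:
            window_start = ts
        if ts - window_start >= window_size:
            # close the current window and check the condition per host
            for h, n in window.items():
                if condition(n):
                    hits.append((window_start, h, n))
            window = Counter()
            window_start = ts
        if predicate(name):
            window[host] += 1
    for h, n in window.items():  # flush the final window
        if condition(n):
            hits.append((window_start, h, n))
    return hits

feed = [(0, "hostA", "proc_start"), (1, "hostA", "file_write"),
        (2, "hostA", "file_write"), (3, "hostB", "file_write"),
        (12, "hostA", "file_write")]
print(window_query(feed, lambda n: n == "file_write", 10, lambda c: c >= 2))
```

A real stream query language would express the predicate, window, and condition declaratively rather than as Python callables, but the evaluation shape is similar.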
Detection of Anomaly using Machine Learning: A Comprehensive Survey
International Journal of Emerging Technology and Advanced Engineering
Anomaly detection is an important element in the domain of security. As a result, we undertook a literature review on ML algorithms that identify abnormalities. In this paper, we present a review of 101 research articles describing ML techniques for anomaly detection published between 2015 and 2022. The goal of this paper is to review research papers that have used machine learning to develop anomaly detection algorithms. The forms of anomaly detection examined in this study include system log anomaly detection, network anomaly detection, cloud-based anomaly detection, and anomaly detection in the medical profession. After assessing the selected research articles, we present more than 10 applications of anomaly detection. We also share a range of datasets used in anomaly detection research, in addition to revealing 30+ new ML models employed in anomaly detection. We have discovered 55 new datasets for anomaly detection. We've noticed that the majority of researchers ...
Anomaly Detection in Log Records
Indonesian Journal of Electrical Engineering and Computer Science
In recent times, complex software systems continuously generate application and server logs for the events that have occurred. These generated logs can be utilized for anomaly and intrusion detection. Log files can be used for detecting certain types of abnormalities or exceptions, such as spikes in HTTP requests or the number of exceptions raised in logs. Events recorded in the log files are generally used for future anomaly prediction and analysis. The proposed prototype for anomaly detection assumes that the log records are uploaded as input in the standard Apache log format. Next, a prototype is developed to get the number of HTTP requests for outlier detection. Then, anomalies in the number of HTTP requests are detected using three techniques: the Interquartile Range (IQR) method, moving averages, and Median Absolute Deviation (MAD). Once the outliers are detected, they are removed from the current dataset. This output is given as input to a Multilayer Perceptron model to predict the number of HTTP requests at the next timestamp. This paper presents a web-based model to automate the process of anomaly detection in log files.
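Two of the three outlier tests named above are straightforward to sketch on a series of per-interval HTTP request counts. The cutoffs used here (1.5×IQR, 3×MAD) are the common textbook defaults, not necessarily the paper's exact parameters, and the sample counts are invented.

```python
# Sketch of IQR and MAD outlier detection on HTTP request counts.
# Cutoff multipliers are the usual defaults, assumed for illustration.
import statistics

def iqr_outliers(counts):
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = statistics.quantiles(counts, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [c for c in counts if c < lo or c > hi]

def mad_outliers(counts, k=3.0):
    """Flag values more than k median-absolute-deviations from the median."""
    med = statistics.median(counts)
    mad = statistics.median(abs(c - med) for c in counts)
    return [c for c in counts if mad > 0 and abs(c - med) > k * mad]

requests = [120, 130, 125, 118, 900, 122, 127]
print(iqr_outliers(requests))
print(mad_outliers(requests))
```

Both tests flag the spike of 900 requests; the cleaned series (with outliers removed) would then feed the forecasting model, as the abstract describes.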
Utilizing Persistence for Post Facto Suppression of Invalid Anomalies Using System Logs
2022 IEEE/ACM 44th International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER)
The robustness and availability of cloud services are becoming increasingly important as more applications migrate to the cloud. The operations landscape today is more complex than ever. Site reliability engineers (SREs) are expected to handle more incidents than ever before, with shorter service-level agreements (SLAs). By exploiting log, tracing, metric, and network data, Artificial Intelligence for IT Operations (AIOps) enables detection of faults and anomalous issues in services. A wide variety of anomaly detection techniques have been incorporated in various AIOps platforms (e.g., PCA and autoencoders), but they all suffer from false positives. In this paper, we propose an unsupervised approach for persistent anomaly detection on top of traditional anomaly detection approaches, with the goal of reducing false positives and providing more trustworthy alerting signals. We test our method on both simulated and real-world datasets. Our technique reduces false positive anomalies by at least 28%, resulting in more reliable and trustworthy notifications.
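The persistence idea — suppress anomalies unless they persist across consecutive windows — admits a minimal sketch. The run-length threshold below is illustrative, not the paper's tuned value, and the upstream detector is abstracted away as a list of per-window booleans.

```python
# Minimal persistence filter: keep only anomalies belonging to runs of at
# least `min_run` consecutive flagged windows. `min_run` is an assumption.
def persistent_anomalies(flags, min_run=3):
    """flags: per-window booleans from any upstream anomaly detector.
    Returns indices that belong to runs of >= min_run consecutive Trues."""
    kept, run = [], []
    for i, f in enumerate(flags + [False]):  # sentinel flushes the last run
        if f:
            run.append(i)
        else:
            if len(run) >= min_run:
                kept.extend(run)
            run = []
    return kept

raw = [False, True, False, True, True, True, False, True]
print(persistent_anomalies(raw))  # isolated spikes at indices 1 and 7 are suppressed
```

Layered after any point detector (PCA, autoencoder, thresholds), this post-facto filter discards one-off spikes while preserving sustained anomalies, which is the false-positive reduction the abstract describes.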
Log Analyzer and Anomaly Reporting Framework for Proactive System Monitoring
International Journal for Research in Applied Science & Engineering Technology (IJRASET), 2024
In an era where organizations increasingly rely on intricate software applications, cloud services, and interconnected networks, the significance of log analyzers cannot be overstated. These tools serve as vigilant custodians of digital footprints, meticulously dissecting the voluminous records encapsulated in log files to extract valuable insights and detect anomalies. As such, log analyzers emerge as linchpins in deciphering the meaning behind recorded events, enabling organizations to obtain a more comprehensive understanding of their digital infrastructures and enhance their security posture. This comprehensive exploration aims to untangle the core attributes of log analyzers, bringing clarity to their parsing capabilities, the art of information extraction, and the nuanced algorithms that facilitate the conversion of raw logs into actionable insights. Furthermore, against the backdrop of a dynamically evolving cyber threat landscape, the role of log analyzers extends beyond conventional diagnostics. These tools have become instrumental in proactively orchestrated defense against cyber adversaries, empowering organizations to detect and mitigate threats in real time. Through an in-depth analysis of log analyzers and their evolving functionalities, this paper seeks to provide a comprehensive understanding of their integral role in modern cybersecurity and system management. By elucidating the significance and impact of log analyzers, organizations can leverage these tools to fortify their defenses, mitigate risks, and make informed, data-driven decisions in an increasingly complex digital environment.