Martin Kopp - Academia.edu
Papers by Martin Kopp
The main objective of outlier detection is to find samples that deviate considerably from the majority. Such outliers, often referred to as anomalies, are increasingly important because they help uncover interesting events within data. Consequently, a considerable number of statistical and data mining techniques for identifying anomalies have been proposed in the last few years, but only a few works even mention why a sample was labelled as an anomaly. Therefore, we propose a method based on specifically trained decision trees, called a sapling random forest. Our method is able to interpret the output of an arbitrary anomaly detector. The explanation is given either as a subset of features in which the sample deviates most, or as conjunctions of atomic conditions, which can be viewed as antecedents of logical rules easily understood by humans. To simplify the investigation of suspicious samples even further, we propose two methods of clustering anomalies into groups. Such cluste...
Techniques are described herein for clustering network hosts based on their network behavior to create groups of hosts that behave similarly. An anomaly detection model trained on a single group of network hosts is more robust to fluctuations in the behavior of individual hosts than per-host models. Compared to group-all models, which are trained using the behavior of all network hosts, modelling diversely behaving network hosts separately may detect finer anomalies (e.g., stealthy data exfiltration) that would otherwise be hidden. DETAILED DESCRIPTION Network behavior anomaly detection (NBAD) systems are complementary to traditional network security systems based on deep packet inspection (DPI). In contrast to DPI, NBAD can detect new zero-day attacks (i.e., attacks without known signatures) and works even with encrypted traffic. As a result, NBAD adoption is growing steadily. NBAD systems detect threats by tracking various network characteristics in real time a...
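The trade-off between per-host, per-group, and all-host models can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the host names, the upload volumes, the tiny 1-D k-means used for grouping, and the z-score used as the anomaly detector. The description above does not name a specific clustering algorithm or detector.

```python
from statistics import mean, pstdev

def kmeans_1d(values, k=2, iters=20):
    """Tiny 1-D k-means (k >= 2) with deterministic initial centers."""
    lo, hi = min(values), max(values)
    centers = [lo + (hi - lo) * i / (k - 1) for i in range(k)]

    def nearest(v):
        return min(range(k), key=lambda c: abs(v - centers[c]))

    for _ in range(iters):
        labels = [nearest(v) for v in values]
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centers[c] = mean(members)
    return [nearest(v) for v in values]

def zscore(value, reference):
    """Stand-in anomaly score: distance from the reference mean in std units."""
    return (value - mean(reference)) / pstdev(reference)

# Hypothetical typical daily upload volume per host, in MB.
history = {
    "ws1": 10, "ws2": 12, "ws3": 9, "ws4": 11,
    "ws5": 13, "ws6": 10, "ws7": 8, "ws8": 12,
    "srv1": 900, "srv2": 1000, "srv3": 1100, "srv4": 950, "srv5": 1050,
}
names = list(history)
values = [history[n] for n in names]

# Group hosts that behave similarly.
labels = kmeans_1d(values)
group_of = dict(zip(names, labels))

def group_zscore(host, observed):
    """Score an observation against the host's own behavioral group."""
    peers = [history[n] for n in names if group_of[n] == group_of[host]]
    return zscore(observed, peers)

observed = 60  # workstation ws3 suddenly uploads 60 MB
global_z = zscore(observed, values)       # model trained on all hosts
group_z = group_zscore("ws3", observed)   # model trained on ws3's group

print(f"all-host model z = {global_z:.2f}, group model z = {group_z:.2f}")
```

In this toy data the 60 MB upload sits well inside the spread of the all-host model, because the servers dominate the variance, while the group model scores it as a strong deviation from the other workstations.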
The main objective of anomaly or outlier detection algorithms is to find samples deviating from the majority. Although a vast number of algorithms designed for this purpose already exist, almost none of them explain why a particular sample was labelled as an anomaly (outlier). To address this issue, we propose an algorithm called Explainer, which returns an explanation of a sample's differentness in disjunctive normal form (DNF), which is easy for humans to understand. Since Explainer treats anomaly detection algorithms as black boxes, it can be applied in many domains to simplify the investigation of anomalies. The core of Explainer is a set of specifically trained trees, which we call sapling random forests. Since their training is fast and memory efficient, the whole algorithm is lightweight and applicable to large databases, data streams, and real-time problems. The correctness of Explainer is demonstrated on a wide range of synthetic and real-world datasets.
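The idea of reading a DNF explanation off a set of small trees can be sketched as follows. This is not the authors' implementation: the greedy path construction below is a simplified stand-in for tree training, and the function names, the sampling scheme, the parameters, and the toy data are all assumptions chosen for illustration. Each small "tree" separates one anomaly from a random sample of normal points, the path to the anomaly gives one conjunction, and the conjunctions from all trees together form the DNF.

```python
import random

def isolate(anomaly, sample, max_depth=3):
    """Greedily grow one path that separates `anomaly` from `sample`.

    Returns a conjunction as a list of (feature_index, op, threshold).
    """
    remaining = list(sample)
    rule = []
    for _ in range(max_depth):
        if not remaining:
            break
        best = None  # (normal points removed, feature, op, threshold)
        for j, value in enumerate(anomaly):
            below = [p[j] for p in remaining if p[j] < value]
            above = [p[j] for p in remaining if p[j] > value]
            candidates = []
            if below:  # "x_j > t" cuts away all points below the anomaly
                candidates.append((len(below), j, ">", (max(below) + value) / 2))
            if above:  # "x_j <= t" cuts away all points above the anomaly
                candidates.append((len(above), j, "<=", (min(above) + value) / 2))
            for cand in candidates:
                if best is None or cand[0] > best[0]:
                    best = cand
        if best is None:
            break  # the anomaly cannot be told apart from what remains
        _, j, op, t = best
        rule.append((j, op, t))
        if op == ">":
            remaining = [p for p in remaining if p[j] > t]
        else:
            remaining = [p for p in remaining if p[j] <= t]
    return rule

def holds(point, conjunction):
    """True if `point` satisfies every atomic condition of the conjunction."""
    return all((point[j] > t) if op == ">" else (point[j] <= t)
               for j, op, t in conjunction)

def explain(anomaly, normals, n_trees=5, sample_size=60, seed=0):
    """Collect one conjunction per small tree; the list is read as a DNF."""
    rng = random.Random(seed)
    dnf = []
    for _ in range(n_trees):
        sample = rng.sample(normals, min(sample_size, len(normals)))
        rule = isolate(anomaly, sample)
        if rule and rule not in dnf:
            dnf.append(rule)
    return dnf

def satisfies(point, dnf):
    return any(holds(point, conj) for conj in dnf)

def format_dnf(dnf):
    return " OR ".join(
        "(" + " AND ".join(f"x{j} {op} {t:.2f}" for j, op, t in conj) + ")"
        for conj in dnf
    )

# Toy data: normal points on a grid in [0, 0.9]^2, one anomaly far away in x1.
normals = [(i % 10 / 10, i // 10 / 10) for i in range(100)]
anomaly = (0.45, 5.0)

dnf = explain(anomaly, normals)
print(format_dnf(dnf))
```

Note that the anomaly detector never appears in the sketch: only its verdict (which point is the anomaly) is used, which mirrors the black-box treatment described above.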
Expert Systems with Applications, 2020
Anomaly detection has become an important topic in many domains, and many different solutions have been proposed. Despite that, only a few anomaly detection methods try to explain how a sample differs from the rest. This work contributes to filling this gap, because knowing why a sample is considered anomalous is critical in many application domains. The proposed solution uses a specific type of random forest to extract rules explaining the difference, which are then filtered and presented to the user as a set of classification rules sharing the same consequent, or as the equivalent rule with an antecedent in disjunctive normal form. The quality of the solution is documented by comparison with state-of-the-art algorithms on 34 real-world datasets.
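The equivalence between the two presentations mentioned in this abstract is plain propositional logic. As a sketch, let $A_1, \dots, A_k$ be rule antecedents, each a conjunction of atomic conditions $c_{i,j}$ (these symbols are illustrative and not taken from the paper):

```latex
\bigwedge_{i=1}^{k} \bigl( A_i \Rightarrow \text{anomaly} \bigr)
\;\equiv\;
\Bigl( \bigvee_{i=1}^{k} A_i \Bigr) \Rightarrow \text{anomaly},
\qquad
A_i = \bigwedge_{j=1}^{m_i} c_{i,j}
```

The left-hand side is the set of classification rules sharing the same consequent; the right-hand side is the single rule whose antecedent is in disjunctive normal form.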