Ensembles for unsupervised outlier detection: challenges and research questions (a position paper)
Published: 17 March 2014
Abstract
Ensemble methods for unsupervised outlier detection are an emerging topic that has been neglected for a surprisingly long time, although there are reasons why building them is more difficult than building supervised ensembles or even clustering ensembles. Aggarwal recently discussed algorithmic patterns of outlier detection ensembles, identified traces of the idea in the literature, and remarked on promising as well as unlikely avenues for transferring concepts from supervised ensembles. Complementing his points, we focus here on the core ingredients for building an outlier ensemble, discuss the first steps taken in the literature, and identify challenges for future research.
Published In
ACM SIGKDD Explorations Newsletter Volume 15, Issue 1
June 2013
50 pages
Copyright © 2014 Authors.
Publisher
Association for Computing Machinery
New York, NY, United States
Qualifiers
- Column
Affiliations
Arthur Zimek
Ludwig-Maximilians-Universität, Munich, Germany
Ricardo J.G.B. Campello
University of São Paulo, São Carlos, Brazil
Jörg Sander
University of Alberta, Edmonton, AB, Canada