Ensembles for unsupervised outlier detection: challenges and research questions (a position paper)

Published: 17 March 2014

Abstract

Ensembles for unsupervised outlier detection is an emerging topic that has been neglected for a surprisingly long time (although there are reasons why this is more difficult than supervised ensembles or even clustering ensembles). Aggarwal recently discussed algorithmic patterns of outlier detection ensembles, identified traces of the idea in the literature, and remarked on potential as well as unlikely avenues for future transfer of concepts from supervised ensembles. Complementary to his points, here we focus on the core ingredients for building an outlier ensemble, discuss the first steps taken in the literature, and identify challenges for future research.

Published In

ACM SIGKDD Explorations Newsletter Volume 15, Issue 1

June 2013

50 pages

Copyright © 2014 Authors.

Publisher

Association for Computing Machinery

New York, NY, United States

Affiliations

Arthur Zimek

Ludwig-Maximilians-Universität, Munich, Germany

Ricardo J.G.B. Campello

University of São Paulo, São Carlos, Brazil

Jörg Sander

University of Alberta, Edmonton, AB, Canada