Feature bagging for outlier detection | Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining

Published: 21 August 2005

Abstract

Outlier detection has recently become an important problem in many industrial and financial applications. In this paper, a novel feature bagging approach for detecting outliers in very large, high-dimensional and noisy databases is proposed. It combines results from multiple outlier detection algorithms that are applied using different sets of features. Every outlier detection algorithm uses a small subset of features that are randomly selected from the original feature set. As a result, each outlier detector identifies different outliers and assigns to every data record an outlier score that corresponds to its probability of being an outlier. The outlier scores computed by the individual outlier detection algorithms are then combined in order to find better-quality outliers. Experiments performed on several synthetic and real-life data sets show that the proposed methods for combining outputs from multiple outlier detection algorithms provide non-trivial improvements over the base algorithm.
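
The procedure described in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the abstract does not name the base detector, the subset-size rule, or the combination rule, so the k-th-nearest-neighbour distance score, the subset size range, and the simple averaging used here are all assumptions, as are the function names `knn_distance_scores` and `feature_bagging_scores`.

```python
import numpy as np


def knn_distance_scores(X, k=5):
    """Stand-in base detector (assumption): score = distance to the k-th nearest neighbour."""
    # Pairwise Euclidean distances between all records, shape (n, n).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # After sorting each row, column 0 is the record's distance to itself (0),
    # so column k holds the distance to its k-th nearest neighbour.
    return np.sort(dist, axis=1)[:, k]


def feature_bagging_scores(X, n_rounds=10, k=5, seed=0):
    """Average base-detector scores over randomly selected feature subsets."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    total = np.zeros(n)
    for _ in range(n_rounds):
        # Assumed subset-size rule: between d // 2 and d - 1 features
        # (the upper bound of rng.integers is exclusive).
        size = rng.integers(max(1, d // 2), max(2, d))
        features = rng.choice(d, size=size, replace=False)
        # Run the base detector on the selected features only.
        total += knn_distance_scores(X[:, features], k=k)
    # Assumed combination rule: plain average of the per-round scores.
    return total / n_rounds
```

Calling `feature_bagging_scores(X)` on an `(n, d)` array returns one combined score per record; higher scores mark records that look more outlying across the sampled feature subsets. Averaging raw scores is only reasonable here because every round uses the same base detector; scores from different detectors would need normalising first.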

Published In

KDD '05: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining

August 2005

844 pages

Copyright © 2005 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. bagging
  2. detection rate
  3. false alarm
  4. feature subsets
  5. integration
  6. outlier detection

Qualifiers

Conference

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Affiliations

Aleksandar Lazarevic

University of Minnesota, East Hartford, CT

Vipin Kumar

University of Minnesota, Minneapolis, MN