Feature Over-Selection
Abstract
We propose a probabilistic framework for analyzing the inaccuracies that arise in feature selection (FS) when flawed estimates of feature-subset performance are used. The approach is based on an analysis of a random-search FS procedure and on the postulate that the joint distribution of the true and estimated classification errors is known a priori. We derive expected values for the FS bias, defined as the difference between the actual classification error after FS and the classification error that would be obtained if ideal FS were performed according to exact estimates. The increase in true classification error due to inaccurate FS is comparable to, or even exceeds, the training bias, defined as the difference between the generalization error and the Bayes error. We show that an overfitting phenomenon exists in feature selection, which we call feature over-selection. The effects of feature over-selection can be reduced if FS is performed on the basis of positional statistics. The theoretical results are supported by experiments on simulated Gaussian data as well as on high-dimensional microarray gene expression data.
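The FS bias described in the abstract can be illustrated with a small Monte Carlo sketch. This is not the paper's derivation or experimental setup. It assumes a hypothetical joint distribution in which the true errors of randomly drawn feature subsets are Gaussian (mean 0.2, standard deviation 0.03, clipped to [0, 0.5]) and each estimated error equals the true error plus independent Gaussian noise. The function name `simulate_fs_bias` and all numeric parameters are illustrative assumptions.

```python
import random
import statistics


def simulate_fs_bias(n_subsets, noise_sd, n_trials=2000, seed=0):
    """Monte Carlo sketch of random-search FS driven by noisy error estimates.

    Assumptions (illustrative, not taken from the paper):
      * true errors of candidate subsets ~ N(0.2, 0.03^2), clipped to [0, 0.5];
      * estimated error = true error + N(0, noise_sd^2) noise.

    Returns (mean FS bias, mean estimated error of the selected subset).
    """
    rng = random.Random(seed)
    biases = []
    apparent = []
    for _ in range(n_trials):
        # Draw the true errors of n_subsets randomly chosen feature subsets.
        true_err = [min(0.5, max(0.0, rng.gauss(0.2, 0.03)))
                    for _ in range(n_subsets)]
        # Flawed estimates that the selection procedure actually sees.
        est_err = [e + rng.gauss(0.0, noise_sd) for e in true_err]
        # Select the subset with the smallest ESTIMATED error.
        chosen = min(range(n_subsets), key=est_err.__getitem__)
        # Ideal FS would pick the subset with the smallest TRUE error.
        ideal = min(true_err)
        biases.append(true_err[chosen] - ideal)
        apparent.append(est_err[chosen])
    return statistics.mean(biases), statistics.mean(apparent)


if __name__ == "__main__":
    for sd in (0.0, 0.01, 0.05):
        bias, est = simulate_fs_bias(n_subsets=50, noise_sd=sd)
        print(f"noise_sd={sd:.2f}  mean FS bias={bias:.4f}  "
              f"mean estimated error of selected subset={est:.4f}")
```

Under these assumptions the mean FS bias is zero when the estimates are exact and grows with the noise level, while the estimated error of the selected subset looks optimistically low. The sketch only illustrates the definition of FS bias. It does not reproduce the paper's analytical results or the positional-statistics remedy.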
Author information
Authors and Affiliations
- Vilnius Gediminas Technical University, Sauletekio 11, Vilnius, LT-10223, Lithuania
Sarunas Raudys
Editor information
Editors and Affiliations
- Hong Kong University of Science and Technology
Dit-Yan Yeung
- Department of Computer Science, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong, China
James T. Kwok
- Instituto de Telecomunicações, Instituto Superior Técnico, Lisbon, Portugal
Ana Fred
- Department of Electrical and Electronic Engineering, University of Cagliari, Piazza d’Armi, 09123, Cagliari, Italy
Fabio Roli
- Faculty of Electrical Engineering, Mathematics and Computer Science, Information and Communication Theory Group, Delft University of Technology, Delft, The Netherlands
Dick de Ridder
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Raudys, S. (2006). Feature Over-Selection. In: Yeung, DY., Kwok, J.T., Fred, A., Roli, F., de Ridder, D. (eds) Structural, Syntactic, and Statistical Pattern Recognition. SSPR/SPR 2006. Lecture Notes in Computer Science, vol 4109. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11815921_68
- DOI: https://doi.org/10.1007/11815921_68
- Publisher Name: Springer, Berlin, Heidelberg
- Print ISBN: 978-3-540-37236-3
- Online ISBN: 978-3-540-37241-7
- eBook Packages: Computer Science, Computer Science (R0), Springer Nature Proceedings Computer Science