Feature Over-Selection

Abstract

We propose a probabilistic framework for analyzing the inaccuracies that arise in feature selection (FS) when imperfect estimates of feature-subset performance are used. The approach is based on an analysis of a random search FS procedure and on the assumption that the joint distribution of the true and estimated classification errors is known a priori. We derive expected values for the FS bias, defined as the difference between the actual classification error after FS and the classification error that would be obtained if ideal FS were performed using exact estimates. The increase in true classification error due to inaccurate FS is comparable to, or even exceeds, the training bias, defined as the difference between the generalization error and the Bayes error. We show that an overfitting phenomenon exists in feature selection, which we call feature over-selection. The effects of feature over-selection can be reduced if FS is performed on the basis of positional statistics. The theoretical results are supported by experiments on simulated Gaussian data as well as on high-dimensional microarray gene expression data.
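The FS bias defined in the abstract can be illustrated with a small Monte Carlo sketch. This is not the paper's derivation: it assumes, purely for illustration, that the true errors of the candidate subsets are independent Gaussian draws and that each estimate equals the true error plus independent Gaussian noise. The function name `simulate_fs_bias` and all parameter values are hypothetical.

```python
import numpy as np

def simulate_fs_bias(n_subsets=200, n_trials=2000, true_mean=0.20,
                     true_sd=0.02, noise_sd=0.04, seed=0):
    """Toy random-search feature selection with noisy error estimates.

    Assumed model (not taken from the paper): true errors are i.i.d.
    Gaussian, and estimated error = true error + independent Gaussian noise.
    """
    rng = np.random.default_rng(seed)
    # True classification error of every randomly drawn feature subset.
    true_err = rng.normal(true_mean, true_sd, size=(n_trials, n_subsets))
    # Imperfect estimate of each subset's error.
    est_err = true_err + rng.normal(0.0, noise_sd, size=(n_trials, n_subsets))

    # FS picks the subset whose *estimated* error is smallest.
    picked = est_err.argmin(axis=1)
    rows = np.arange(n_trials)

    apparent = est_err[rows, picked].mean()   # estimated error of the winner
    actual = true_err[rows, picked].mean()    # true error of the winner
    ideal = true_err.min(axis=1).mean()       # true error under ideal FS
    return {"apparent": apparent, "actual": actual,
            "ideal": ideal, "fs_bias": actual - ideal}

if __name__ == "__main__":
    for name, value in simulate_fs_bias().items():
        print(f"{name}: {value:.4f}")
```

Under these assumptions the apparent error of the selected subset is optimistic, while its true error lies above the error of the ideally selected subset; the gap between the last two is the FS bias in this toy setting.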



Author information

Authors and Affiliations

  1. Vilnius Gediminas Technical University, Sauletekio 11, Vilnius, LT-10223, Lithuania
    Sarunas Raudys

Editor information

Editors and Affiliations

  1. Hong Kong University of Science and Technology
    Dit-Yan Yeung
  2. Department of Computer Science, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong, China
    James T. Kwok
  3. Instituto de Telecomunicações, Instituto Superior Técnico, Lisbon, Portugal
    Ana Fred
  4. Department of Electrical and Electronic Engineering, University of Cagliari, Piazza d’Armi, 09123, Cagliari, Italy
    Fabio Roli
  5. Faculty of Electrical Engineering, Mathematics and Computer Science, Information and Communication Theory Group, Delft University of Technology, Delft, The Netherlands
    Dick de Ridder

Rights and permissions

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Raudys, S. (2006). Feature Over-Selection. In: Yeung, DY., Kwok, J.T., Fred, A., Roli, F., de Ridder, D. (eds) Structural, Syntactic, and Statistical Pattern Recognition. SSPR /SPR 2006. Lecture Notes in Computer Science, vol 4109. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11815921_68
