Suchi Vora - Academia.edu
Papers by Suchi Vora
2017 Computing Conference, 2017
Feature selection has been routinely used as a preprocessing step to remove irrelevant features and conquer the "curse of dimensionality". In contrast to dimensionality reduction techniques such as PCA, the features produced by feature selection are drawn from the original feature space and are therefore easy to interpret. A large number of feature selection algorithms have been proposed in the literature, which raises a critical question: which algorithm should one use? Moreover, how does a feature selection method affect the performance of a given classification algorithm? This paper addresses these questions by (1) presenting an open-source software system that integrates eleven feature selection algorithms and five common classifiers, and (2) systematically comparing and evaluating the selected features and their impact on these five classifiers across five datasets. Specifically, the system includes ten commonly adopted filter-based feature selection algorithms: Chi-square, Information Gain, Fisher Score, Gini Index, Kruskal-Wallis, Laplacian Score, ReliefF, FCBF, CFS, and mRMR. It also includes one state-of-the-art embedded approach built upon Random Forests. The five classifiers are SVM, Random Forests, Naïve Bayes, kNN, and C4.5 Decision Tree. Comprehensive evaluations comprising around 1000 experiments were conducted over five text datasets. Several approximately equivalent groups (AEGs), in which the algorithms of the same group select highly similar features, have been identified; for example, Chi-square and Information Gain form an AEG. Suitable feature-selection/classifier combinations have also been identified: Gini Index or Kruskal-Wallis paired with SVM often yields classification performance comparable to or better than using all the original features. These results provide empirical guidelines for the data analytics community.
The above software system is available at https://www.dropbox.com/sh/ryw23s52e98uhrv/AAANpc0JU4X6r3Sfv4qB5ERna?dl=0
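To make the filter-based approach concrete, here is a minimal, self-contained sketch (not taken from the authors' released system) of the Chi-square filter the paper evaluates: each binary term-presence feature is scored against binary class labels via a 2x2 contingency table, and the top-k features are kept. The data and function names below are illustrative assumptions only.

```python
def chi2_score(feature, labels):
    """Chi-square statistic for a binary feature vs. binary class labels,
    computed from the 2x2 contingency table counts."""
    a = sum(1 for f, y in zip(feature, labels) if f and y)          # present, positive
    b = sum(1 for f, y in zip(feature, labels) if f and not y)      # present, negative
    c = sum(1 for f, y in zip(feature, labels) if not f and y)      # absent, positive
    d = sum(1 for f, y in zip(feature, labels) if not f and not y)  # absent, negative
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def select_top_k(X, y, k):
    """Rank features (columns of X) by chi-square score; return top-k indices."""
    scores = [chi2_score([row[j] for row in X], y) for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]

# Toy document-term presence matrix: 4 documents, 3 terms, binary labels.
X = [[1, 0, 1],
     [1, 0, 0],
     [0, 1, 1],
     [0, 1, 0]]
y = [1, 1, 0, 0]
print(select_top_k(X, y, 2))  # terms 0 and 1 perfectly separate the classes
```

The selected indices would then feed a classifier (e.g. an SVM restricted to those columns), which is the kind of feature-selection/classifier pairing the paper's experiments compare.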