Pablo Bermejo - Academia.edu

Papers by Pablo Bermejo

Hipotiroidismo subclínico y riesgo cardiovascular (Subclinical hypothyroidism and cardiovascular risk)

Nutrición Hospitalaria: órgano oficial de la Sociedad Española de Nutrición Parenteral y Enteral

Objective: To assess whether subclinical hypothyroidism can behave as a cardiovascular risk factor or a modifier thereof, identifying epidemiological variables and estimating cardiovascular risk in a sample of patients diagnosed in the province of Albacete (Spain). Methodology: An observational, descriptive study was carried out in Albacete during the first half of January 2012 in patients of both genders with subclinical hypothyroidism. The following variables were analyzed: fasting glucose, total cholesterol, HDL cholesterol, LDL cholesterol, triglycerides, TSH, T4, weight, height, body mass index, blood pressure, history of cardiovascular disease, cardiovascular risk factors and estimated cardiovascular risk. Results: 326 patients, 78% younger than 65 years and 48.61% without cardiovascular risk factors, with female predominance (79.2%). The prevalence of cardiovascular risk factors was identified: smoking (33.2%), diabetes mellitus (24.9%), hypertension (23.4%), lipid...

WEKA package for algorithm IWSS (Incremental Wrapper Subset Selection)

Java source code of the algorithm proposed in my paper: Improving Incremental Wrapper-Based Subset Selection via Replacement and Early Stopping. Pablo Bermejo, José A. Gámez, José Miguel Puerta. International Journal of Pattern Recognition and Artificial Intelligence 01/2011; 25:605-625.

WEKA package developed for RerankingSearch algorithm

Java source code of the algorithm proposed in my papers: Fast wrapper feature subset selection in high-dimensional datasets by means of filter re-ranking. Pablo Bermejo, Luis de la Ossa, José A. Gámez, José Miguel Puerta. Knowledge-Based Systems 01/2012; 25:35-44; and Improving Incremental Wrapper-Based Feature Subset Selection by Using Re-ranking. Pablo Bermejo, José A. Gámez, José Miguel Puerta. Trends in Applied Intelligent Systems - 23rd International Conference on Industrial Engineering and Other Applications of Applied Intelligent Systems, IEA/AIE 2010, Cordoba, Spain, June 1-4, 2010, Proceedings, Part I; 01/2010.

balancing

E-mail foldering, or e-mail classification into user-predefined folders, can be viewed as a text classification/categorization problem. However, it has some intrinsic properties that make it more difficult to deal with, mainly the large cardinality of the class variable (i.e. the number of folders), the different number of e-mails per class state, and the fact that this is a dynamic problem, in the sense that e-mails arrive in our mail folders following a timeline. Perhaps because of these problems, standard text-oriented classifiers such as Naive Bayes Multinomial do not obtain good accuracy when applied to e-mail corpora. In this paper, we identify the imbalance among classes/folders as the main problem, and propose a new method based on learning and sampling probability distributions. Our experiments over a standard corpus (ENRON) with seven datasets (e-mail users) show that the results obtained by Naive Bayes Multinomial significantly improve when applying the balancing algorithm first. For the sake of completeness, in our experimental study we also compare this with another standard balancing method (SMOTE) and other classifiers.

A GRASP algorithm for fast hybrid high-dimensional datasets

Pattern Recognition Letters

Feature subset selection is a key problem in the data-mining classification task that helps to obtain more compact and understandable models without degrading (or even improving) their performance. In this work we focus on FSS in high-dimensional datasets, that is, with a very large number of predictive attributes. In this case, standard sophisticated wrapper algorithms cannot be applied because of their complexity, and computationally lighter filter-wrapper algorithms have recently been proposed. In this work we propose a stochastic algorithm based on the GRASP meta-heuristic, with the main goal of speeding up the feature subset selection process, basically by reducing the number of wrapper evaluations to carry out. GRASP is a multi-start constructive method which constructs a solution in its first stage, and then runs an improving stage over that solution. Several instances of the proposed GRASP method are experimentally tested and compared with state-of-the-art algorithms o...
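
The two GRASP stages described above (a randomized constructive stage followed by an improving stage, repeated over several starts) can be sketched roughly as follows. This is a generic illustration, not the paper's algorithm: the function `grasp_fss`, its parameters `n_starts`, `rcl_size` and `seed`, and the toy scoring function `toy_eval` are all hypothetical, and the sketch makes no attempt to reproduce the paper's savings in wrapper evaluations.

```python
import random
from typing import Callable, List, Sequence, Tuple


def grasp_fss(features: Sequence[str],
              wrapper_eval: Callable[[List[str]], float],
              n_starts: int = 5,
              rcl_size: int = 2,
              seed: int = 0) -> Tuple[List[str], float]:
    """Generic GRASP loop for feature subset selection (illustrative only)."""
    rng = random.Random(seed)
    best_subset: List[str] = []
    best_score = float("-inf")
    for _ in range(n_starts):
        # Stage 1: greedy randomized construction of one solution.
        subset: List[str] = []
        score = float("-inf")
        remaining = list(features)
        while remaining:
            scored = sorted(((wrapper_eval(subset + [f]), f) for f in remaining),
                            reverse=True)
            # Restricted candidate list: the top candidates that improve the score.
            rcl = [(s, f) for s, f in scored[:rcl_size] if s > score]
            if not rcl:
                break
            score, chosen = rng.choice(rcl)
            subset.append(chosen)
            remaining.remove(chosen)
        # Stage 2: improve the solution by dropping features while that helps.
        improved = True
        while improved and len(subset) > 1:
            improved = False
            for f in subset:
                trial = [g for g in subset if g != f]
                trial_score = wrapper_eval(trial)
                if trial_score > score:
                    subset, score, improved = trial, trial_score, True
                    break
        # Keep the best solution found over all starts.
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score


def toy_eval(subset: List[str]) -> float:
    """Hypothetical stand-in for a wrapper score: rewards 'a' and 'c', penalizes size."""
    return len(set(subset) & {"a", "c"}) - 0.01 * len(subset)


best_subset, best_score = grasp_fss(["a", "b", "c", "d"], toy_eval)
print(sorted(best_subset), round(best_score, 2))
```

In a real setting `wrapper_eval` would be the cross-validated accuracy of a classifier trained on the candidate subset, which is exactly the expensive call the paper aims to make less often.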

WEKA package for algorithm Distribution Based Balance

Java source code for the algorithm proposed in my paper: Improving the performance of Naive Bayes multinomial in e-mail foldering by introducing distribution-based balance of datasets. Pablo Bermejo, José A. Gámez, José Miguel Puerta. Expert Systems with Applications 01/2011; 38:2072-2080.

Fast wrapper feature subset selection in high-dimensional datasets by means of filter re-ranking

Comparison of Feature Construction Methods for Video Relevance Prediction

Multimedia Modeling, 2009

Low-level features of multimedia content often have limited power to discriminate a document's relevance to a query. This motivated researchers to investigate other types of features. In this paper, we investigated four groups of features: low-level object features, behavioural features, vocabulary features, and window-based vocabulary features, to predict the relevance of shots in video retrieval. Search logs...

Attribute Construction for E-Mail Foldering by Using Wrappered Forward Greedy Search

Development of Interpretable Predictive Models for BPH and Prostate Cancer

Clinical Medicine Insights: Oncology, 2015

Background: Traditional methods for deciding whether to recommend a patient for a prostate biopsy are based on cut-off levels of stand-alone markers such as prostate-specific antigen (PSA) or any of its derivatives. However, in the last decade we have seen the increasing use of predictive models that combine, in a non-linear manner, several predictors and are better able to predict prostate cancer (PC), but these fail to help the clinician to distinguish between PC and benign prostate hyperplasia (BPH) patients. We construct two new models that are capable of predicting both PC and BPH. Methods: An observational study was performed on 150 patients with PSA ≥ 3 ng/mL and age > 50 years. We built a decision tree and a logistic regression model, validated with the leave-one-out methodology, in order to predict PC or BPH, or reject both. Results: Statistical dependence with PC and BPH was found for prostate volume (P-value < 0.001), PSA (P-value < 0.001), international prostate symptom score (IPSS; P-value < 0.001), digital rectal examination (DRE; P-value < 0.001), age (P-value < 0.002), antecedents (P-value < 0.006), and meat consumption (P-value < 0.08). The two predictive models that were constructed selected a subset of these, namely, volume, PSA, DRE, and IPSS, obtaining an area under the ROC curve (AUC) between 72% and 80% for both PC and BPH prediction. Conclusion: PSA and volume together help to build predictive models that accurately distinguish among PC, BPH, and patients without any of these pathologies. Our decision tree and logistic regression models outperform the AUC obtained in the compared studies. Using these models as decision support, the number of unnecessary biopsies might be significantly reduced.

TESTOSTERONE PREDICTION FOR ERECTILE DYSFUNCTION AND HYPOGONADISM (ABSTRACT PUBLICATION)

British Journal of Surgery

Incremental wrapper-based subset selection with replacement: An advantageous alternative to sequential forward selection

This paper deals with the problem of wrapper-based feature subset selection in classification-oriented datasets with a (very) large number of attributes. In such datasets sophisticated search algorithms like beam search, branch and bound, best first, genetic algorithms, etc., become intractable in the wrapper approach due to the high number of wrapper evaluations to be carried out. One way to alleviate this problem is to use the so-called filter-wrapper approach or Incremental Wrapper-based Subset Selection (IWSS), which consists of constructing a ranking among the predictive attributes by using a filter measure, and then applying a wrapper approach guided by that ranking. In this way the number of wrapper evaluations is linear in the number of predictive attributes. In this paper we present a contribution to the IWSS approach which helps it to obtain more compact subsets, and consists of allowing not only the addition of new attributes but also their interchange with some of those already included in the selected subset. The disadvantage of this novelty is that it raises the worst-case complexity of IWSS to O(n^2); however, as in the case of the well-known sequential forward selection (SFS), the actual number of wrapper evaluations is considerably smaller. Empirical tests over 7 (biological) datasets with a large number of attributes demonstrate the success of the proposed approach when compared with both IWSS and SFS.
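
The replacement idea described above can be illustrated with a minimal sketch. It assumes a simplified acceptance rule (accept the best candidate if it strictly improves the score), which is not necessarily the criterion used in the paper; the names `iwss_replacement`, `filter_score`, `wrapper_eval` and the toy scores are hypothetical.

```python
from typing import Callable, Dict, List, Sequence


def iwss_replacement(features: Sequence[str],
                     filter_score: Dict[str, float],
                     wrapper_eval: Callable[[List[str]], float]) -> List[str]:
    """Incremental wrapper selection where a new attribute may be added or swapped in."""
    # Filter stage: rank attributes once, best first.
    ranking = sorted(features, key=lambda f: filter_score[f], reverse=True)
    selected: List[str] = []
    best = float("-inf")
    for f in ranking:
        # Candidates: add f, or replace each already selected attribute with f.
        candidates = [selected + [f]]
        candidates += [selected[:i] + [f] + selected[i + 1:]
                       for i in range(len(selected))]
        top, top_score = None, best
        for cand in candidates:  # up to len(selected) + 1 wrapper evaluations
            s = wrapper_eval(cand)
            if s > top_score:
                top, top_score = cand, s
        if top is not None:
            selected, best = top, top_score
    return selected


# Hypothetical wrapper scores: 'b' alone beats both 'a' alone and 'a' with 'b'.
toy_scores = {frozenset("a"): 0.6, frozenset("b"): 0.8, frozenset("ab"): 0.7}


def toy_eval(subset: List[str]) -> float:
    return toy_scores.get(frozenset(subset), 0.5)


result = iwss_replacement(["a", "b", "c"],
                          {"a": 0.9, "b": 0.8, "c": 0.1},
                          toy_eval)
print(result)
```

In this toy run, addition alone would end with the two attributes `a` and `b`, whereas allowing the swap leaves only `b`, a smaller subset with a higher score. Because each ranked attribute is tried against every selected one, the number of evaluations grows quadratically in the worst case, matching the O(n^2) bound mentioned in the abstract.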

Improving incremental wrapper-based feature subset selection by using re-ranking

This paper deals with the problem of supervised wrapper-based feature subset selection in datasets with a very large number of attributes. In such datasets sophisticated search algorithms like beam search, branch and bound, best first, genetic algorithms, etc., become intractable in the wrapper approach due to the high number of wrapper evaluations to be carried out. Thus, recently we can...

Improving incremental wrapper-based subset selection via replacement and early stopping

This paper deals with the problem of feature subset selection in classification-oriented datasets with a (very) large number of attributes. In such datasets complex classical wrapper approaches become intractable due to the high number of wrapper evaluations to be carried out. One way to alleviate this problem is to use the so-called filter-wrapper approach or Incremental Wrapper-based Subset Selection (IWSS), which consists of constructing a ranking among the predictive attributes by using a filter measure, and then applying a wrapper approach that follows the ranking. In this way the number of wrapper evaluations is linear in the number of predictive attributes. In this paper we present two contributions to the IWSS approach. The first one concerns obtaining more compact subsets, and enables not only the addition of new attributes but also their interchange with some of those already included in the selected subset. Our second contribution, termed early stopping, sets an adaptive threshold on the number of attributes in the ranking to be considered. The advantages of these new approaches are analyzed both theoretically and experimentally. The results over a set of 12 high-dimensional datasets corroborate the success of our proposals.
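
The basic filter-wrapper loop described above, without replacement or early stopping, can be sketched as follows to show why the number of wrapper evaluations is linear in the number of attributes. The acceptance rule (strict improvement by at least `min_gain`) is a simplifying assumption, and `iwss`, `filter_score`, `wrapper_eval`, `min_gain` and the toy data are hypothetical names, not taken from the published WEKA package.

```python
from typing import Callable, Dict, List, Sequence


def iwss(features: Sequence[str],
         filter_score: Dict[str, float],
         wrapper_eval: Callable[[List[str]], float],
         min_gain: float = 0.0) -> List[str]:
    """Rank attributes with a filter score, then add them one by one if the wrapper improves."""
    ranking = sorted(features, key=lambda f: filter_score[f], reverse=True)
    selected: List[str] = []
    best = float("-inf")
    for f in ranking:
        candidate = selected + [f]
        score = wrapper_eval(candidate)  # exactly one wrapper evaluation per attribute
        if score > best + min_gain:
            selected, best = candidate, score
    return selected


calls: List[List[str]] = []  # records every wrapper evaluation


def toy_eval(subset: List[str]) -> float:
    """Hypothetical stand-in for a wrapper score: rewards 'a' and 'c', penalizes size."""
    calls.append(list(subset))
    return len(set(subset) & {"a", "c"}) - 0.01 * len(subset)


selected = iwss(["a", "b", "c", "d"],
                {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1},
                toy_eval)
print(selected, len(calls))
```

With four attributes the wrapper is called four times, once per position in the ranking. An early-stopping rule would cut the loop short after some prefix of the ranking; the adaptive threshold the paper uses for that is not described in the abstract, so it is left out here.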

Evaluation of a thermal-comfort control system using real data

A large body of work in the literature addresses new systems for managing thermal control in buildings. Commonly, their evaluation is performed by simulating users and environmental conditions. In this work we instead choose a successful thermal-comfort system, formerly evaluated with simulations, and evaluate it using data from project ASHRAE RP-884, which provides logs of real data coming from different buildings, in a wide variety of climates, and occupied by people with different thermal preferences. From these logs, we propose a pre-processing and evaluation methodology in order to achieve more realistic evaluations.

Enhancing Incremental Feature Subset Selection in High-Dimensional Databases by Adding a Backward Step

Computer and Information Sciences II, 2011

A study on different backward feature selection criteria over high-dimensional databases

2011 11th International Conference on Intelligent Systems Design and Applications, 2011

Feature subset selection has become an expensive process due to the relatively recent appearance of high-dimensional databases. Thus, the need has arisen not only to reduce the dimensionality of these datasets, but also to do so efficiently. We propose a new backward search, where attributes are removed given several smart criteria found in the literature and, besides,...

Global Feature Subset Selection on High-Dimensional Datasets Using Re-ranking-based EDAs

Lecture Notes in Computer Science, 2011

The relatively recent appearance of high-dimensional databases has made traditional search algorithms too expensive in terms of time and memory resources. Thus, several modifications or enhancements to local search algorithms can be found in the literature to deal with this problem. However, non-deterministic global search, which is expected to perform better than local search, still lacks appropriate adaptations or new developments for high-dimensional databases. We present a new non-deterministic iterative method which performs a global search, can easily handle datasets with high cardinality and, furthermore, outperforms a wide variety of local search algorithms.

Improving KNN-based e-mail classification into folders generating class-balanced datasets

In this paper we deal with an e-mail classification problem known as e-mail foldering, which consists of the classification of incoming mail into the different folders previously created by the user. This task has received less attention in the literature than spam filtering and is quite complex due to the (usually large) cardinality (number of folders) and lack of balance (documents per class) of the class variable. On the other hand, proximity-based algorithms have been used in a wide range of fields for decades. One of the main drawbacks of these classifiers, known as lazy classifiers, is their computational load, due to their need to compute the distance from a new sample to each point in the vector space to decide which class it belongs to. This is why most of the techniques developed for these classifiers consist of editing and condensing the training set. In this work we approach the problem of e-mail classification into folders. It is suggested...

Speeding up incremental wrapper feature subset selection with Naive Bayes classifier

Knowledge-Based Systems, 2014

This paper deals with the problem of wrapper feature subset selection (FSS) in classification-oriented datasets with a (very) large number of attributes. In high-dimensional datasets with thousands of variables, wrapper FSS becomes a laborious computational process because of the amount of CPU time it requires. In this paper we study how, under certain circumstances, the wrapper FSS process can be sped up by embedding the classifier into the wrapper algorithm, instead of dealing with it as a black box. Our proposal is based on the combination of the NB classifier (which is known to be largely beneficial for FSS) with incremental wrapper FSS algorithms. The merit of this approach is analyzed both theoretically and experimentally, and the results show an impressive speed-up for the embedded FSS process.
