Eduardo Hruschka - Academia.edu
Papers by Eduardo Hruschka
In this paper, we elaborate on how feature selection methods traditionally used in classification problems can be adapted for clustering problems, assuming that the number of clusters is not known a priori. The computational complexity of each described algorithm is provided. Empirical results on six bioinformatics datasets illustrate that the adaptation of four well-known supervised methods for feature selection (correlation-based, consistency-based, wrapper of k-NN classifier, and C4.5) can be useful for clustering tasks.
Anais do Encontro Nacional de Inteligência Artificial e Computacional (ENIAC 2020), 2020
We propose an integrated framework, named Multi-Document Aspect-based Sentiment Extractive Summarization (MD-ASES for short), to automatically generate extractive review summaries based on aspects of a large database with reviews of items such as films, businesses, and companies. Such summaries are obtained by extracting a subset of sentences verbatim from the reviews, based on some relevance criteria. In MD-ASES, sentences are first grouped in terms of aspects identified as predominant in the reviews. Then, sentences are selected by the similarity of the sentiment expressed about a particular aspect to the overall sentiment of the dataset reviews. Our results show that MD-ASES can successfully preserve the average sentiment of the reviews while including the most important aspects in the summary.
Intelligent Data Analysis, 2013
The problem of clustering with constraints has received considerable attention in the last decade. Indeed, several algorithms have been proposed, but only a few studies have (partially) compared their performances. In this work, three well-known algorithms for k-means-based clustering with soft constraints, namely Constrained Vector Quantization Error (CVQE), its variant named LCVQE, and the Metric Pairwise Constrained K-Means (MPCK-Means), are systematically compared according to three criteria: Adjusted Rand Index, Normalized Mutual Information, and the number of violated constraints. Experiments were performed on 20 datasets, and for each of them 800 sets of constraints were generated. In order to provide some reassurance about the non-randomness of the obtained results, outcomes of statistical tests of significance are presented. In terms of accuracy, LCVQE has been shown to be competitive with CVQE, while violating fewer constraints. In most of the datasets, both CVQE and LCVQE presented better accuracy compared to MPCK-Means, which is capable of learning distance metrics. In this sense, it was also observed that learning a particular distance metric for each cluster does not necessarily lead to better results than learning a single metric for all clusters. The robustness of the algorithms with respect to noisy constraints was also analyzed. From this perspective, the most interesting conclusion is that CVQE has shown better performance than LCVQE in most of the experiments. The computational complexities of the algorithms are also presented. Finally, a variety of (more specific) new experimental findings are discussed in the paper; e.g., deduced constraints usually do not help in finding better data partitions.
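One of the three evaluation criteria named in this abstract, the number of violated constraints, can be illustrated with a minimal sketch. It assumes constraints are given as index pairs in two lists, `must_link` and `cannot_link`; that representation and the function name `count_violations` are illustrative choices, not taken from the paper.

```python
def count_violations(labels, must_link, cannot_link):
    """Count pairwise constraints that a partition fails to satisfy.

    labels      -- cluster label of each object, indexed by position
    must_link   -- pairs (i, j) that should share a cluster
    cannot_link -- pairs (i, j) that should be in different clusters
    (Hypothetical representation, chosen for illustration.)
    """
    violated = 0
    for i, j in must_link:
        if labels[i] != labels[j]:   # pair was split across clusters
            violated += 1
    for i, j in cannot_link:
        if labels[i] == labels[j]:   # pair was merged into one cluster
            violated += 1
    return violated


labels = [0, 0, 1, 1, 2]
must_link = [(0, 1), (2, 4)]
cannot_link = [(0, 2), (2, 3)]
print(count_violations(labels, must_link, cannot_link))  # prints 2
```

With soft constraints, a partition may violate some pairs, which is why this count is reported alongside accuracy-style measures.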
Neurocomputing, 2016
The Traveling Salesman Problem (TSP) is one of the most studied optimization problems. Various metaheuristics (MHs) have been proposed and investigated on many instances of this problem. It is widely accepted that the best MH varies for different instances. Ideally, one should be able to recommend the best MHs for a new TSP instance without having to execute them. However, this is a very difficult task. We address this task by using a meta-learning approach based on label ranking algorithms. These algorithms build a mapping that relates the characteristics of those instances (i.e., the meta-features) with the relative performance (i.e., the ranking) of MHs, based on (meta-)data extracted from TSP instances that have already been solved by those MHs. The success of this approach depends on the quality of the meta-features that describe the instances. In this work, we investigate four different sets of meta-features based on different measurements of the properties of TSP instances: edge and vertex measures, complex network measures, properties from the MHs, and subsampling landmarker properties. The models are investigated in four different TSP scenarios presenting symmetry and connection strength variations. The experimental results indicate that meta-learning models can accurately predict rankings of MHs for different TSP scenarios. Good solutions for the investigated TSP instances can be obtained from the prediction of rankings of MHs, regardless of the learning algorithm used at the meta-level. The experimental results also show that the definition of the set of meta-features has an important impact on the quality of the solutions obtained.
Anais do 4. Congresso Brasileiro de Redes Neurais, 2016
The main challenge in using supervised neural networks in data mining applications is to get explicit knowledge from these models. For this purpose, an algorithm for rule extraction from artificial neural networks, based on the hidden unit activation values, is developed. This algorithm, named Modified RX, was already evaluated on two benchmarks (the Iris Plants database and the Pima Indians Diabetes database), and the results were published previously. This work deals with the application of this algorithm to a dataset containing 10,000 examples of meteorological observations collected at the International Airport of Rio de Janeiro. Each example is represented by 38 attribute values and one associated class (wet fog or dry fog). Following the data preparation tasks (data representation, data selection, and correlation analysis), a neural network is trained to model wet and dry fog conditions, and then the Modified RX Algorithm is used for rule extraction. The results obtained from the rule set provided by the algorithm are compared to those obtained from a classification tree.
Expert Systems with Applications, 2017
Several algorithms for clustering data streams based on k-Means have been proposed in the literature. However, most of them assume that the number of clusters, k, is known a priori by the user and can be kept fixed throughout the data analysis process. Besides the difficulty in choosing k, data stream clustering imposes several challenges to be addressed, such as handling non-stationary, unbounded data that arrive in an online fashion. In this paper, we propose a Fast Evolutionary Algorithm for Clustering data streams (FEAC-Stream) that allows estimating k automatically from data in an online fashion. FEAC-Stream uses the Page-Hinkley Test to detect eventual degradation in the quality of the induced clusters, thereby triggering an evolutionary algorithm that re-estimates k accordingly. FEAC-Stream relies on the assumption that clusters of (partially unknown) data can provide useful information about the dynamics of the data stream. We illustrate the potential of FEAC-Stream in a set of experiments using both synthetic and real-world data streams, comparing it to four related algorithms, namely: CluStream-OMRk, CluStream-BkM, StreamKM++-OMRk, and StreamKM++-BkM. The obtained results show that FEAC-Stream provides good data partitions and that it can detect, and accordingly react to, data changes.
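The Page-Hinkley Test mentioned in this abstract can be sketched in its common textbook form, which raises an alarm when the running mean of a monitored signal increases. The abstract does not say which quality signal FEAC-Stream monitors or which parameter values it uses, so the class name `PageHinkley`, the helper `first_alarm`, and the values of `delta` and `threshold` below are assumptions for illustration only.

```python
class PageHinkley:
    """One-sided Page-Hinkley test for an increase in the mean of a signal.

    delta     -- tolerated magnitude of change (illustrative default)
    threshold -- alarm threshold, often written lambda (illustrative default)
    """

    def __init__(self, delta=0.005, threshold=5.0):
        self.delta = delta
        self.threshold = threshold
        self.n = 0           # number of observations seen so far
        self.mean = 0.0      # running mean of the signal
        self.cum = 0.0       # cumulative deviation from the running mean
        self.min_cum = 0.0   # smallest cumulative deviation seen so far

    def update(self, x):
        """Feed one observation; return True if a change is signalled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.min_cum = min(self.min_cum, self.cum)
        return self.cum - self.min_cum > self.threshold


def first_alarm(stream, **kwargs):
    """Return the 0-based index of the first alarm, or None if there is none."""
    detector = PageHinkley(**kwargs)
    for t, x in enumerate(stream):
        if detector.update(x):
            return t
    return None


# A signal (e.g., some clustering-error measure) that jumps after 50 steps.
stream = [1.0] * 50 + [5.0] * 50
print(first_alarm(stream))
```

In a stream-clustering setting, an alarm of this kind would be the trigger for re-estimating k; a stable signal produces no alarm and therefore no re-estimation cost.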
International Journal of Hybrid Intelligent Systems, 2011
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), 2014
We describe our approach for the SemEval-2014 task 9: Sentiment Analysis in Twitter. We make use of an ensemble learning method for sentiment classification of tweets that relies on varied features such as feature hashing, part-of-speech, and lexical features. Our system was evaluated in the Twitter message-level task. This work is licensed under a Creative Commons Attribution 4.0 International Licence.
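Feature hashing, one of the feature types listed in this abstract, maps tokens to a fixed-length vector through a hash function instead of a vocabulary. The sketch below is a generic unsigned variant; the choice of MD5, the bucket count, and the function name `hash_features` are assumptions and do not describe the system submitted to SemEval-2014.

```python
import hashlib


def hash_features(tokens, n_buckets=16):
    """Map a list of tokens to a fixed-length count vector (hashing trick)."""
    vector = [0] * n_buckets
    for token in tokens:
        # A stable hash keeps the mapping identical across runs and machines.
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % n_buckets
        vector[bucket] += 1   # distinct tokens may collide in one bucket
    return vector


tweet = "great phone great battery".split()
print(hash_features(tweet))
```

The vector length is fixed in advance, so unseen words in new tweets need no vocabulary update, at the cost of occasional collisions.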
ACM Transactions on Knowledge Discovery from Data, 2014
Unsupervised models can provide supplementary soft constraints to help classify new “target” data because similar instances in the target set are more likely to share the same class label. Such models can also help detect possible differences between training and target distributions, which is useful in applications where concept drift may take place, as in transfer learning settings. This article describes a general optimization framework that takes as input class membership estimates from existing classifiers learned on previously encountered “source” (or training) data, as well as a similarity matrix from a cluster ensemble operating solely on the target (or test) data to be classified, and yields a consensus labeling of the target data. More precisely, the application settings considered are nontransductive semisupervised and transfer learning scenarios where the training data are used only to build an ensemble of classifiers and are subsequently discarded before classifying the...
2011 IEEE Third Int'l Conference on Privacy, Security, Risk and Trust and 2011 IEEE Third Int'l Conference on Social Computing, 2011
This paper introduces a privacy-aware Bayesian approach that combines ensembles of classifiers and clusterers to perform semi-supervised and transductive learning. We consider scenarios where instances and their classification/clustering results are distributed across different data sites and have sharing restrictions. As a special case, the privacy-aware computation of the model when instances of the target data are distributed across different data sites is also discussed. Experimental results show that the proposed approach can provide good classification accuracies while adhering to the data/model sharing constraints.
Lecture Notes in Computer Science, 2013
This paper introduces two new frameworks, Doubly Supervised Latent Dirichlet Allocation (DSLDA) and its non-parametric variation (NP-DSLDA), that integrate two different types of supervision: topic labels and category labels. This approach is particularly useful for multitask learning, in which both latent and supervised topics are shared between multiple categories. Experimental results on both document and image classification show that both types of supervision improve the performance of both DSLDA and NP-DSLDA and that sharing both latent and supervised topics allows for better multitask learning.
Knowledge-Based Systems, 2015
Recommender Systems (RSs) are powerful and popular tools for e-commerce. To build their recommendations, RSs make use of varied data sources, which capture the characteristics of items, users, and their transactions. Despite recent advances in RSs, the cold start problem, which arises from the lack of prior information about new users and new items, is still a relevant issue that deserves further attention. To minimize system degradation, a hybrid approach is presented that combines collaborative filtering recommendations with demographic information. The approach is based on an existing algorithm, SCOAL (Simultaneous Co-Clustering and Learning), and provides a hybrid recommendation approach that can address the (pure) cold start problem, where no collaborative information (ratings) is available for new users. Better predictions are produced from this relaxation of assumptions to replace the lack of information for the new user. Experiments using real-world datasets show the effectiveness of the proposed approach.
Proceedings of the 25th International Conference on Scientific and Statistical Database Management - SSDBM, 2013
Lecture Notes in Computer Science, 2003
This work proposes and evaluates a Nearest-Neighbor Method to substitute missing values in datasets formed by continuous attributes. In the substitution process, each instance containing missing values is compared with complete instances, and the closest instance is used to assign the missing attribute value. We evaluate this method in simulations performed on four datasets that are usually employed as benchmarks for data mining methods: Iris Plants, Wisconsin Breast Cancer, Pima Indians Diabetes, and Wine Recognition. First, we consider the substitution process as a prediction task. In this sense, we employ two metrics (Euclidean and Manhattan) to simulate substitutions both in original and normalized datasets. The obtained results were compared to those provided by a method commonly employed for this task, i.e., substitution by the mean value. Based on these simulations, we propose a substitution procedure for the well-known K-Means Clustering Algorithm. Then, we perform clustering simulations, comparing the results obtained in the original datasets with the substituted ones. These results indicate that the proposed method is a suitable estimator for substituting missing values, i.e., it preserves the relationships between variables in the clustering process. Therefore, the proposed Nearest-Neighbor Method is an appropriate data preparation tool for the K-Means Clustering Algorithm.
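The substitution process described in this abstract can be sketched as follows. The sketch assumes that missing values are marked with `None` and that distances are computed only over the attributes observed in the incomplete instance; both are illustrative assumptions, since the abstract does not give these details. The names `distance` and `nn_impute` are hypothetical.

```python
import math


def distance(a, b, attrs, metric="euclidean"):
    """Distance between rows a and b restricted to the attribute indices attrs."""
    diffs = [abs(a[k] - b[k]) for k in attrs]
    if metric == "manhattan":
        return sum(diffs)
    return math.sqrt(sum(d * d for d in diffs))


def nn_impute(data, metric="euclidean"):
    """Fill each missing value (None) from the closest complete instance."""
    complete = [row for row in data if None not in row]
    if not complete:
        raise ValueError("at least one complete instance is required")
    result = []
    for row in data:
        if None not in row:
            result.append(list(row))
            continue
        # Compare using only the attributes this instance actually has.
        observed = [k for k, v in enumerate(row) if v is not None]
        nearest = min(complete, key=lambda c: distance(row, c, observed, metric))
        result.append([nearest[k] if v is None else v for k, v in enumerate(row)])
    return result


data = [
    [1.0, 2.0, 3.0],
    [8.0, 9.0, 10.0],
    [1.2, None, 2.9],   # second attribute is missing
]
print(nn_impute(data)[2])  # prints [1.2, 2.0, 2.9]
```

The `metric` argument switches between the two metrics named in the abstract; normalizing attributes beforehand, as the abstract also considers, would be a separate preprocessing step.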
Lecture Notes in Computer Science, 2005
The substitution of missing values, also called imputation, is an important data preparation task for data mining applications. This paper describes a nearest-neighbor method to impute missing values, showing that it can be useful for a clustering genetic algorithm. The proposed nearest-neighbor method is assessed by means of simulations performed on two datasets that are benchmarks for data mining methods: Wisconsin Breast Cancer and Congressional Voting Records. The efficacy of the proposed approach is evaluated both in prediction and clustering scenarios. Empirical results show that the employed imputation method is a suitable data preparation tool.
Lecture Notes in Computer Science, 2004
M. Danelutto, D. Laforenza, M. Vanneschi (Eds.): Euro-Par 2004, LNCS 3149, pp. 254-262, 2004. © Springer-Verlag Berlin Heidelberg 2004 ... A Scheduling Algorithm for Running Bag-of-Tasks Data Mining Applications on the Grid ... Fabrício AB da Silva, Sílvia ...
2012 Brazilian Symposium on Neural Networks, 2012
The disparity between the available amounts of unlabeled and labeled data in several applications has made semi-supervised learning an active research topic. Most studies on semi-supervised clustering assume that the number of classes is equal to the number of clusters. This paper introduces a semi-supervised clustering algorithm, named Multiple Clusters per Class k-means (MCCK), which estimates the number of clusters per class via pairwise constraints generated from class labels. Experiments with eight datasets indicate that the algorithm outperforms three traditional algorithms for semi-supervised clustering, especially when the one-cluster-per-class assumption does not hold. Finally, the learned structure can offer a valuable description of the data in several applications. For instance, it can aid the identification of subtypes of diseases in medical diagnosis problems.
Proceedings of the 2013 SIAM International Conference on Data Mining, 2013
Unsupervised models can provide supplementary soft constraints to help classify new target data under the assumption that similar objects in the target set are more likely to share the same class label. Such models can also help detect possible differences between training and target distributions, which is useful in applications where concept drift may take place. This paper describes a Bayesian framework that takes as input class labels from existing classifiers (designed based on labeled data from the source domain), as well as cluster labels from a cluster ensemble operating solely on the target data to be classified, and yields a consensus labeling of the target data. This framework is particularly useful when the statistics of the target data drift or change from those of the training data. We also show that the proposed framework is privacy-aware and allows performing distributed learning when data/models have sharing restrictions. Experiments show that our framework can yield superior results to those provided by applying classifier ensembles only.
Proceedings - 10th International Conference on Machine Learning and Applications, ICMLA 2011, 2011
Several optimization methods can find good solutions for different instances of the Traveling Salesman Problem (TSP). Since there is no method that generates the best solution for all instances, the selection of the most promising method for a given TSP instance is a difficult task. This paper describes a meta-learning-based approach to select optimization methods for the TSP. Multilayer perceptron (MLP) networks are trained with TSP examples. These examples are described by a set of TSP characteristics and the cost of solutions obtained by a set of optimization methods. The trained MLP network model is then used to predict a ranking of these methods for a new TSP instance. Correlation measures are used to compare the predicted ranking with the ranking previously known. The obtained results suggest that the proposed approach is promising.
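Comparing a predicted ranking with a known ranking by means of a correlation measure can be illustrated with Spearman's rank correlation coefficient. The abstract does not name the measures used, so Spearman's coefficient is an assumed example here, and the sketch covers only rankings without ties. The function name `spearman` and the sample rankings are hypothetical.

```python
def spearman(rank_a, rank_b):
    """Spearman rank correlation for two tie-free rankings of the same items."""
    n = len(rank_a)
    if n != len(rank_b) or n < 2:
        raise ValueError("rankings must have equal length of at least 2")
    # Sum of squared differences between the positions given to each item.
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Position assigned to each of four optimization methods (1 = best).
known = [1, 2, 3, 4]
predicted = [2, 1, 3, 4]
print(round(spearman(predicted, known), 3))  # prints 0.8
```

A value of 1 means the predicted ranking matches the known one exactly, and -1 means it is fully reversed.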
Lecture Notes in Computer Science, 2004
Data mining (DM) applications are composed of computing-intensive processing tasks working on huge datasets. Due to their computing-intensive nature, these applications are natural candidates for execution on high-performance, high-throughput platforms such as PC clusters and computational grids. Many data mining algorithms can be implemented as bag-of-tasks (BoT) applications, i.e., parallel applications composed of independent tasks. This paper discusses the use of computing grids for the execution of DM algorithms as BoT applications, investigates the scalability of the execution of an application, and proposes an approach to improve its scalability.
In this paper, we elaborate on how feature selection methods traditionally used in classification... more In this paper, we elaborate on how feature selection methods traditionally used in classification problems can be adapted for clustering problems, assuming that the number of clusters is not known a priori. Computational complexity of each described algorithm is provided. Empirical results in six bioinformatics datasets illustrate that the adaptation of four well-known supervised methods for feature selection (correlation-based, consistency-based, wrapper of k-NN classifier, and C4.5) can be useful for clustering tasks.
Anais do Encontro Nacional de Inteligência Artificial e Computacional (ENIAC 2020), 2020
We propose an integrated framework, named Multi-Document Aspect-based Sentiment Extractive Summar... more We propose an integrated framework, named Multi-Document Aspect-based Sentiment Extractive Summarization (MD-ASES for short), to automatically generate extractive review summaries based on aspects of a large database with reviews of items such as films, businesses, and companies. Such summaries are got by extracting a subset of sentences as they are in the reviews, based on some relevance criteria. In MD-ASES, initially sentences are grouped in terms of aspects identified as predominant in the reviews. Then, sentences are selected by the similarity of the sentiment expressed about a particular aspect to the overall sentiment of the dataset reviews. Our results show that MD-ASES can successfully preserve the average sentiment of the reviews while including the most important aspects in the summary.
Intelligent Data Analysis, 2013
The problem of clustering with constraints has received considerable attention in the last decade... more The problem of clustering with constraints has received considerable attention in the last decade. Indeed, several algorithms have been proposed, but only a few studies have (partially) compared their performances. In this work, three well-known algorithms for k-means-based clustering with soft constraints-Constrained Vector Quantization Error (CVQE), its variant named LCVQE, and the Metric Pairwise Constrained K-Means (MPCK-Means)are systematically compared according to three criteria: Adjusted Rand Index, Normalized Mutual Information, and the number of violated constraints. Experiments were performed on 20 datasets, and for each of them 800 sets of constraints were generated. In order to provide some reassurance about the non-randomness of the obtained results, outcomes of statistical tests of significance are presented. In terms of accuracy, LCVQE has shown to be competitive with CVQE, while violating less constraints. In most of the datasets, both CVQE and LCVQE presented better accuracy compared to MPCK-Means, which is capable of learning distance metrics. In this sense, it was also observed that learning a particular distance metric for each cluster does not necessarily lead to better results than learning a single metric for all clusters. The robustness of the algorithms with respect to noisy constraints was also analyzed. From this perspective, the most interesting conclusion is that CVQE has shown better performance than LCVQE in most of the experiments. The computational complexities of the algorithms are also presented. Finally, a variety of (more specific) new experimental findings are discussed in the paper-e.g., deduced constraints usually do not help finding better data partitions.
Neurocomputing, 2016
The Traveling Salesman Problem (TSP) is one of the most studied optimization problems. Various me... more The Traveling Salesman Problem (TSP) is one of the most studied optimization problems. Various metaheuristics (MHs) have been proposed and investigated on many instances of this problem. It is widely accepted that the best MH varies for different instances. Ideally, one should be able to recommend the best MHs for a new TSP instance without having to execute them. However, this is a very difficult task. We address this task by using a meta-learning approach based on label ranking algorithms. These algorithms build a mapping that relates the characteristics of those instances (i.e., the meta-features) with the relative performance (i.e., the ranking) of MHs, based on (meta-)data extracted from TSP instances that have been already solved by those MHs. The success of this approach depends on the quality of the meta-features that describe the instances. In this work, we investigate four different sets of meta-features based on different measurements of the properties of TSP instances: edge and vertex measures, complex network measures, properties from the MHs, and subsampling landmarkers properties. The models are investigated in four different TSP scenarios presenting symmetry and connection strength variations. The experimental results indicate that meta-learning models can accurately predict rankings of MHs for different TSP scenarios. Good solutions for the investigated TSP instances can be obtained from the prediction of rankings of MHs, regardless of the learning algorithm used at the metalevel. The experimental results also show that the definition of the set of meta-features has an important impact on the quality of the solutions obtained.
Anais do 4. Congresso Brasileiro de Redes Neurais, 2016
The main challenge in using supervised neural networks in data mining applications means to get e... more The main challenge in using supervised neural networks in data mining applications means to get explicit knowledge from these models. For this purpose, an algorithm for rule extraction from artificial neural networks, based on the hidden units activation values, is developed. This algorithm, denominated Modified RX, was already evaluated in two benchmarks-Iris Plants database and Pima Indians Diabetes database-and the results were published previously. This work deals with the application of this algorithm to a dataset containing 10,000 examples of meteorological observations collected at the International Airport of Rio de Janeiro. Each example is represented by 38 attribute values and one associated class-wet fog or dry fog. Following the data preparation tasks-data representation, data selection and correlation analysis-a neural network is trained to model wet and dry fog conditions, and then the Modified RX Algorithm is used for rule extraction. The results obtained from the rule set provided by the algorithm are compared to those obtained from a classification tree.
Expert Systems with Applications, 2017
Several algorithms for clustering data streams based on k-Means have been proposed in the literat... more Several algorithms for clustering data streams based on k-Means have been proposed in the literature. However, most of them assume that the number of clusters, k , is known a priori by the user and can be kept fixed throughout the data analysis process. Besides the difficulty in choosing k , data stream clustering imposes several challenges to be addressed, such as addressing non-stationary, unbounded data that arrive in an online fashion. In this paper, we propose a Fast Evolutionary Algorithm for Clustering data streams (FEAC-Stream) that allows estimating k automatically from data in an online fashion. FEAC-Stream uses the Page-Hinkley Test to detect eventual degradation in the quality of the induced clusters, thereby triggering an evolutionary algorithm that re-estimates k accordingly. FEAC-Stream relies on the assumption that clusters of (partially unknown) data can provide useful information about the dynamics of the data stream. We illustrate the potential of FEAC-Stream in a set of experiments using both synthetic and real-world data streams, comparing it to four related algorithms, namely: CluStream-OMR k , CluStream-B k M, StreamKM ++-OMR k and StreamKM ++-B k M. The obtained results show that FEAC-Stream provides good data partitions and that it can detect, and accordingly react to, data changes.
International Journal of Hybrid Intelligent Systems, 2011
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), 2014
We describe our approach for the SemEval-2014 task 9: Sentiment Analysis in Twitter. We make use ... more We describe our approach for the SemEval-2014 task 9: Sentiment Analysis in Twitter. We make use of an ensemble learning method for sentiment classification of tweets that relies on varied features such as feature hashing, part-of-speech, and lexical features. Our system was evaluated in the Twitter message-level task. This work is licensed under a Creative Commons Attribution 4.0 International Licence.
ACM Transactions on Knowledge Discovery from Data, 2014
Unsupervised models can provide supplementary soft constraints to help classify new “target” data... more Unsupervised models can provide supplementary soft constraints to help classify new “target” data because similar instances in the target set are more likely to share the same class label. Such models can also help detect possible differences between training and target distributions, which is useful in applications where concept drift may take place, as in transfer learning settings. This article describes a general optimization framework that takes as input class membership estimates from existing classifiers learned on previously encountered “source” (or training) data, as well as a similarity matrix from a cluster ensemble operating solely on the target (or test) data to be classified, and yields a consensus labeling of the target data. More precisely, the application settings considered are nontransductive semisupervised and transfer learning scenarios where the training data are used only to build an ensemble of classifiers and are subsequently discarded before classifying the...
2011 IEEE Third Int'l Conference on Privacy, Security, Risk and Trust and 2011 IEEE Third Int'l Conference on Social Computing, 2011
This paper introduces a privacy-aware Bayesian approach that combines ensembles of classifiers an... more This paper introduces a privacy-aware Bayesian approach that combines ensembles of classifiers and clusterers to perform semi-supervised and transductive learning. We consider scenarios where instances and their classification/clustering results are distributed across different data sites and have sharing restrictions. As a special case, the privacy aware computation of the model when instances of the target data are distributed across different data sites, is also discussed. Experimental results show that the proposed approach can provide good classification accuracies while adhering to the data/model sharing constraints.
Lecture Notes in Computer Science, 2013
This paper introduces two new frameworks, Doubly Supervised Latent Dirichlet Allocation (DSLDA) a... more This paper introduces two new frameworks, Doubly Supervised Latent Dirichlet Allocation (DSLDA) and its non-parametric variation (NP-DSLDA), that integrate two different types of supervision: topic labels and category labels. This approach is particularly useful for multitask learning, in which both latent and supervised topics are shared between multiple categories. Experimental results on both document and image classification show that both types of supervision improve the performance of both DSLDA and NP-DSLDA and that sharing both latent and supervised topics allows for better multitask learning.
Knowledge-Based Systems, 2015
Recommender Systems (RSs) are powerful and popular tools for e-commerce. To build their recommend... more Recommender Systems (RSs) are powerful and popular tools for e-commerce. To build their recommendations, RSs make use of varied data sources, which capture the characteristics of items, users, and their transactions. Despite recent advances in RS, the cold start problem is still a relevant issue that deserves further attention, and arises due to the lack of prior information about new users and new items. To minimize system degradation, a hybrid approach is presented that combines collaborative filtering recommendations with demographic information. The approach is based on an existing algorithm, SCOAL (Simultaneous Co-Clustering and Learning), and provides a hybrid recommendation approach that can address the (pure) cold start problem, where no collaborative information (ratings) is available for new users. Better predictions are produced from this relaxation of assumptions to replace the lack of information for the new user. Experiments using real-world datasets show the effectiveness of the proposed approach.
Proceedings of the 25th International Conference on Scientific and Statistical Database Management - SSDBM, 2013
Lecture Notes in Computer Science, 2003
This work proposes and evaluates a Nearest-Neighbor Method to substitute missing values in datasets formed by continuous attributes. In the substitution process, each instance containing missing values is compared with complete instances, and the closest instance is used to assign the missing attribute value. We evaluate this method in simulations performed on four datasets that are usually employed as benchmarks for data mining methods: Iris Plants, Wisconsin Breast Cancer, Pima Indians Diabetes, and Wine Recognition. First, we consider the substitution process as a prediction task. In this sense, we employ two metrics (Euclidean and Manhattan) to simulate substitutions in both original and normalized datasets. The obtained results were compared to those provided by a method usually employed to perform this task, i.e., substitution by the mean value. Based on these simulations, we propose a substitution procedure for the well-known K-Means Clustering Algorithm. Then, we perform clustering simulations, comparing the results obtained in the original datasets with the substituted ones. These results indicate that the proposed method is a suitable estimator for substituting missing values, i.e., it preserves the relationships between variables in the clustering process. Therefore, the proposed Nearest-Neighbor Method is an appropriate data preparation tool for the K-Means Clustering Algorithm.
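The substitution step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `nn_impute`, the use of `None` as the missing-value marker, and the choice to measure distance only over the attributes observed in the incomplete instance are all assumptions made for the example.

```python
import math

def nn_impute(rows, metric="euclidean"):
    """Fill missing values (None) from the closest complete instance.

    Hypothetical helper: the distance uses only the attributes that are
    observed in the incomplete instance (an assumption of this sketch).
    """
    complete = [r for r in rows if None not in r]
    if not complete:
        raise ValueError("at least one complete instance is required")

    def distance(incomplete, candidate):
        diffs = [abs(x - y) for x, y in zip(incomplete, candidate) if x is not None]
        if metric == "manhattan":
            return sum(diffs)
        return math.sqrt(sum(d * d for d in diffs))

    imputed = []
    for row in rows:
        if None not in row:
            imputed.append(list(row))
            continue
        # The closest complete instance donates its values for the gaps.
        nearest = min(complete, key=lambda c: distance(row, c))
        imputed.append([n if x is None else x for x, n in zip(row, nearest)])
    return imputed

data = [
    [5.1, 3.5, 1.4],
    [6.7, 3.0, 5.2],
    [5.0, None, 1.5],  # second attribute is missing
]
filled = nn_impute(data)
```

Here the third instance is closest to the first one, so its missing attribute is taken from that instance. Passing `metric="manhattan"` switches to the second metric mentioned in the abstract.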
Lecture Notes in Computer Science, 2005
The substitution of missing values, also called imputation, is an important data preparation task for data mining applications. This paper describes a nearest-neighbor method to impute missing values, showing that it can be useful for a clustering genetic algorithm. The proposed nearest-neighbor method is assessed by means of simulations performed in two datasets that are benchmarks for data mining methods: Wisconsin Breast Cancer and Congressional Voting Records. The efficacy of the proposed approach is evaluated both in prediction and clustering scenarios. Empirical results show that the employed imputation method is a suitable data preparation tool.
Lecture Notes in Computer Science, 2004
M. Danelutto, D. Laforenza, M. Vanneschi (Eds.): Euro-Par 2004, LNCS 3149, pp. 254-262, 2004. © Springer-Verlag Berlin Heidelberg 2004 ... A Scheduling Algorithm for Running Bag-of-Tasks Data Mining Applications on the Grid ... Fabrício AB da Silva, Sílvia ...
2012 Brazilian Symposium on Neural Networks, 2012
The disparity between the available amounts of unlabeled and labeled data in several applications has made semi-supervised learning an active research topic. Most studies on semi-supervised clustering assume that the number of classes is equal to the number of clusters. This paper introduces a semi-supervised clustering algorithm, named Multiple Clusters per Class k-means (MCCK), which estimates the number of clusters per class via pairwise constraints generated from class labels. Experiments with eight datasets indicate that the algorithm outperforms three traditional algorithms for semi-supervised clustering, especially when the one-cluster-per-class assumption does not hold. Finally, the learned structure can offer a valuable description of the data in several applications. For instance, it can aid the identification of subtypes of diseases in medical diagnosis problems.
Proceedings of the 2013 SIAM International Conference on Data Mining, 2013
Unsupervised models can provide supplementary soft constraints to help classify new target data under the assumption that similar objects in the target set are more likely to share the same class label. Such models can also help detect possible differences between training and target distributions, which is useful in applications where concept drift may take place. This paper describes a Bayesian framework that takes as input class labels from existing classifiers (designed based on labeled data from the source domain), as well as cluster labels from a cluster ensemble operating solely on the target data to be classified, and yields a consensus labeling of the target data. This framework is particularly useful when the statistics of the target data drift or change from those of the training data. We also show that the proposed framework is privacy-aware and allows performing distributed learning when data/models have sharing restrictions. Experiments show that our framework can yield superior results to those provided by applying classifier ensembles only.
Proceedings - 10th International Conference on Machine Learning and Applications, ICMLA 2011, 2011
Several optimization methods can find good solutions for different instances of the Traveling Salesman Problem (TSP). Since there is no method that generates the best solution for all instances, the selection of the most promising method for a given TSP instance is a difficult task. This paper describes a meta-learning-based approach to select optimization methods for the TSP. Multilayer perceptron (MLP) networks are trained with TSP examples. These examples are described by a set of TSP characteristics and the cost of solutions obtained by a set of optimization methods. The trained MLP network model is then used to predict a ranking of these methods for a new TSP instance. Correlation measures are used to compare the predicted ranking with the ranking previously known. The obtained results suggest that the proposed approach is promising.
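The ranking comparison mentioned in this abstract can be illustrated with a rank correlation. The abstract does not say which correlation measures were used, so the Spearman coefficient below is an assumption chosen for illustration, and the function name and rankings are hypothetical.

```python
def spearman(rank_a, rank_b):
    """Spearman rank correlation for two tie-free rankings of the same items."""
    n = len(rank_a)
    if n != len(rank_b) or n < 2:
        raise ValueError("rankings must have equal length of at least 2")
    # Sum of squared rank differences, item by item.
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical ranks of four optimization methods on one TSP instance.
predicted = [1, 2, 3, 4]
known = [1, 3, 2, 4]
rho = spearman(predicted, known)
```

A value of 1 means the predicted ranking matches the known one exactly, and a value of -1 means it is fully reversed. The closed form used here holds only when neither ranking contains ties.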
Lecture Notes in Computer Science, 2004
Data mining (DM) applications are composed of computing-intensive processing tasks working on huge datasets. Due to their computing-intensive nature, these applications are natural candidates for execution on high-performance, high-throughput platforms such as PC clusters and computational grids. Many data mining algorithms can be implemented as bag-of-tasks (BoT) applications, i.e., parallel applications composed of independent tasks. This paper discusses the use of computing grids for the execution of DM algorithms as BoT applications, investigates the scalability of the execution of an application, and proposes an approach to improve its scalability.
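The bag-of-tasks structure described in this abstract can be sketched on a single machine. In this minimal illustration a local thread pool stands in for the grid; it is not the paper's scheduling approach, and `run_bag_of_tasks` and `count_above` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

def run_bag_of_tasks(task, inputs, workers=4):
    """Run independent tasks in parallel; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(task, inputs))

def count_above(threshold):
    # Hypothetical stand-in for one mining task: it reads fixed data and
    # never communicates with the other tasks.
    values = [3, 8, 1, 9, 4, 7]
    return sum(1 for v in values if v > threshold)

results = run_bag_of_tasks(count_above, [2, 5, 8])
```

Because the tasks share no state, they can be dispatched in any order and to any worker, which is the property that makes such applications suitable for grid execution.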