Ricardo Cerri | Universidade de São Paulo

Papers by Ricardo Cerri

2018 7th Brazilian Conference on Intelligent Systems (BRACIS), 2018

Transposable Elements (TEs) are DNA sequences capable of changing a gene's activity through transposition within the cells of a host. Once TEs insert themselves into other genes, they can change or reduce the activity of certain proteins, which in some cases can make the survival of such organisms unfeasible or even provide genetic variability. A variety of methods have been proposed for the identification and classification of TEs, but most of them still involve a lot of manual work or are too class-specific, which restricts their applicability. Besides, the classes involved in such problems are often hierarchically structured, which is ignored by most of these methods. In this scenario, one problem that still needs further investigation is the use of strategies for selecting positive and negative instances during the induction of hierarchical models. Therefore, in this paper we explore four distinct strategies for selecting training instances, making use of several Machine Learning classifiers with different biases, which were applied to the Hierarchical Classification of TEs using a local approach. Finally, we recommend the best strategy based on the experimental results.

Advances in Intelligent Data Analysis XIX, 2021

2019 8th Brazilian Conference on Intelligent Systems (BRACIS), 2019

In Multi-Label Stream Classification (MLSC), examples arriving in a stream can be simultaneously classified into multiple classes. This is a very challenging task, especially considering that new classes can emerge during the stream (Concept Evolution), and known classes can change over time (Concept Drift). In real situations, these characteristics come together with a scenario with Infinitely Delayed Labels, where we can never access the true class labels of the examples to update classifiers. In order to overcome these issues, this paper proposes a new method called MultI-label learNing Algorithm for Data Streams with Binary Relevance transformation (MINAS-BR). Our proposal uses a new Novelty Detection (ND) procedure to detect concept evolution and concept drift, and is updated in an unsupervised fashion. We also propose a new methodology to evaluate MLSC methods in scenarios with Infinitely Delayed Labels. Experiments over synthetic data sets attested to the potential of MINAS-BR, which was able to adapt to different concept drift and concept evolution scenarios, obtaining superior or competitive performance in comparison to literature baselines.
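The Binary Relevance transformation named in the abstract above can be illustrated with a minimal sketch. This is only the generic transformation (one binary problem per label), not the MINAS-BR method itself; the function name `binary_relevance` and the toy data are hypothetical.

```python
from typing import Dict, List, Sequence, Set, Tuple

Example = Tuple[Sequence[float], Set[str]]


def binary_relevance(
    examples: Sequence[Example],
) -> Dict[str, List[Tuple[Sequence[float], int]]]:
    """Split a multi-label dataset into one binary dataset per label.

    For each label, an example is positive (1) if the label is in its
    label set and negative (0) otherwise.
    """
    labels = sorted({label for _, label_set in examples for label in label_set})
    return {
        label: [(x, 1 if label in label_set else 0) for x, label_set in examples]
        for label in labels
    }


# Hypothetical toy data: feature vectors paired with label sets.
toy = [
    ([0.1, 0.9], {"sports"}),
    ([0.8, 0.2], {"politics", "economy"}),
    ([0.5, 0.5], {"sports", "economy"}),
]
per_label = binary_relevance(toy)
```

Each entry of `per_label` can then be handed to any binary learner; how MINAS-BR builds and updates its per-label models is described in the paper, not here.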

Progress in Artificial Intelligence, 2019

Machine Learning and Knowledge Discovery in Databases, 2021

One of the most challenging machine learning problems is a particular case of data classification in which classes are hierarchically structured and objects can be assigned to multiple paths of the class hierarchy at the same time. This task is known as hierarchical multi-label classification (HMC), with applications in text classification, image annotation, and in bioinformatics problems such as protein function prediction. In this paper, we propose novel neural network architectures for HMC called HMCN, capable of simultaneously optimizing local and global loss functions for discovering local hierarchical class-relationships and global information from the entire class hierarchy while penalizing hierarchical violations. We evaluate its performance in 21 datasets from four distinct domains, and we compare it against the current HMC state-of-the-art approaches. Results show that HMCN substantially outperforms all baselines with statistical significance, arising as the novel state-of...
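To make "penalizing hierarchical violations" concrete, the sketch below shows one common way to express such a penalty: a class should not receive a higher predicted probability than its parent, and any excess is squared and summed. This is an assumption for illustration, not necessarily the exact term used in HMCN; the function name and the toy hierarchy are hypothetical.

```python
from typing import Dict, Optional


def hierarchy_violation_penalty(
    probs: Dict[str, float],
    parent: Dict[str, Optional[str]],
) -> float:
    """Sum of squared amounts by which a child's probability exceeds its parent's."""
    total = 0.0
    for cls, p_child in probs.items():
        par = parent.get(cls)
        if par is not None:
            # Only positive differences count as violations.
            total += max(0.0, p_child - probs[par]) ** 2
    return total


# Hypothetical two-level hierarchy: A is the parent of A.1 and A.2.
parent_of = {"A": None, "A.1": "A", "A.2": "A"}
predicted = {"A": 0.25, "A.1": 0.75, "A.2": 0.125}
penalty = hierarchy_violation_penalty(predicted, parent_of)
```

In a neural network, a term of this kind would typically be weighted and added to the classification loss so that training discourages inconsistent predictions.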

In this article, some complex real-world problems faced by industry and academia, for which Machine Learning methods are used, are described. This study also presents classification methods used in many research areas, such as Bioinformatics. In addition, a literature review of Machine Learning methods applied to agriculture and livestock was performed. Finally, the study presents some research groups from Embrapa, academia and industry that use Machine Learning methods.

ArXiv, 2018

Machine learning algorithms often contain many hyperparameters whose values affect the predictive performance of the induced models in intricate ways. Due to the high number of possibilities for these hyperparameter configurations, and their complex interactions, it is common to use optimization techniques to find settings that lead to high predictive accuracy. However, we lack insight into how to efficiently explore this vast space of configurations: which are the best optimization techniques, how should we use them, and how significant is their effect on predictive or runtime performance? This paper provides a comprehensive approach for investigating the effects of hyperparameter tuning on three Decision Tree induction algorithms, CART, C4.5 and CTree. These algorithms were selected because they are based on similar principles, have presented a high predictive performance in several previous works and induce interpretable classification models. Additionally, they contain many inte...

J. Inf. Data Manag., 2018

In traditional classification, an instance is assigned to one class within a small set of classes. However, there are problems where an instance is related to many hierarchically structured classes, known as Hierarchical Classification (HC), which is present in many domains like Text Categorization, Music Genre Classification and Bioinformatics. A topic that has gained attention recently is the classification of Transposable Elements (TEs), which are DNA sequences capable of moving inside the genome. TEs are of great importance to the genetic variability of species, since they can modify the functionality of host genes. Despite their research relevance, just a few tools perform their automatic classification, and most of them do not use more elaborate strategies, such as using Machine Learning to learn models from data. Moreover, the interpretability of these methods is still an issue. In this work, we extend the original study that proposed the global method HC-GA, presenting some improvem...

2017 International Joint Conference on Neural Networks (IJCNN), 2017

Multi-label classification is a machine learning task where instances can be classified into two or more labels simultaneously. In this task, there exist correlations between the instances belonging to the same or similar sets of labels. This paper proposes the incorporation of instance correlations by modifying the multi-label datasets. We used the label space to create new features, which represent these correlations. The original and modified datasets were used with different multi-label classification methods. Experiments have shown that better results can be obtained when instance correlations are incorporated in the classification tasks. All methods were evaluated with measures specifically designed for multi-label problems.

The Neotropical region is the richest in bioluminescent Coleoptera species; however, its bioluminescence megadiversity is still underexplored in terms of genomic organization and evolution, mainly within the Phengodidae family. The railroad worm Phrixothrix hirtus is an important biological model and symbolic species due to its bicolor bioluminescence, being the only organism that produces true red light among bioluminescent terrestrial species. Here, we performed the partial genome assembly of P. hirtus, combining short and long reads generated with Illumina sequencing, providing an important source of genomic information and a framework for comparative genomic analyses for the evaluation of the bioluminescent system in Elateroidea. The estimated genome size is ∼3.4 Gb, with 32% GC content and 67% repetitive elements, making it the largest genome described in the Elateroidea superfamily. Several events of gene family expansions associated with anatomical development and morphogenesis...

Proceedings of the 2020 Genetic and Evolutionary Computation Conference Companion, 2020

Information Sciences, 2021

In multi-target prediction, an instance has to be classified along multiple target variables at the same time, where each target represents a category or numerical value. There are several strategies to tackle multi-target prediction problems: the local strategy learns a separate model for each target variable independently, while the global strategy learns a single model for all target variables together. Previous studies suggested that the global strategy should be preferred because (1) learning is more efficient, (2) the learned models are more compact, and (3) it overfits much less than the local strategy, as it is harder to overfit on several targets at the same time than on one target. However, it is not clear whether the global strategy exploits correlations between the targets optimally. In this paper, we investigate whether better results can be obtained by learning multiple multi-target models on several partitions of the targets. To answer this question, we first determined alternative partitions using an exhaustive search strategy and a strategy based on a genetic algorithm, and then compared the results of the global and local strategies against these. We used decision trees and random forests as base models. The results show that it is possible to outperform global and local approaches, but finding a good partition without incurring overfitting remains a challenging task.
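The search space behind the exhaustive strategy mentioned above can be sketched as the set of all partitions of the target variables. The generator below is a minimal illustration, not the authors' implementation; the name `set_partitions` and the target names are hypothetical. The global strategy corresponds to the single-block partition and the local strategy to the all-singletons partition.

```python
from typing import Iterator, List, Sequence


def set_partitions(targets: Sequence[str]) -> Iterator[List[List[str]]]:
    """Yield every partition of `targets` into non-empty blocks."""
    if not targets:
        yield []
        return
    first, rest = targets[0], targets[1:]
    for partition in set_partitions(rest):
        # Option 1: add `first` to one of the existing blocks.
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # Option 2: put `first` in a new block of its own.
        yield [[first]] + partition


partitions = list(set_partitions(["y1", "y2", "y3"]))
global_partition = [["y1", "y2", "y3"]]   # one model for all targets
local_partition = [["y1"], ["y2"], ["y3"]]  # one model per target
```

The number of partitions grows as the Bell numbers (5 for three targets, 15 for four, 52 for five), which is why exhaustive search quickly becomes impractical and a heuristic such as a genetic algorithm is a natural alternative.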

Sensors and Actuators B: Chemical, 2021

In this paper, we report on machine learning to analyze the capacitance spectra obtained with an electronic tongue (e-tongue) and discriminate three endocrine-disrupting chemicals (EDC): bisphenol A, estrone, and 17-β-estradiol, and their mixtures. The e-tongue comprised seven sensing units made with interdigitated gold electrodes coated with layer-by-layer films of poly(o-methoxy aniline), poly(3-thiophene acetic acid), and molybdenum disulfide (MoS2). The Multilayer Perceptron (MLP), Random Forest, and Extreme Gradient Boosting (XGBoost) models were applied for multi-target regression to predict the concentration of individual contaminants and their mixtures. These machine learning models were evaluated according to the root mean square error (RMSE) values. The best performance was achieved with XGBoost for which RMSE ranged from 0.19 to 3.37 for individual contaminants, from 0.12 to 0.25 for the mixtures, and from 0.34 to 3.46 for the entire dataset. The high performance was only possible with a multi-target regression strategy, including a feature selection procedure. In the latter, the data were plotted with the parallel coordinate technique, and the silhouette coefficient was calculated, which is a quantitative measure of the ability to distinguish similar samples in a dataset. The usefulness of the machine learning methods is demonstrated by noting that the data from mixtures of EDCs could not be distinguished using multidimensional projections. Also significant is that this combination of machine learning and information visualization methodology is entirely generic; it may be applied to analyze data from e-tongues and other sensing and biosensing devices in prediction tasks as demanding as in the discrimination of mixtures of EDCs at concentrations below nmol L⁻¹.
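As a small illustration of the evaluation measure named above, the sketch below computes the RMSE separately for each target of a multi-target regression. The function name and the numbers are hypothetical and are not data from the paper.

```python
import math
from typing import List, Sequence


def rmse_per_target(
    y_true: Sequence[Sequence[float]],
    y_pred: Sequence[Sequence[float]],
) -> List[float]:
    """Root mean square error for each target column."""
    n_samples = len(y_true)
    n_targets = len(y_true[0])
    return [
        math.sqrt(
            sum((t[j] - p[j]) ** 2 for t, p in zip(y_true, y_pred)) / n_samples
        )
        for j in range(n_targets)
    ]


# Hypothetical example: two samples, two targets (e.g. two concentrations).
true_values = [[1.0, 2.0], [3.0, 4.0]]
predictions = [[1.0, 4.0], [3.0, 2.0]]
errors = rmse_per_target(true_values, predictions)
```

Reporting one RMSE per target, rather than a single pooled value, makes it visible when a model predicts some contaminants well and others poorly.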

Proceedings of the 36th Annual ACM Symposium on Applied Computing, 2021

Several algorithms have been proposed for offline multi-label classification. However, applications in areas such as traffic monitoring, social networks, and sensors produce data continuously, the so-called data streams, posing challenges to batch multi-label learning. With the lack of stationarity in the distribution of data streams, new algorithms are needed to adapt online to such changes (concept drift). Also, in realistic applications, changes occur in scenarios with infinitely delayed labels, where the true classes of the arriving instances are never available. We propose an online unsupervised incremental method based on self-organizing maps for multi-label stream classification in scenarios with infinitely delayed labels. We consider the existence of an initial set of labeled instances to train a self-organizing map for each label. The learned models are then used and adapted in an evolving stream to classify new instances, considering that their classes will never be available. We adapt to incremental concept drifts by online updating the weight vectors of winner neurons and the dataset label cardinality. Predictions are obtained using the Bayes rule and the outputs of each neuron, adapting the prior probabilities and conditional probabilities of the classes in the stream. Experiments using synthetic and real datasets show that our method is highly competitive with several ones from the literature, in both stationary and concept drift scenarios.
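The idea of updating the weight vector of the winner neuron online can be sketched with the standard self-organizing map rule, in which the closest neuron is moved a fraction of the way toward the incoming instance. This is a generic illustration under that assumption, without neighborhood functions, label cardinality updates or the Bayes-rule prediction step of the paper; all names and values are hypothetical.

```python
from typing import List, Sequence


def best_matching_unit(weights: Sequence[Sequence[float]], x: Sequence[float]) -> int:
    """Index of the neuron whose weight vector is closest to x (squared Euclidean)."""
    distances = [sum((w - xi) ** 2 for w, xi in zip(neuron, x)) for neuron in weights]
    return distances.index(min(distances))


def update_winner(weights: List[List[float]], x: Sequence[float], lr: float) -> int:
    """Move the winner neuron toward x by a fraction `lr`; return its index."""
    winner = best_matching_unit(weights, x)
    weights[winner] = [w + lr * (xi - w) for w, xi in zip(weights[winner], x)]
    return winner


# Hypothetical map with two neurons and one incoming stream instance.
som_weights = [[0.0, 0.0], [1.0, 1.0]]
winner_index = update_winner(som_weights, [1.0, 0.5], lr=0.5)
```

Because only unlabeled instances are needed for this update, a rule of this kind fits scenarios where true labels never arrive.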

2021 International Joint Conference on Neural Networks (IJCNN), 2021

Recent works on Multi-Label Classification (MLC) present multiple strategies to explore label correlations as a way to improve classifier performance. However, these works focus only on the traditional local and global approaches, i.e., transforming the original problem into a set of binary local problems, or dealing globally with all classes simultaneously. Very few works have investigated strategies that use label correlations to partition the label space in different ways. While local partitions use several binary classifiers (one per label), global partitions use only one classifier to deal with all labels. In contrast, here we propose a strategy that explores the correlations between labels to partition the label space, aiming to find hybrid partitions in between the local and global ones. We believe that such in-between partitions better cluster similar labels, improving the ability of multi-label classifiers to explore label correlations. We compared the hybrid partitions with global, local and randomly generated partitions. Our experimental results showed that the hybrid partitions lead to competitive results and, in general, were slightly better than global and local partitions. The random partitions were also competitive with the global and local partitions, showing that the current local and global approaches still need improvements in order to consider label correlations.

2021 International Joint Conference on Neural Networks (IJCNN), 2021

Multi-target learning is a prediction task where each example is associated with multiple target variables (outputs) simultaneously. One of the challenges in this research field is related to the high dimensionality of the data and the high number of target variables with dependencies. In such scenarios, it is crucial to extract lower dimensional representations from the original input space, such that these can be provided as input to other multi-target predictors. In this paper, we propose using Autoencoders as feature extractors on several publicly available multi-target classification datasets. Results were evaluated considering state-of-the-art multi-target classification methods and evaluation measures in the literature. The experiments showed that the neural networks were able to maintain the predictive performance even when the extracted features corresponded to a dimension size equivalent to 10% of the original number of features, in some cases obtaining better results than when using the original datasets.

Expert Systems with Applications, 2021

In recent years, the interest in interpretable classification models has grown. One of the proposed ways to improve the interpretability of a classification model based on collections of crisp rules is to use sets (unordered collections) instead of lists (ordered collections). One of the problems associated with sets is that multiple rules may cover a single instance but predict different classes for it, thus requiring a conflict resolution strategy. In this work, we propose two algorithms capable of finding feature-space regions inside which any created rule would be consistent with the already existing rules, preventing inconsistencies from arising. Our algorithms, named CFSGS and CFSBE, do not generate classification rules nor classification models but are instead meant to enhance algorithms that do so, such as Learning Classifier Systems. We analyzed both algorithms from a theoretical perspective and conducted experiments with a proof of concept evolutionary algorithm that employs CFSBE. The experiments suggest that using CFSBE as an embedded tool does incur a computational overhead, but such cost is not prohibitive.
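The kind of inconsistency described above can be illustrated with a minimal sketch. It assumes, for illustration only, that each crisp rule is an axis-aligned box of closed numeric intervals plus a predicted class; two rules then conflict when their boxes overlap and their classes differ. This is not the CFSGS or CFSBE algorithm, and the `Rule` class and `rules_conflict` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

UNBOUNDED = (float("-inf"), float("inf"))


@dataclass
class Rule:
    # Closed interval per constrained feature; unlisted features are unconstrained.
    bounds: Dict[str, Tuple[float, float]]
    predicted_class: str


def rules_conflict(a: Rule, b: Rule) -> bool:
    """True if some instance could be covered by both rules with different classes."""
    if a.predicted_class == b.predicted_class:
        return False
    for feature in set(a.bounds) | set(b.bounds):
        a_lo, a_hi = a.bounds.get(feature, UNBOUNDED)
        b_lo, b_hi = b.bounds.get(feature, UNBOUNDED)
        if a_hi < b_lo or b_hi < a_lo:
            return False  # disjoint on this feature, so no shared instance
    return True


# Hypothetical rules over features "x" and "y".
r1 = Rule({"x": (0.0, 5.0)}, "pos")
r2 = Rule({"x": (3.0, 8.0), "y": (0.0, 1.0)}, "neg")
r3 = Rule({"x": (6.0, 9.0)}, "neg")
```

Here `r1` and `r2` conflict because any instance with `x` in [3, 5] and `y` in [0, 1] is covered by both, while `r1` and `r3` never cover the same instance.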

Applied Soft Computing, 2019
