Renato Moraes Silva | Universidade Estadual de Campinas
Papers by Renato Moraes Silva
Knowledge-Based Systems, 2020
Applied Soft Computing, Nov 1, 2020
Anais do XII Simpósio Brasileiro de Segurança da Informação e de Sistemas Computacionais (SBSeg 2012)
Web spamming is one of the main problems affecting the quality of search engines. The number of web pages that use this technique to achieve better positions in search results keeps growing. The main motivation is the profit obtained from the online advertising market, in addition to attacks on Internet users through malware that steals information to facilitate bank theft. Given this, this work presents an analysis of machine learning techniques applied to spam host detection. Experiments carried out with a real, public, and large dataset indicate that ensembles of tree-based methods are promising for the task of spam host detection.
iSys, 2018
Spam filtering in online instant messages and SMS is a challenging problem nowadays. This is because the messages are often very short and rife with slang, idioms, symbols, emoticons, and abbreviations, which hamper prediction and knowledge discovery. To address this problem, we evaluated a simple, fast, scalable, multiclass, and online text classification method based on the minimum description length principle. We conducted experiments using a real and public dataset, which demonstrate that our method is effective for instant messaging and SMS spam filtering in both online and offline learning contexts.
2019 8th Brazilian Conference on Intelligent Systems (BRACIS), 2019
The growing number of textual documents currently available on electronic media and on the Internet imposes real challenges for many applications that demand searching and content analysis. As a consequence, text classification has emerged as a field of great interest in machine learning in the last decade. In many real-world applications, textual documents can naturally be labeled in different categories, and moreover, the value of their features can change over time, requiring learning approaches with the ability to adjust their hypotheses very efficiently. Therefore, online learning and multilabel classification are in the spotlight nowadays, since very few currently available approaches are able to handle such problems simultaneously without requiring problem transformation. In this study, we propose a new multilabel text classification method based on the minimum description length principle that can be applied to real-world, dynamic, and large-scale problems because it d...
Web spam has become a major problem for Internet users, causing personal and economic losses. Fortunately, several methods have been proposed in the literature for automatic detection of this plague. However, the constant improvement of the techniques used by spammers requires filtering approaches to be more generic, efficient, and highly adaptable. Given this scenario, this paper presents a performance evaluation of multilayer perceptron artificial neural networks employed to solve this problem.
The increasing popularity and reach of SMS and instant messaging services through mobile devices have attracted the attention of spammers who indiscriminately send messages which, besides being annoying, can also cause financial loss to the users. In this paper, we present a text classifier based on the minimum description length principle which is suitable for detecting spam disseminated in short and noisy text messages. The proposed approach supports incremental learning and, therefore, its predictive model can adapt to continuously evolving spamming techniques. We conducted experiments using a real and public dataset, which demonstrate that our approach is effective for spam filtering of short text messages in both online and offline learning contexts.
iSys, 2012
The web is becoming increasingly important to its users as a source of entertainment, communication, research, news, and commerce. Consequently, websites compete with each other to attract users' attention, and many gain greater visibility through strategies that deceive search engines. These sites, known as web spam, cause personal and economic losses to users. Given this scenario, this work presents a performance analysis of several machine learning techniques applied to the automatic detection of web servers that propagate web spam. A statistical validation of the results showed that bagging of decision trees, multilayer perceptron neural networks, random forest, and adaptive boosting of decision trees are promising for the task of spam host detection.
2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), 2018
In recent years, spammers have taken advantage of the popularity of electronic media to spread undesired text messages. These may cause direct and indirect damage, such as dissatisfaction and exposure of users to misleading information and malicious content that can result in significant financial losses. Automatically filtering short text messages is a challenging problem nowadays because labeled datasets generally contain few instances and messages may have too few terms to be classified. In addition, the messages are rife with abbreviations, slang, and misspelled words, making it difficult to generate a good computational representation. In this study, we propose an automatic data augmentation technique to increase the number of labeled instances and to improve the quality of the computational representation of short and noisy text messages. We also propose an ensemble approach to combine the predictions obtained by the classifiers using the messages generated by this technique. Experiments with three text representation techniques demonstrated that the ensemble approach improves the results obtained in the detection of undesired short text messages when the number of training instances is small.
2016 15th IEEE International Conference on Machine Learning and Applications (ICMLA), 2016
The steady growth and popularization of the Web has led spammers to develop techniques to circumvent search engines, aiming for good visibility of their web pages in search results. They are responsible for serious problems such as dissatisfaction, irritation, exposure to unpleasant or malicious content, and financial loss. Although different machine learning approaches have been used to detect web spam, many of them suffer from the curse of dimensionality or incur a very high computational cost, impeding their use in real scenarios. Thus, considerable effort is still devoted to developing more advanced methods that are both able to prevent overfitting and fast to learn. To fill this gap, we present MDLClass, a classification technique based on the minimum description length principle, applied to the context of web spam filtering. The proposed method is very efficient, lightweight, multi-class, and fast. We also evaluated a new approach to detect web spam that combines the predictions obtained by the classifiers using content-based, link-based, and transformed link-based features. In our experiments, we employed two real, public, and large datasets: WEBSPAM-UK2006 and WEBSPAM-UK2007. The results indicate that the proposed MDLClass and the ensemble of predictions using different types of features are promising for the task of web spam filtering.
Applied Soft Computing, 2020
Anais do XV Encontro Nacional de Inteligência Artificial e Computacional (ENIAC 2018), 2018
In this paper, we present a new public and real dataset of labeled images of meteors and non-meteors that we recently used in a machine learning competition. We also present a comprehensive performance evaluation of several established machine learning methods and compare the results with a stacking approach, one of the winning solutions of the competition. We compared the performance obtained by the methods in the traditional repeated five-fold cross-validation with that obtained using the training and test partitions used in the competition. A careful analysis of the results indicates that, in general, the stacking-based approach obtained the best performance compared to the baselines. Moreover, we found evidence that the validation strategy used by the platform that hosted the competition can lead to results that do not hold in a cross-validation setup, which is recommended in real-world scenarios.
Expert Systems with Applications, 2017
Knowledge-Based Systems, 2017
In many areas, the volume of text information is increasing rapidly, thereby demanding efficient text classification approaches. Several methods are available at present, but most exhibit declining performance as the dimensionality of the problem increases, or they incur high computational costs for training, which limits their application in real scenarios. Thus, it is necessary to develop a method that can process high-dimensional data rapidly. In this study, we propose MDLText, an efficient, lightweight, scalable, and fast multinomial text classifier, which is based on the minimum description length principle. MDLText exhibits fast incremental learning and is sufficiently robust to prevent overfitting, which are desirable features in real-world applications, large-scale problems, and online scenarios. Our experiments were carefully designed to ensure that we obtained statistically sound results, which demonstrated that the proposed approach achieves a good balance between predictive power and computational efficiency.
2012 11th International Conference on Machine Learning and Applications, 2012
Revista Brasileira de Computação Aplicada, 2012
iSys - Brazilian Journal of Information Systems, 2017
Many YouTube users produce content regularly and make this task their main livelihood. However, this success has been attracting the attention of malicious users, who spread undesired comments to promote themselves or to disseminate malicious links. In this scenario, traditional text categorization methods may face limitations due to characteristics inherent to the problem: (1) comments tend to be short and poorly written and (2) the classification problem is naturally online. This article evaluates a classification method based on the minimum description length principle and compares the results with those of traditional online learning methods. An ensemble technique is also proposed, which combines the classification methods with different natural language processing techniques. The experiments were carefully conducted and the statistical analysis of the results indicates that the proposed technique achieved performance superior to that obtained...