Gustavo Orair - Academia.edu
Papers by Gustavo Orair
The amount of information stored in databases of textual documents keeps growing. This growth demands automatic methods for organizing these data. In this context, the study of automatic text classification has received considerable attention in both academia and industry. Most work on classification develops text classification techniques for settings with a limited number of classes and no significant dependence between classes. There are several relevant application scenarios in which these assumptions do not hold. To address such problems, a new research topic, Hierarchical Multi-label Classification (HMC), has been continually studied but still represents a major challenge for the field. In HMC problems, the set of classes tends to be much larger and the classes are organized according to a hierarchical structure. Traditional methods, besides ignoring the existing knowledge ...
Detecting outlier patterns in data has been an important research topic in the statistics, data mining and machine learning communities for many years. Research into effective solutions to this problem has several interesting applications in a myriad of domains ranging from data cleaning to financial fraud detection and from network intrusion detection to clinical diagnosis of diseases. Among the different algorithms, statistical (parametric) approaches and distance-based outlier detection are the most popular in use. The former is well grounded but often has difficulty scaling to large and high-dimensional data. The latter is relatively efficient and empirically found to be effective on a number of domains, but scalability is still an issue in spite of a fair bit of research on the topic. To address this limitation, in this work, we propose Atalaia, an efficient and scalable distance-based algorithm for detecting outliers in large high-dimensional databases. Central to our algorithm is a fast strategy to estimate the unusualness of a record within the database and the use of a rank-ordered approach to evaluate records. Our algorithm partitions the database and ranks the objects that are candidates to be outliers, significantly reducing the number of comparisons among objects. We evaluate different ranking heuristics on a comprehensive set of real and synthetic databases. Further, Atalaia also handles heterogeneous databases, i.e., those containing both categorical and continuous attributes. The results show that our algorithm outperforms the state-of-the-art distance-based outlier detection algorithm by up to 73%.
Proceedings of the VLDB Endowment, Sep 1, 2010
Detecting outliers in data is an important problem with interesting applications in a myriad of domains ranging from data cleaning to financial fraud detection and from network intrusion detection to clinical diagnosis of diseases. Over the last decade of research, distance-based outlier detection algorithms have emerged as a viable, scalable, parameter-free alternative to the more traditional statistical approaches. In this paper we assess several distance-based outlier detection approaches and evaluate them. We begin by surveying and examining the design landscape of extant approaches, while identifying key design decisions of such approaches. We then implement an outlier detection framework and conduct a factorial design experiment to understand the pros and cons of various optimizations proposed by us as well as those proposed in the literature, both independently and in conjunction with one another, on a diverse set of real-life datasets. To the best of our knowledge this is the first such study in the literature. The outcome of this study is a family of state-of-the-art distance-based outlier detection algorithms. Our detailed empirical study supports the following observations. The combination of optimization strategies enables significant efficiency gains. Our factorial design study highlights the important fact that no single optimization or combination of optimizations (factors) always dominates on all types of data. Our study also allows us to characterize when a certain combination of optimizations is likely to prevail and helps provide interesting and useful insights for moving forward in this domain.
... Thus, Cobweb can be seen as the execution of one task for every object that belongs to the set of objects. Each task, in turn, translates into a depth-first search (more specifically, a Hill Climbing heuristic search) in the concept tree. ...
International Journal of Parallel Programming, 2008
Analyzing gene expression patterns is becoming a highly relevant task in the bioinformatics area. This analysis makes it possible to determine the behavior patterns of genes under various conditions, which is fundamental information for treating diseases, among other applications. A recent advance in this area is the Tricluster algorithm, which is the first algorithm capable of determining 3D clusters (genes × samples × timestamps), that is, groups of genes that behave similarly across samples and timestamps. However, even though biological experiments collect an increasing amount of data to be analyzed and correlated, the triclustering problem remains a bottleneck due to its NP-completeness, so its parallelization seems to be an essential step towards obtaining feasible solutions. In this work we propose and evaluate the implementation of a parallel version of the Tricluster algorithm using the filter-labeled-stream paradigm supported by the Anthill parallel programming environment. The results show that our parallelization scales well with the data size and is able to handle the severe load imbalances that are inherent to the problem. Furthermore, the parallelization strategy is applicable to any depth-first search.
Proceedings of the VLDB Endowment, 2010
Detecting outliers in data is an important problem with interesting applications in a myriad of domains ranging from data cleaning to financial fraud detection and from network intrusion detection to clinical diagnosis of diseases. Over the last decade of research, distance-based outlier detection algorithms have emerged as a viable, scalable, parameter-free alternative to the more traditional statistical approaches.
Analyzing gene expression patterns is becoming a highly relevant task in the bioinformatics area. This analysis makes it possible to determine the behavior patterns of genes under various conditions, which is fundamental information for treating diseases, among other applications. An advance in this area is the tricluster algorithm, which is the first algorithm capable of determining 3D clusters, that is, sets of genes that behave similarly across a set of samples and a set of timestamps. However, while biological experiments collect an increasing amount of data to be analyzed and correlated, the triclustering problem is NP-complete, and its parallelization seems to be an essential step towards obtaining feasible solutions. In this paper we propose and evaluate the implementation of a parallel version of the tricluster algorithm using the filter-labeled-stream paradigm supported by the Anthill parallel programming environment. The results show that our parallelization scales linearly with the data size. Further, the parallelization strategy is applicable to any depth-first search.