Oner U Celepcikay | Rice University
Papers by Oner U Celepcikay
Lecture Notes in Computer Science, 2009
Existing data mining techniques mostly focus on finding global patterns and lack the ability to systematically discover regional patterns. Most relationships in spatial datasets are regional; therefore there is a great need to extract regional knowledge from spatial datasets. This paper proposes a novel framework to discover interesting regions characterized by "strong regional correlation relationships" between attributes, and methods to analyze differences and similarities between regions. The framework employs a two-phase approach: it first discovers regions by employing clustering algorithms that maximize a PCA-based fitness function and then applies post-processing techniques to explain underlying regional structures and correlation patterns. Additionally, a new similarity measure that assesses the structural similarity of regions based on correlation sets is introduced. We evaluate our framework in a case study that centers on finding correlations between arsenic pollution and other factors in water wells, and demonstrate that our framework effectively identifies regional correlation patterns.
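The abstract does not give the PCA-based fitness function itself, so the following is only an illustrative sketch of how a region's correlation strength could be scored. It assumes, hypothetically, that a region is scored by the share of total variance carried by the leading principal component of its attribute correlation matrix; the names `correlation_matrix`, `leading_eigenvalue`, and `pca_share` are made up for this example and are not the paper's.

```python
import math

def correlation_matrix(rows):
    """Pearson correlation matrix of the attribute columns in `rows`.
    Assumes no attribute is constant within the region."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
            for j in range(d)]
    z = [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]
    return [[sum(zr[a] * zr[b] for zr in z) / n for b in range(d)]
            for a in range(d)]

def leading_eigenvalue(matrix, iterations=200):
    """Approximate the largest eigenvalue by power iteration."""
    d = len(matrix)
    # Asymmetric start vector so it is unlikely to be orthogonal
    # to the leading eigenvector.
    v = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iterations):
        w = [sum(matrix[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient of the (unit-length) final vector.
    return sum(v[a] * sum(matrix[a][b] * v[b] for b in range(d))
               for a in range(d))

def pca_share(rows):
    """Hypothetical region score: variance share of the first component."""
    corr = correlation_matrix(rows)
    return leading_eigenvalue(corr) / len(corr)

# Two toy "regions": one with perfectly correlated attributes, one without.
correlated_region = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
uncorrelated_region = [(1.0, 1.0), (1.0, -1.0), (-1.0, 1.0), (-1.0, -1.0)]
strong = pca_share(correlated_region)
weak = pca_share(uncorrelated_region)
```

Under this assumed score, a region whose attributes move together scores near 1, while a region with uncorrelated attributes scores near 1/d for d attributes.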
Representative-based clustering algorithms form clusters by assigning objects to the closest cluster representative. On the one hand, they are quite popular due to their relatively high speed and because they are theoretically well understood. On the other hand, the clusters they can obtain are limited to spherical shapes, and clustering results are also highly sensitive to initialization. In this paper, a novel agglomerative cluster post-processing technique is proposed, which greedily merges neighboring clusters to maximize a given objective function and uses Gabriel graphs to determine which clusters are neighboring. Non-spherical shapes are approximated as the union of small spherical clusters that have been computed using a representative-based clustering algorithm. We claim that this technique leads to clusters of higher quality compared to running a representative-based clustering algorithm stand-alone. Empirical studies were conducted to support this claim; for both trad...
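As a minimal illustration of the assignment step described in the first sentence of this abstract, the sketch below labels each object with the index of its closest representative. Euclidean distance and the function name `assign_to_representatives` are assumptions made for the example; the abstract does not fix a distance measure.

```python
import math

def assign_to_representatives(points, representatives):
    """Return, for each point, the index of its closest representative."""
    labels = []
    for p in points:
        distances = [math.dist(p, r) for r in representatives]
        labels.append(distances.index(min(distances)))
    return labels

# Toy data: two tight groups and one representative per group.
points = [(0.0, 0.0), (0.5, 0.2), (5.0, 5.0), (4.8, 5.3)]
representatives = [(0.0, 0.0), (5.0, 5.0)]
labels = assign_to_representatives(points, representatives)
```

Because every object simply joins its nearest representative, each cluster is a cell of the Voronoi partition induced by the representatives, which is why non-spherical shapes need the post-processing step the abstract proposes.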
Geo-referenced datasets are generated at rapidly increasing rates, creating the need to develop tools that extract knowledge from such datasets automatically. Traditional data mining techniques focus mostly on finding global patterns and lack the ability to systematically discover regional patterns. Finding interesting regional patterns is important because many patterns exist only at a regional level and not at a global level. One of the main challenges in identifying such relationships is to discover regions that are interesting to domain experts and are capable of revealing such patterns. This dissertation focuses on developing methods to uncover hidden correlational patterns and on developing regional regression tools that capture spatially varying relationships among attributes. First, we introduce a novel, PCA-based approach to discover interesting regions along with regional correlation patterns that exhibit strong relationships in the attribute space. Moreover, as a side product, a generic similarity measure for assessing the structural similarity between regions is proposed. Second, we propose a regional regression framework, called REG2, that discovers regional regression functions associated with contiguous areas in the subspace of the spatial attributes, which we call regions. Third, in order to evaluate our proposed regional regression method and other geo-regression methods, we propose various prediction evaluation measures capable of accurately assessing the performance of these techniques. Moreover, we developed several plug-in fitness functions that employ PCA, MC, regularization, and example weighting to improve the capability of uncovering the underlying structure of data without assuming predetermined boundaries, such as zip codes or grids.
The proposed frameworks are evaluated in case studies that center on identifying causes of arsenic contamination in Texas water wells and, on the Boston Housing dataset, on determining spatially varying effects of house properties on house prices. The extensive experimental results show that our framework can effectively and efficiently identify regions with strong relations between dependent and independent variables, along with regional regression functions that capture the spatial variation of attributes better than global models and other geo-regression models, yielding better models for prediction. We also show that, besides providing better prediction, the discovered regions provide more insight into relations between variables. Finally, we show that using different evaluation measures based on the needs of the domain experts can improve the prediction capabilities.
A Practice-based Model of STEM Teaching, 2015
Representative-based clustering algorithms are quite popular due to their relatively high speed and because of their sound theoretical foundation. On the other hand, the clusters they can obtain are limited to convex shapes, and clustering results are also highly sensitive to initialization. In this paper, a novel agglomerative clustering algorithm called MOSAIC is proposed, which greedily merges neighboring clusters to maximize a given fitness function. MOSAIC uses Gabriel graphs to determine which clusters are neighboring and approximates non-convex shapes as the unions of small clusters that have been computed using a representative-based clustering algorithm. We evaluate MOSAIC for traditional unsupervised clustering with k-means and DBSCAN, and also for supervised clustering. The experimental results show that this technique leads to clusters of higher quality compared to running a representative-based clustering algorithm stand-alone. Given a suitable fitness function, MOSAIC is able to...
Representative-based clustering algorithms are quite popular due to their relatively high speed and because of their sound theoretical foundation. On the other hand, the clusters they can obtain are limited to convex shapes, and clustering results are also highly sensitive to initialization. In this paper, a novel agglomerative clustering algorithm called MOSAIC is proposed, which greedily merges neighboring clusters to maximize a given fitness function. MOSAIC uses Gabriel graphs to determine which clusters are neighboring and approximates non-convex shapes as the unions of small clusters that have been computed using a representative-based clustering algorithm. The experimental results show that this technique leads to clusters of higher quality compared to running a representative-based clustering algorithm stand-alone. Given a suitable fitness function, MOSAIC is able to detect clusters of arbitrary shape. In addition, MOSAIC is capable of dealing with high-dimensional data.
Lecture Notes in Computer Science, 2009
Existing data mining techniques mostly focus on finding global patterns and lack the ability to systematically discover regional patterns. Most relationships in spatial datasets are regional; therefore there is a great need to extract regional knowledge from spatial datasets. This paper proposes a novel framework to discover interesting regions characterized by “strong regional correlation relationships” between attributes, and methods to
Innovative Methods and Applications, 2010
Representative-based clustering algorithms are quite popular due to their relatively high speed and because of their sound theoretical foundation. On the other hand, the clusters they can obtain are limited to convex shapes, and clustering results are also highly sensitive to initialization. This paper proposes post-processing techniques to alleviate this problem. In particular, a novel agglomerative clustering algorithm called MOSAIC is proposed, which greedily merges neighboring clusters to maximize an externally given fitness function. MOSAIC uses Gabriel graphs to determine which clusters are neighboring and approximates non-convex shapes as the unions of small clusters that have been computed using a representative-based clustering algorithm. We evaluate MOSAIC for traditional unsupervised clustering with k-means and DBSCAN, and also for supervised clustering. The experimental results show that the proposed post-processing techniques lead to clusters of higher quality compared to running a representative-based clustering algorithm stand-alone. Moreover, given a suitable fitness function, MOSAIC is able to detect clusters of arbitrary shape that are comparable to the ones generated by DBSCAN. In addition, MOSAIC is capable of dealing with high-dimensional data. We also claim that MOSAIC can be employed as an effective post-processing clustering algorithm to further improve the quality of clustering.
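The neighborhood test that this abstract attributes to Gabriel graphs can be sketched as follows. This is not MOSAIC itself, only the standard Gabriel-graph condition applied to cluster representatives: two representatives are connected when no third representative lies strictly inside the ball whose diameter is the segment between them. The function names and the strict-inequality convention for boundary points are assumptions made for this example.

```python
from itertools import combinations

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def gabriel_edges(representatives):
    """Return index pairs (i, j) of representatives that are Gabriel neighbors."""
    edges = []
    for i, j in combinations(range(len(representatives)), 2):
        p, q = representatives[i], representatives[j]
        d_pq = squared_distance(p, q)
        # r lies strictly inside the ball with diameter pq exactly when
        # |pr|^2 + |qr|^2 < |pq|^2.
        blocked = any(
            squared_distance(p, r) + squared_distance(q, r) < d_pq
            for k, r in enumerate(representatives)
            if k not in (i, j)
        )
        if not blocked:
            edges.append((i, j))
    return edges

# Toy representatives: three on a line plus one above the right-hand pair.
reps = [(0.0, 0.0), (2.0, 0.0), (4.0, 0.0), (3.0, 3.0)]
edges = gabriel_edges(reps)
```

In this toy layout, representatives 0 and 2 are not neighbors because representative 1 sits between them, so a greedy merging step restricted to Gabriel neighbors would never consider joining those two clusters directly.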
This paper introduces Cougar^2, an innovative open source Java framework and toolset that assists researchers in designing, developing, and using machine learning and data mining algorithms. The primary mission of Cougar^2 is to provide the research community with an intuitive API that has the abstraction and flexibility necessary to allow painless extension of the core framework. The Cougar^2 framework introduces and employs the Factory, Algorithm, and Model (FAM) paradigm, which represents a novel combination of established object-oriented principles, design patterns, strategic abstraction, and domain knowledge geared toward any machine learning or data mining task. Cougar^2 has been used successfully both for state-of-the-art spatial data mining research (regional knowledge discovery and clustering) and as the main development tool in a data mining graduate course over the past two years.
computer.org
Yip Chi Lap Rezwan Ahmed Panayiotis Andreou Maria Andreou Anelia Angelova Maria-Luiza Antonie Gowtham Atluri Mohammad Salahuddin Aziz Xiang Bai Jing Bai Alexander Behm Alessio Bertone Smriti Bhagat Runa Bhaumik Arnold Boedihardjo Mario Boley Shyam Boriah Serdar Bozdag Sandra Bringay Alexei Brodsky Fabian Buchwald Yundong Cai Deng Cai Guadalupe Canahuate Mustafa Canim Bin Cao Sinno Jialin Pan Bin Cao Nathan Liu Nicolas Cebron Aaron Cederquist Oner Ulvi Celepcikay Loic Cerf Varun Chandola Vineet Chaoji Rui Chen ...
Traditional regression analysis derives global relationships between variables and neglects spatial variations in variables. Hence it lacks the ability to systematically discover regional relationships and to build better models that use this regional knowledge to obtain higher prediction accuracy. Since most relationships in spatial datasets are regional, there is a great need for regional regression methods that derive regional regression functions reflecting the different spatial characteristics of different regions. This paper proposes a novel regional regression framework that first discovers interesting regions showing strong regional relationships between the dependent and the independent variables, and then builds a prediction model with a regional regression function associated with each region. Interesting regions are identified by running a representative-based clustering algorithm that maximizes an externally plugged-in fitness function. In this work, we propose two fitness functions: an R-squared-based fitness function and an AIC-based fitness function that handles overfitting better. We evaluate our framework in two case studies: (1) identifying causes of arsenic contamination in Texas water wells and (2) determining spatially varying effects of house properties on house prices in the Boston Housing dataset. We demonstrate that our framework effectively identifies interesting regions and builds better prediction systems that rely on regional models.
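To make the contrast between global and regional regression concrete, the sketch below fits a one-variable least-squares line to two toy regions and to their pooled data, and reports R-squared and AIC for each fit. The abstract does not state the exact form of either fitness function, so this is an assumption-laden illustration: the AIC uses the common least-squares form n ln(RSS/n) + 2k (up to an additive constant), and all names and data are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def r_squared_and_aic(xs, ys):
    """Return (R^2, AIC) of the fitted line; assumes the fit is not exact."""
    n = len(xs)
    intercept, slope = fit_line(xs, ys)
    mean_y = sum(ys) / n
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    tss = sum((y - mean_y) ** 2 for y in ys)
    k = 2  # fitted parameters: intercept and slope
    return 1.0 - rss / tss, n * math.log(rss / n) + 2 * k

# Two toy regions with opposite relationships between x and y.
xs = [1.0, 2.0, 3.0, 4.0]
region_a_ys = [2.1, 3.9, 6.2, 7.8]   # y rises with x
region_b_ys = [7.9, 6.1, 3.8, 2.2]   # y falls with x

r2_a, aic_a = r_squared_and_aic(xs, region_a_ys)
r2_b, aic_b = r_squared_and_aic(xs, region_b_ys)
r2_global, aic_global = r_squared_and_aic(xs + xs, region_a_ys + region_b_ys)
```

In this toy setup the opposite regional trends cancel in the pooled data, so the global fit explains almost none of the variance while each regional fit explains nearly all of it. AIC values are only comparable between models fitted to the same data, so they are reported here per fit rather than compared across regions.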