Dorothy ren - Academia.edu

Papers by Dorothy ren

Fourth IEEE International Conference on Data Mining (ICDM'04)

ABSTRACT

Outlier detection can discover unexpected and interesting knowledge, which is often important to the information society. In this paper, we develop an efficient density-based outlier detection method for large datasets. In our method, outliers are efficiently detected from candidate data subsets that contain potential outliers. Furthermore, the algorithm is implemented using a vertical data organization model, the P-tree, which speeds up the algorithm significantly. We tested our method with NHL data. Experiments show that our method achieves an order-of-magnitude speed improvement over contemporary approaches.

With the increasing popularity of data warehouses and data marts, the ability to refresh data in a timely fashion is more important than ever. In this paper, a new approach is proposed to incrementally update materialized summary tables. The approach has three advantages: a) it retrieves changes at the predicate level; b) it utilizes time information hidden in user data, so it does not introduce extra cost, whereas other current approaches introduce overhead for a time tag; c) it applies changes to summary tables by merge. The merge only updates changed values and inserts new rows into the summary table, and it is much more efficient than union. By making use of these advantages, this method is much faster than other methods while achieving the same accuracy as state-of-the-art approaches. A comparison with the state of the art on IBM DB2 workstation version showed that the proposed method outperforms the state of the art in terms of elapsed time,

16th IEEE International Conference on Tools with Artificial Intelligence

16th IEEE International Conference on Tools with Artificial Intelligence

Journal of Information & Knowledge Management, 2004

Association rule mining (ARM) is the data-mining process for finding all association rules in datasets matching user-defined measures of interest such as support and confidence. Usually, ARM proceeds by mining all frequent itemsets, a step known to be very computationally intensive, from which rules are then derived in a straightforward manner. In general, mining all frequent itemsets prunes the search space by using the downward closure (or anti-monotonicity) property of support, which states that no itemset can be frequent unless all of its subsets are frequent. A large number of papers have addressed the problem of ARM, but not many of them have focused on scalability over very large datasets (i.e., datasets that contain a very large number of transactions). In this paper, we propose a new model for representing data and mining frequent itemsets that is based on the P-tree technology for compression and faster logical operations over vertically structured data and on set enumeration tre...
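
As a rough illustration of the downward closure property described in this abstract, the following is a minimal sketch of a generic level-wise (Apriori-style) frequent itemset search in Python. It is not the P-tree-based model the paper proposes; the function name `frequent_itemsets` and the use of an absolute count for `min_support` are assumptions made for the example.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Generic level-wise search; NOT the paper's P-tree method.

    min_support is an absolute transaction count (an assumption of this sketch).
    Returns a dict mapping each frequent itemset (frozenset) to its support.
    """
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        # Number of transactions that contain every item of the itemset.
        return sum(1 for t in transactions if itemset <= t)

    items = {item for t in transactions for item in t}
    level = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            level[frozenset([item])] = s

    result = {}
    k = 1
    while level:
        result.update(level)
        k += 1
        # Candidate generation: unions of two frequent (k-1)-itemsets of size k.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        next_level = {}
        for c in candidates:
            # Downward closure: skip c if any (k-1)-subset is not frequent.
            if all(frozenset(sub) in level for sub in combinations(c, k - 1)):
                s = support(c)
                if s >= min_support:
                    next_level[c] = s
        level = next_level
    return result
```

The pruning step avoids counting support for any candidate that has an infrequent subset, which is where the downward closure property saves work.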

Lecture Notes in Computer Science, 2003

The DataSURG group at NDSU has a long-standing interest in data mining remotely sensed imagery (RSI) for agricultural, forestry and other prediction and analysis applications. A spatial data structure, the Peano count tree, was developed that provides an efficient, lossless, data-mining-ready representation of the many types of data involved in these applications. This data structure has made possible the mining of multiple very large data sets, including time sequences of RSI and multimedia land data. The Peano count tree (P-tree) technology provides an efficient way to store and mine images of any format, together with pertinent land data of still other formats. With the invention of gene chips and gene expression microarrays (MA data) for use in medicine, plant science and many other application areas, new multimedia data mining challenges appeared. MA data presents a one-time gene expression level map of thousands of genes subjected to hundreds of conditions. An important multimedia plant science application of the near future is to integrate macro-scale analysis of RSI with the micro-scale analysis of MA and to do the latter across multiple organisms. Most of the MA research has been done for a particular organism, and the results have been archived as text abstracts (e.g., Medline abstracts). It will therefore be necessary to combine text mining with most multimedia RSI and MA mining. This is truly a multimedia data mining setting. The way text is almost always mined today is to extract pertinent features into tables and then mine the tables (i.e., extract structured records from the unstructured text first). P-trees are a convenient technology to mine all media involved in this research.

Data mining for spatial data has become increasingly important as more and more organizations are exposed to spatial data from such sources as remote sensing, geographical information systems (GIS), astronomy, computer cartography, environmental assessment and planning, bioinformatics, etc. Recently, density-based clustering methods, such as DENCLUE, DBSCAN, and OPTICS, have been published and recognized as powerful clustering methods for data mining. These approaches have a run time complexity of O(n log n) when using spatial index techniques such as the R+ tree and grid cells. However, these methods are known to lack scalability with respect to dimensionality. In this paper, we develop a new efficient density-based clustering algorithm using HOBBit metrics and P-trees. The fast P-tree ANDing operation facilitates the calculation of the density function within HOBBit rings. The average run time complexity of our algorithm for spatial data in d dimensions is O(dn√n). Our proposed method has com...

Lecture Notes in Computer Science, 2003

Data mining for spatial data has become increasingly important as more and more organizations are exposed to spatial data from sources such as remote sensing, geographical information systems, astronomy, computer cartography, environmental assessment and planning, etc. Recently, density-based clustering methods, such as DENCLUE, DBSCAN, and OPTICS, have been published and recognized as powerful clustering methods for data mining. These approaches have a run time complexity of O(n log n) when using spatial index techniques such as the R+ tree and grid cells. However, these methods are known to lack scalability with respect to dimensionality. In this paper, a unique approach to efficient neighborhood search and a new efficient density-based clustering algorithm using EIN-rings are developed. Our approach exploits compressed vertical data structures, Peano Trees (P-trees), and fast P-tree logical operations to accelerate the calculation of the density function within EIN-rings. This approach stands in contrast to the ubiquitous approach of vertically scanning horizontal data structures (records). The average run time complexity of our algorithm for spatial data in d dimensions is O(dn√n). Our proposed method has cardinality scalability comparable to that of other density methods for small and medium-sized data, but superior speed and dimensional scalability.

Proceedings of the Thirteenth ACM conference on Information and knowledge management - CIKM '04, 2004

"One person's noise is another person's signal." Outlier detection is used to clean up datasets and also to discover useful anomalies, such as criminal activities in electronic commerce, computer intrusion attacks, terrorist threats, agricultural pest infestations, etc. Thus, outlier detection is critically important in the information-based society. This paper focuses on finding outliers in large datasets using distance-based methods. First, to speed up outlier detection, we revise Knorr and Ng's distance-based outlier definition; second, a vertical data structure, instead of traditional horizontal structures, is adopted to further facilitate efficient outlier detection. We tested our methods against a National Hockey League dataset and show an order-of-magnitude speed improvement compared to contemporary distance-based outlier detection approaches.
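
For orientation, here is a minimal sketch of the original Knorr and Ng DB(p, D) definition that this abstract says it revises: an object is an outlier if at least a fraction p of the objects in the dataset lie farther than distance D from it. The revised definition and the vertical data structure used in the paper are not reproduced here; the function name `db_outliers` and the naive pairwise scan are assumptions made for the example.

```python
import math


def db_outliers(points, p, d):
    """Naive O(n^2) check of the original DB(p, D)-outlier definition.

    This is only a baseline sketch, not the paper's revised definition.
    Returns the indices of the points flagged as outliers.
    """
    n = len(points)
    outliers = []
    for i, x in enumerate(points):
        # Count the other objects lying farther than d from x.
        far = sum(
            1 for j, y in enumerate(points)
            if j != i and math.dist(x, y) > d
        )
        if far >= p * n:
            outliers.append(i)
    return outliers
```

With four points clustered near the origin and one point far away, only the distant point has enough far neighbours to be flagged.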

Lecture Notes in Computer Science, 2003

Proceedings of the 2005 ACM symposium on Applied computing - SAC '05, 2005

Data arising from genomic and proteomic experiments is amassing at high speed, resulting in huge amounts of raw data; consequently, the need to analyze such biological data, the understanding of which still lags far behind, has become prominent in the current post-genomic era. In this paper we attempt to analyze annotated genome data by applying a central data-mining technique, association rule mining, with the aim of discovering rules capable of yielding deeper insights into this type of data. We propose a new technique capable of using domain knowledge in the form of queries to efficiently mine only the subset of associations that are of interest to the researcher, in an incremental and interactive mode.

Computer Applications in Industry and Engineering, 2004

Knowledge and Information Systems, 2006

Graphs are increasingly becoming a vital source of information within which a great deal of semantics is embedded. As the size of available graphs increases, arriving at the embedded semantics becomes a much more complicated task. One form of ...

16th IEEE International Conference on Tools with Artificial Intelligence, 2004

Proceedings of the 8th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery - DMKD '03, 2003

Online Analytical Processing (OLAP) is an important application of data warehouses. With more and more spatial data being collected, such as remotely sensed images, geographical information, and digital sky survey data, efficient OLAP for spatial data is in great demand. In this paper, we build a new data warehouse structure, the PD-cube. With the PD-cube, OLAP operations and queries can be implemented efficiently. All of this is accomplished using the fast logical operations of Peano Trees (P-trees). One of the P-tree variations, the Predicate P-tree, is used to efficiently reduce data accesses by filtering out "bit holes" consisting of consecutive 0's. Experiments show that OLAP operations can be executed much faster than with traditional OLAP methods.
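
To give a feel for why logical operations over vertically organized data can answer counting queries quickly, the sketch below splits a column of integers into one bit vector per bit position and counts matching rows using only AND and NOT. It uses plain, uncompressed Python integers as bit vectors, not actual P-trees or the PD-cube; the helper names `bit_slices` and `count_equal` are hypothetical.

```python
def bit_slices(values, width):
    """Split a column of unsigned ints into `width` vertical bit vectors.

    Bit i of slices[k] is 1 exactly when bit k of values[i] is 1.
    Plain ints stand in for compressed structures (an assumption of this sketch).
    """
    slices = [0] * width
    for i, v in enumerate(values):
        for k in range(width):
            if (v >> k) & 1:
                slices[k] |= 1 << i
    return slices


def count_equal(slices, n_rows, target):
    """Count rows whose value equals `target` using only AND/NOT on the slices."""
    mask = (1 << n_rows) - 1          # one bit per row
    acc = mask                        # start with every row selected
    for k, s in enumerate(slices):
        # Keep rows whose k-th bit agrees with the k-th bit of target.
        acc &= s if (target >> k) & 1 else (~s & mask)
    return bin(acc).count("1")
```

Each query touches one bit vector per bit position rather than every record, which is the general idea behind scanning vertical structures instead of horizontal rows.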
