On the Design of a Parallel Object Oriented Data Mining Toolkit

Towards a parallel data mining toolbox

Proceedings 15th International Parallel and Distributed Processing Symposium. IPDPS 2001, 2001

This paper presents research projects tackling two aspects in data mining. First, a toolbox is discussed that allows flexible and interactive data exploration, analysis and presentation using the scripting language Python. The advantages of this toolbox are that it provides the functionality to process multiple SQL queries in parallel, and enables fast data retrieval using a supervised caching mechanism for commonly used queries. These two facets of the toolbox allow for fast, efficient data access reducing the time spent on data exploration, preparation and analysis.
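The two facets described above, parallel query execution and caching of commonly used queries, can be illustrated with a minimal Python sketch. It is not the toolbox's actual interface: the `QueryCache` class, its `fetch_many` method, the `build_demo_db` helper, and the `sales` table are all hypothetical names introduced here, SQLite stands in for whatever database the authors used, and a plain dictionary keyed by query text stands in for their supervised caching mechanism.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


class QueryCache:
    """Hypothetical helper: runs SQL queries in parallel and caches results by query text."""

    def __init__(self, db_path, max_workers=4):
        self.db_path = db_path
        self.max_workers = max_workers
        self._cache = {}   # query text -> list of result rows
        self.misses = 0    # number of queries actually sent to the database

    def _run(self, sql):
        # Each worker opens its own connection, because a sqlite3 connection
        # is by default tied to the thread that created it.
        conn = sqlite3.connect(self.db_path)
        try:
            return conn.execute(sql).fetchall()
        finally:
            conn.close()

    def fetch_many(self, queries):
        # Only queries not seen before are executed; duplicates are collapsed.
        missing = [q for q in dict.fromkeys(queries) if q not in self._cache]
        with ThreadPoolExecutor(max_workers=self.max_workers) as pool:
            for sql, rows in zip(missing, pool.map(self._run, missing)):
                self._cache[sql] = rows
        self.misses += len(missing)
        return [self._cache[q] for q in queries]


def build_demo_db(path):
    """Create a tiny illustrative table (assumed schema, not from the paper)."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("north", 10), ("south", 20), ("north", 5)],
    )
    conn.commit()
    conn.close()
```

Calling `fetch_many` a second time with the same query text returns the cached rows without touching the database, which is the effect the abstract attributes to caching commonly used queries. A real cache would also need an invalidation policy for changing data, which this sketch omits.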

A middleware for developing parallel data mining implementations

2001

Abstract: Data mining is an interdisciplinary field, having applications in diverse areas like bioinformatics, medical informatics, scientific data analysis, financial analysis, consumer profiling, etc. In each of these application domains, the amount of data available for analysis has exploded in recent years, making the scalability of data

Parallel Data Mining Experimentation Using Flexible Configurations

Lecture Notes in Computer Science, 2002

When data mining first appeared, several disciplines related to data analysis, such as statistics and artificial intelligence, were combined around a new topic: extracting significant patterns from data. The original data sources were small datasets and, therefore, traditional machine learning techniques were the most common tools for these tasks. As the volume of data grew, these traditional methods were reviewed and extended with knowledge from experts working in the field of data management and databases. Today the problems are even bigger than before and, once again, a new discipline allows researchers to scale up to these data: distributed and parallel processing. In order to use parallel processing techniques, specific factors about the mining algorithms and the data should be considered. There are now several new parallel algorithms that, in most cases, are extensions of a traditional centralized algorithm. Many of these algorithms share common core parts and differ only in their distribution schema, parallel coordination, or load/task balancing methods. We call these groups algorithm families. In this paper we introduce a methodology to implement algorithm families. This methodology is founded on the MOIRAE distributed control architecture. In this work we show how this architecture allows researchers to design parallel processing components that can dynamically change their behavior according to control policies.
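The notion of an algorithm family, a shared core combined with interchangeable distribution policies, can be sketched in Python as follows. This is an illustration under stated assumptions, not the MOIRAE architecture: the function names (`block_partition`, `round_robin_partition`, `count_items`, `run_family_member`) are hypothetical, the core task (item counting over transactions) is chosen only for brevity, and a thread pool stands in for real distributed workers.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def block_partition(records, n):
    """Distribution policy 1: contiguous blocks of roughly equal size."""
    size = max(1, -(-len(records) // n))  # ceiling division, at least 1
    return [records[i:i + size] for i in range(0, len(records), size)]


def round_robin_partition(records, n):
    """Distribution policy 2: deal records out to workers in turn."""
    return [records[i::n] for i in range(n)]


def count_items(chunk):
    """Shared core of the family: count item occurrences in one partition."""
    return Counter(item for transaction in chunk for item in transaction)


def run_family_member(records, partition_policy, workers=2):
    """Combine the shared core with a policy chosen at call time."""
    chunks = partition_policy(records, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_items, chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)  # merge the per-partition counts
    return total
```

Because the policy is passed in as an argument, two members of the family differ only in how the data is distributed, and both produce the same merged counts. Swapping the policy at run time is a simplified stand-in for the dynamic, policy-driven behavior the abstract describes.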

A parallel data management layer for data mining

2005

We propose the design of a data management abstraction level to implement a full set of parallel KDD applications, with minimal performance overhead and greater scalability than conventional DBMS, providing a high-level parallel API to be exploited by parallel and out-of-core data mining algorithms. Our approach exploits knowledge of the parallel and sequential structure of applications. Programs are developed with the ASSIST parallel programming environment, and expose explicit algorithmic hints in the sequential code through the data management API. We describe an existing prototype and report examples and first test results with mining algorithms.

A parallel decision tree builder for mining very large visualization datasets

SMC 2000 Conference Proceedings. 2000 IEEE International Conference on Systems, Man and Cybernetics. 'Cybernetics Evolving to Systems, Humans, Organizations, and their Complex Interactions' (Cat. No.00CH37166)

Simulation problems in the DOE ASCI program generate visualization datasets more than a terabyte in size. The practical difficulties in visualizing such datasets motivate the desire for automatic recognition of salient events. We have developed a parallel decision tree classifier for use in this context. Comparisons to ScalParC, a previous attempt to build a fast parallelization of a decision tree classifier, are provided. Our parallel classifier executes on the "ASCI Red" supercomputer. Experiments demonstrate that datasets too large to be processed on a single processor can be efficiently handled in parallel, and suggest that there need not be any decrease in accuracy relative to a monolithic classifier constructed on a single processor.

A distributed framework for parallel data mining using HPJava

BT technology journal, 1999

Java has become a language of choice for applications executing in heterogeneous environments utilising distributed objects and multithreading. To handle large data sets, scalable and efficient implementations of data mining approaches are required, generally employing computationally intensive algorithms. Conventional Java implementations do not directly provide support for the data structures often encountered in such algorithms, and they also lack repeatability in numerical precision across platforms. This paper describes ...

A data mining toolset for distributed high-performance platforms

2002

Abstract: Today a large number of scientific and commercial applications require the analysis of large data sets maintained over geographically distributed sites, using the computational power of distributed high-performance environments. Advances in networking technology and computational infrastructure have made it possible to construct large-scale distributed computing platforms, called computational grids, that provide dependable, consistent, and pervasive access to high-end computational resources.

Efficient Data Mining: Scripting and Scalable Parallel Algorithms

2000

This paper presents our approach to data mining, which couples parallel applications with a scripting language, resulting in an efficient and flexible toolbox. Parallel algorithms that scale both in data size and in number of processors are key to solving the ever-growing problems in data mining. On the other hand, data mining applications should be flexible enough to allow interactive data exploration. By using a toolbox written in a scripting language, we are able to steer parallel applications in a flexible way, thus fulfilling the needs of a data miner for fast interactive data analysis. The chosen approach is discussed and first results are presented.