The Study of Dynamic Aggregation of Relational Attributes on Relational Data Mining

Dynamic Aggregation of Relational Attributes Based on Feature Construction

Advances in Databases and Information Systems, 2008

The importance of input representation has long been recognised in machine learning. This paper discusses the application of genetic-based feature construction methods to generate input data for the data summarisation method called Dynamic Aggregation of Relational Attributes (DARA). Here, feature construction methods are applied in order to improve the descriptive accuracy of the DARA algorithm. The DARA algorithm is designed to summarise data stored in the non-target tables by clustering them into groups, where multiple records stored in non-target tables correspond to a single record stored in a target table. This paper addresses the question of whether the descriptive accuracy of the DARA algorithm benefits from the feature construction process. This involves solving the problem of constructing a relevant set of features for the DARA algorithm by using a genetic-based algorithm. This work also evaluates several scoring measures used as fitness functions to find the best set of constructed features.

Pattern-Based Transformation Approach to Relational Domain Learning Using Dynamic Aggregation for Relational Attributes

2006

Due to the widespread use of relational databases (MySQL, Oracle, DB2, MS SQL), most data are stored as multiple tables in what can be a very large database. As a result, more efficient algorithms for mining data from multi-relational domains need to be implemented. Inductive Logic Programming (ILP) techniques are useful for analyzing data in multi-relational databases. Unfortunately, even though not complex in structure, such business data are often large and contain highly non-determinate components, making them difficult for ILP learners geared towards structurally complex tasks. In this paper, we build a novel transformation-based approach to relational domain learning and describe the transformation process implemented through relational aggregation based on pattern distance. We also present the prototype of “Dynamic Aggregation of Relational Attributes” (henceforth DARA), which is capable of mapping a one-to-many relationship into a one-to-one relationship, while preventing loss of information, when handling classification tasks in relational domains. Experiments in a multi-relational domain show a higher percentage of correctly classified instances, and we illustrate the set of rules extracted using our approach.

Optimizing Feature Construction Process for Dynamic Aggregation of Relational Attributes

Journal of Computer Science, 2009

Abstract: Problem statement: The importance of input representation has long been recognized in machine learning. Feature construction is one of the methods used to generate relevant features for learning data. This study addressed the question of whether the descriptive accuracy of the DARA algorithm benefits from the feature construction process. In other words, this paper discusses the application of a genetic algorithm to optimize the feature construction process that generates input data for the data summarization method called Dynamic Aggregation of Relational Attributes (DARA). Approach: The DARA algorithm was designed to summarize data stored in the non-target tables by clustering them into groups, where multiple records stored in non-target tables correspond to a single record stored in a target table. Here, feature construction methods are applied in order to improve the descriptive accuracy of the DARA algorithm. The task therefore involves constructing a relevant set of features for the DARA algorithm by using a genetic-based algorithm. Results: The experimental results show that the quality of summarized data is directly influenced by the methods used to create patterns that represent records in the (n×p) TF-IDF weighted frequency matrix. The evaluation of the genetic-based feature construction algorithm showed that the data summarization results can be improved by constructing features with the Cluster Entropy (CE) genetic-based feature construction algorithm. Conclusion: This study showed that the data summarization results can be improved by constructing features with the cluster entropy genetic-based feature construction algorithm.
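The two quantities named in this abstract, the (n×p) TF-IDF weighted frequency matrix and a cluster entropy score, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each of the n target records is represented as a bag of pattern strings, that the weighting is the textbook tf × log(n/df) form, and that cluster entropy is the size-weighted class entropy of the clusters. The function names `tfidf_matrix` and `cluster_entropy` are hypothetical.

```python
import math
from collections import Counter


def tfidf_matrix(bags):
    """Build an n x p TF-IDF matrix from n bags of pattern strings.

    Assumption: weight = term frequency * log(n / document frequency).
    """
    vocab = sorted({p for bag in bags for p in bag})
    n = len(bags)
    # Document frequency: number of bags that contain each pattern.
    df = Counter(p for bag in bags for p in set(bag))
    matrix = []
    for bag in bags:
        tf = Counter(bag)
        matrix.append([tf[p] * math.log(n / df[p]) for p in vocab])
    return vocab, matrix


def cluster_entropy(cluster_ids, class_labels):
    """Size-weighted average of per-cluster class entropy (lower is purer).

    Assumption: this is one common definition, used here as a stand-in
    for the CE fitness measure mentioned in the abstract.
    """
    total = len(cluster_ids)
    by_cluster = {}
    for c, y in zip(cluster_ids, class_labels):
        by_cluster.setdefault(c, []).append(y)
    score = 0.0
    for members in by_cluster.values():
        size = len(members)
        counts = Counter(members)
        h = -sum((k / size) * math.log2(k / size) for k in counts.values())
        score += (size / total) * h
    return score
```

In a genetic search, a candidate set of constructed features would change the patterns in each bag, the records would be re-clustered on the resulting matrix, and a score such as `cluster_entropy` could serve as the fitness to minimise.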

Data Summarization Approach to Relational Domain Learning Based on Frequent Pattern to Support the Development of Decision Making

Advanced Data Mining and Applications, 2006

A new approach is needed to handle huge datasets stored in multiple tables in a very large database. Data mining and Knowledge Discovery in Databases (KDD) promise to play a crucial role in the way people interact with databases, especially decision support databases where analysis and exploration operations are essential. In this paper, we present related work in Relational Data Mining and define the basic notions of data mining for decision support and the types of data aggregation as a means of categorizing or summarizing data. We then present a novel approach to relational domain learning to support the development of decision-making models by introducing automated construction of a hierarchical multi-attribute model for decision making. We describe how a relational dataset can naturally be handled to support the construction of a hierarchical multi-attribute model by using relational aggregation based on pattern distance. We also present the prototype of Dynamic Aggregation of Relational Attributes (henceforth DARA), which is capable of supporting the construction of a hierarchical multi-attribute model for decision making. Experiments in a multi-relational domain show a higher percentage of correctly classified instances, and we illustrate the set of rules extracted from the relational domains to support decision making.

An Efficient Data Mining Dataset Preparation using Aggregation in Relational Database

Preparing a data set from a relational database management system for data mining is a difficult and time-consuming task. The prepared data can be used as input for data mining analysis. However, a traditional Structured Query Language (SQL) aggregate function returns its result as a single column with one row per aggregated group. This paper presents a horizontal representation of data for dataset preparation in data mining analysis, which reduces memory space when evaluated with the cancer dataset.
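The contrast between a vertical aggregate (one row per group) and a horizontal layout (one row per entity, one column per group value) can be sketched as follows. This is an illustrative pivot in plain Python, not the method proposed in the paper. The function name `horizontal_aggregate`, the choice of summation as the aggregate, and the zero default for missing combinations are all assumptions.

```python
from collections import defaultdict


def horizontal_aggregate(rows, key, pivot, value):
    """Sum `value` per (key, pivot) pair and emit one dict per key.

    Each distinct pivot value becomes its own column, so every entity
    ends up as a single row, ready to be used as a mining input.
    """
    pivots = sorted({r[pivot] for r in rows})
    # Every key starts with a zero in each pivot column (assumed default).
    totals = defaultdict(lambda: dict.fromkeys(pivots, 0))
    for r in rows:
        totals[r[key]][r[pivot]] += r[value]
    return [{key: k, **totals[k]} for k in sorted(totals)]


rows = [
    {"patient": 1, "test": "A", "score": 2},
    {"patient": 1, "test": "B", "score": 3},
    {"patient": 1, "test": "A", "score": 1},
    {"patient": 2, "test": "B", "score": 5},
]
table = horizontal_aggregate(rows, key="patient", pivot="test", value="score")
```

Here the four input rows collapse into two rows, one per patient, with columns `A` and `B` holding the summed scores.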

Discretization and grouping: Preprocessing steps for data mining

Lecture Notes in Computer Science, 1998

Unlike on-line discretization performed by a number of machine learning (ML) algorithms for building decision trees or decision rules, we propose off-line algorithms for discretizing numerical attributes and grouping values of nominal attributes. The number of resulting intervals obtained by discretization depends only on the data; the number of groups corresponds to the number of classes. Since both discretization and grouping are done with respect to the goal classes, the algorithms are suitable only for classification/prediction tasks. As a side effect of the off-line processing, the number of objects in the datasets and the number of attributes may be reduced.
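One simple way to obtain class-driven groups of nominal values is to assign each value to the class it most often co-occurs with, which yields at most one group per class. The sketch below shows only that idea. The abstract does not describe the actual grouping algorithm, so the majority-class rule and the function name `group_nominal_values` are assumptions.

```python
from collections import Counter, defaultdict


def group_nominal_values(values, classes):
    """Map each nominal value to its most frequent class label.

    Values that share a majority class fall into the same group, so the
    number of groups cannot exceed the number of classes.
    """
    per_value = defaultdict(Counter)
    for v, c in zip(values, classes):
        per_value[v][c] += 1
    return {v: counts.most_common(1)[0][0] for v, counts in per_value.items()}


values = ["red", "red", "blue", "blue", "green", "green", "green"]
classes = ["yes", "yes", "no", "no", "yes", "no", "no"]
groups = group_nominal_values(values, classes)
```

In this example `blue` and `green` end up in the same group because both occur mostly with the class `no`.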

Multi-relational concept discovery with aggregation

2009 24th International Symposium on Computer and Information Sciences, 2009

Concept discovery aims at finding the rules that best describe the given target predicate (i.e., the concept). Aggregate information such as average, count, and max is descriptive in domains where an aggregated value takes part in the definition of the concept. Therefore, a concept discovery system needs aggregation capability in order to construct high-quality rules (with high accuracy and coverage) for such domains. In this work, we describe a method for concept discovery with aggregation on an ILP-based concept discovery system, namely C2D-A. C2D-A extends C2D by considering all instances together and thus improves the quality of the generated rules. Together with this extension, the aggregation handling mechanism is modified accordingly, leading to more accurate aggregate values as well.

Aggregation in Confidence-Based Concept Discovery for Multi-Relational Data Mining

Multi-relational data mining has become popular due to the limitations of propositional problem definition in structured domains and the tendency to store data in relational databases. Several relational knowledge discovery systems have been developed employing various search strategies, heuristics, language pattern limitations and hypothesis evaluation criteria, in order to cope with an intractably large search space and to be able to generate high-quality patterns. In this work, a new ILP-based concept discovery method is described in which user-defined specifications are relaxed. Moreover, this new method works directly on relational databases. In addition, a new confidence-based pruning is used in this technique. A set of experiments is conducted to test the performance of the new method.