An Optimized Approach for Feature Extraction in Multi-Relational Statistical Learning
Related papers
2020
Multi-relational classification is a highly challenging task in data mining, because much of the data in our world is organised in multiple relations. The challenge comes from the huge search space and the high computational cost of feature selection that arise from the complexity of the various relations. State-of-the-art approaches are based on clustering and inductive logic programming to retrieve important features and derive hypotheses. However, those techniques are very slow and unable to generate enough data and information to produce efficient classifiers. In this paper, we propose a fast and effective method for feature selection in multi-relational classification. Moreover, we introduce natural-join and SVM-based feature selection in multi-relational statistical learning. The performance of our model on various datasets indicates that it is efficient, reliable and highly accurate.
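A minimal sketch of the idea described in this abstract, under our own assumptions (the table and column names are hypothetical, and this is not the authors' implementation): related tables are flattened with a natural join on their shared key, and a linear SVM's weights are used to rank and select features.

```python
import pandas as pd
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel

# Two hypothetical relations sharing the key "loan_id"
loans = pd.DataFrame({"loan_id": [1, 2, 3, 4],
                      "amount": [1000, 5000, 700, 3200],
                      "defaulted": [0, 1, 0, 1]})
accounts = pd.DataFrame({"loan_id": [1, 2, 3, 4],
                         "balance": [250, -40, 900, 10],
                         "age": [34, 51, 23, 45]})

# Natural join on the shared key produces a single flat table
joined = loans.merge(accounts, on="loan_id")
X = joined.drop(columns=["loan_id", "defaulted"])
y = joined["defaulted"]

# The weights of a linear SVM act as feature scores; SelectFromModel keeps
# only the features whose absolute weight exceeds the mean weight.
svm = LinearSVC(C=1.0, dual=False).fit(X, y)
selector = SelectFromModel(svm, prefit=True)
selected = X.columns[selector.get_support()]
print("selected features:", list(selected))
```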
Intelligent Automation & Soft Computing, 2016
Traditional data mining algorithms do not work efficiently for most real-world applications, where the data is stored in relational format. Useful patterns can certainly be extracted from multiple relations using an existing traditional data mining learning algorithm, but doing so involves a lot of complexity. So there is a need for multi-relational classification, which analyzes relational data and predicts unknown patterns automatically. Moreover, the performance of existing relational classifiers is limited, because the existing algorithms cannot use different classifiers based on the characteristics of different relations. The goal of the proposed approach is to select appropriate classifiers based on the characteristics of the different relations in the relational database, improving overall performance without affecting running time. A multi-criteria classifier selection function based on the ratio of accuracy to running time is therefore used to select the most efficient classifier using meta-learning. In the proposed classifier selection function, accuracy is used as a measure of benefit and running time as a measure of cost, and their ratio is taken to ensure that the most efficient classifier is selected. The experimental results show that the proposed relational classification performs better in terms of efficiency than the other existing algorithms available in the literature. We are able to achieve the best results by selecting an efficient algorithm for every relation contributing to the relational classification.
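A small sketch, under our own assumptions, of the multi-criteria selection function this abstract describes: each candidate classifier is scored by accuracy (benefit) divided by running time (cost), and the candidate with the highest ratio is chosen for a given relation. The candidate set and dataset here are illustrative, not from the paper.

```python
import time
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Stand-in for the data of one relation
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

candidates = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "naive_bayes": GaussianNB(),
    "knn": KNeighborsClassifier(),
}

def selection_score(clf, X, y):
    """Accuracy / running-time ratio for one candidate classifier."""
    start = time.perf_counter()
    accuracy = cross_val_score(clf, X, y, cv=5).mean()
    elapsed = time.perf_counter() - start
    return accuracy / elapsed

best = max(candidates, key=lambda name: selection_score(candidates[name], X, y))
print("classifier chosen for this relation:", best)
```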
An efficient multi-relational Naïve Bayesian classifier based on semantic relationship graph
Proceedings of the 4th international workshop on Multi-relational mining - MRDM '05, 2005
Classification is one of the most popular data mining tasks, with a wide range of applications, and many algorithms have been proposed to build accurate and scalable classifiers. Most of these algorithms only take a single table as input, whereas in the real world most data are stored in multiple tables and managed by relational database systems. As transferring data from multiple tables into a single one usually causes many problems, the development of multi-relational classification algorithms becomes important and attracts many researchers' interest. Existing works that extend Naïve Bayes to deal with multi-relational data either have to transform data stored in tables into main-memory Prolog facts, or limit the search space to only a small subset of real-world applications. In this work, we aim at solving these problems and building an efficient, accurate Naïve Bayesian classifier that deals with data in multiple tables directly. We propose an algorithm named Graph-NB, which upgrades the Naïve Bayesian classifier to handle multiple tables directly. In order to take advantage of the linkage relationships among tables, and to treat the different tables linked to the target table differently, a semantic relationship graph is developed to describe the relationships and to avoid unnecessary joins. Furthermore, to improve accuracy, a pruning strategy is given to simplify the graph and avoid examining too many weakly linked tables. An experimental study on both real-world and synthetic databases shows its high efficiency and good accuracy.
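A rough illustration, based on our own reading rather than the Graph-NB code, of the two ideas in this abstract: a relationship graph that records how strongly each background table is linked to the target table, and a pruning step that drops weakly linked tables before the Naïve Bayes model is fitted. All table names and the 0.5 threshold are assumptions.

```python
import pandas as pd
from sklearn.naive_bayes import GaussianNB

target = pd.DataFrame({"cust_id": [1, 2, 3, 4],
                       "label":   [0, 1, 0, 1]})
orders = pd.DataFrame({"cust_id": [1, 1, 2, 4],
                       "amount":  [10.0, 5.0, 80.0, 40.0]})
clicks = pd.DataFrame({"cust_id": [3],            # barely linked table
                       "n_clicks": [2]})

# Edge weight = fraction of target tuples that join with the linked table.
graph = {}
for name, table in {"orders": orders, "clicks": clicks}.items():
    coverage = target["cust_id"].isin(table["cust_id"]).mean()
    graph[name] = (table, coverage)

# Pruning: keep only tables whose link strength passes a threshold.
kept = {name: tbl for name, (tbl, w) in graph.items() if w >= 0.5}

# Aggregate each kept table to one row per target tuple, then fit Naive Bayes.
features = target[["cust_id"]]
for name, tbl in kept.items():
    agg = tbl.groupby("cust_id").mean().add_prefix(f"{name}_").reset_index()
    features = features.merge(agg, on="cust_id", how="left")
X = features.drop(columns=["cust_id"]).fillna(0.0)
GaussianNB().fit(X, target["label"])
```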
A Random Length Feature Construction Method for Learning Relational Data using DARA
2013
In learning relational data, the DARA (Dynamic Aggregation of Relational Attributes) algorithm transforms a relational data model representation into a vector space model representation. This data transformation is required in order to summarize or cluster data stored in relational databases in which a target record stored in a target table has a one-to-many relationship with non-target records stored in a non-target table. The descriptive accuracy of the summarized data produced by DARA is highly influenced by the representation of the records stored in non-target tables that are associated with records stored in the target table. This is important because when this summarized data is fed as input to the classification task, the predictive accuracy of the classification task is also affected. This paper proposes novel feature construction methods, called Variable Length Feature Construction without Substitution (VLFCWOS) and Variable Length Feature Construction with Substitution (VLFCWS), in order to construct a set of relevant features for learning relational data. These methods are proposed to improve the descriptive accuracy of the summarized data. In the process of summarizing relational data, a genetic algorithm is also applied and several feature scoring measures are evaluated in order to find the best set of relevant constructed features. In this work, we empirically compare the predictive accuracies of classification tasks based on the proposed feature construction methods and the existing feature construction methods. The experimental results show that the predictive accuracy of classifying data summarized with the VLFCWS method, using Total Cluster Entropy combined with Information Gain (CE-IG) as the feature scoring measure, outperforms the other methods in most cases.
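A minimal sketch, under our own assumptions, of the kind of transformation DARA performs: the many non-target records linked to one target record are turned into a "bag of attribute values", which a vector space model (here TF-IDF) converts into one fixed-length feature vector per target record. The non-target table and its columns are hypothetical.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical non-target table with a one-to-many link to the target table.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "type":    ["credit", "debit", "debit", "debit", "credit"],
    "channel": ["atm", "web", "web", "atm", "branch"],
})

# Each target record is described by the tokens of its related records.
docs = (transactions
        .assign(token=lambda d: d["type"] + "_" + d["channel"])
        .groupby("customer_id")["token"]
        .apply(" ".join))

vectors = TfidfVectorizer().fit_transform(docs)
print(vectors.toarray().round(2))   # one row (vector) per target record
```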
CrossMine: efficient classification across multiple database relations
Proceedings. 20th International Conference on Data Engineering
Most of today's structured data is stored in relational databases. Such a database consists of multiple relations that are linked together conceptually via entity-relationship links in the design of relational database schemas. Multi-relational classification can be widely used in many disciplines, including financial decision making and medical research. However, most classification approaches only work on single "flat" data relations. It is usually difficult to convert multiple relations into a single flat relation without either introducing a huge "universal relation" or losing essential information. Previous works using Inductive Logic Programming approaches (recently also known as Relational Mining) have proven effective, with high accuracy, in multi-relational classification. Unfortunately, they fail to achieve high scalability w.r.t. the number of relations in databases because they repeatedly join different relations to search for good literals. In this paper we propose CrossMine, an efficient and scalable approach for multi-relational classification. CrossMine employs tuple ID propagation, a novel method for virtually joining relations, which enables flexible and efficient search among multiple relations. CrossMine also uses aggregated information to provide essential statistics for classification. A selective sampling method is used to achieve high scalability w.r.t. the number of tuples in the databases. Our comprehensive experiments on both real and synthetic databases demonstrate the high scalability and accuracy of CrossMine.
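A toy illustration, reflecting our own reading of the abstract rather than CrossMine itself, of tuple ID propagation: instead of materialising a join, each tuple in a linked relation is annotated with the IDs and class counts of the target tuples it joins with, so candidate literals over that relation can be scored cheaply. The relations and column names are hypothetical.

```python
import pandas as pd

loans = pd.DataFrame({"loan_id": [1, 2, 3],
                      "status":  ["good", "bad", "good"]})        # target relation
accounts = pd.DataFrame({"account_id": [10, 11],
                         "district": ["A", "B"]})
has_account = pd.DataFrame({"loan_id": [1, 2, 3],
                            "account_id": [10, 10, 11]})          # link relation

# Propagate target tuple IDs along the foreign key, then group per tuple of
# the linked relation: account 10 now carries loan IDs {1, 2}, account 11 {3}.
propagated = (has_account
              .merge(loans, on="loan_id")
              .groupby("account_id")
              .agg(ids=("loan_id", set),
                   pos=("status", lambda s: (s == "good").sum()),
                   neg=("status", lambda s: (s == "bad").sum()))
              .reset_index())

# A candidate literal such as district = "A" can now be evaluated from the
# counts carried by the propagated IDs, without re-joining loans and accounts.
scored = accounts.merge(propagated, on="account_id")
print(scored)
```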
Multi-Relational Data Mining A Comprehensive Survey
2015
Multi-Relational Data Mining, or MRDM, is a growing research area that focuses on discovering hidden patterns and useful knowledge from relational databases. While the vast majority of data mining algorithms and techniques look for patterns in a flat single-table data representation, the sub-domain of MRDM looks for patterns that involve multiple tables (relations) from a relational database. This sub-domain has received increased research attention during the last two decades due to the wide range of possible applications. As a result of that growing attention, many successful multi-relational data mining algorithms and techniques have been presented. This chapter presents a comprehensive review of multi-relational data mining. It discusses the different approaches researchers have followed to explore the relational search space while highlighting some of the most significant challenges facing researchers working in this sub-domain. The chapter also describes a number of MRDM systems that h...
Advances in Parallel Computing, 2021
In today's modern era, when data handling objectives are growing ever larger with respect to volume, learning and inferring knowledge from complex data becomes a central problem. Almost all real-world information is maintained in a relational fashion holding multiple relations, unlike orthodox approaches that treat the data as a single relation. Moreover, several fields, viz. biological informatics, microbiology and chemical computation, need a more dependable and expressive approach that can provide more sophisticated results at faster speed. Hence, in the context of multi-relational data mining, in which data is retrieved directly from different records without being dumped into a single table, we describe a novel approach of an improved Multi-Relational Decision Tree Learning algorithm based on our implementations. In this paper we provide a comparative study of the aforementioned approach, in which we have taken certain results from the literature review. Exper...
Statistical Relational Learning: A State-Of-The-Art Review
Journal of Engineering and Technology, 2019
The objective of this paper is to review the state-of-the-art of statistical relational learning (SRL) models developed to deal with machine learning and data mining in relational domains in presence of missing, partially observed, and/or noisy data. It starts by giving a general overview of conventional graphical models, first-order logic and inductive logic programming approaches as needed for background. The historical development of each SRL key model is critically reviewed. The study also focuses on the practical application of SRL techniques to a broad variety of areas and their limitations.
Multi-Relational Data Mining using Probabilistic Models Research Summary
2001
Abstract. We are often faced with the challenge of mining data represented in relational form. Unfortunately, most statistical learning methods work only with "flat" data representations. Thus, to apply these methods, we are forced to convert the data into a flat form, thereby not only losing its compact representation and structure but also potentially introducing statistical skew. These drawbacks severely limit the ability of current statistical methods to mine relational databases.