Statistical relational learning for document mining
Learning statistical models from relational data
2011
Statistical Relational Learning (SRL) is a subarea of machine learning that combines elements of statistical and probabilistic modeling with languages that support structured data representations.
Research on Statistical Relational Learning
2007
This paper presents an overview of the research on learning statistical models of relational data being carried out at the University of Washington. Our work falls into five main directions: learning models of social networks; learning models of sequential relational processes; scaling up statistical relational learning to massive data sources; learning for knowledge integration; and learning programs in procedural languages. We describe some of the common themes and research issues arising from this work.
A Logic-Based Approach to Mining Inductive Databases
Lecture Notes in Computer Science, 2007
In this paper, we discuss the main problems of inductive query languages and optimisation issues. We present a logic-based inductive query language, illustrate the use of aggregates, and exploit a new join operator to model specific data mining tasks. We show how a fixpoint operator works for association rule mining and a clustering method. A preliminary experimental result shows that the fixpoint operator outperforms SQL-based and Apriori methods. The results of our framework could be useful for inductive query language design in the development of inductive database systems.
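To make the fixpoint idea concrete, here is a minimal, hypothetical Python sketch (not the paper's logic-based query language): frequent itemsets are grown level by level, and the iteration stops when a pass adds no new frequent sets, i.e. when a fixpoint is reached.

```python
def frequent_itemsets(transactions, min_support):
    """Level-wise fixpoint computation of frequent itemsets (Apriori-style).

    Iterates until no new frequent itemsets are produced, which is
    exactly the fixpoint condition of the level-wise operator.
    """
    transactions = [frozenset(t) for t in transactions]

    def support(s):
        return sum(1 for t in transactions if s <= t)

    # Level 1: frequent single items.
    current = {frozenset([i]) for t in transactions for i in t}
    current = {s for s in current if support(s) >= min_support}
    result = set(current)
    k = 1
    while current:  # stop at the fixpoint: no new frequent sets
        k += 1
        # Candidate generation: unions of frequent sets one size larger.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        current = {c for c in candidates if support(c) >= min_support}
        result |= current
    return result
```

With transactions `[{"a","b"}, {"a","b","c"}, {"a"}]` and minimum support 2, the fixpoint contains `{a}`, `{b}`, and `{a, b}`, but not `{c}`.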
A Wordification Approach to Relational Data Mining
Lecture Notes in Computer Science, 2013
This paper describes a propositionalization technique called wordification. Wordification is inspired by text mining and can be seen as a transformation of a relational database into a corpus of documents. As in previous propositionalization methods, any propositional data mining algorithm can be applied after the wordification step. The most notable advantage of the presented technique is greater scalability: for one-to-many databases, the propositionalization step runs in time linear in the number of attributes times the number of examples. Furthermore, wordification results in an easily understandable propositional feature representation. We present our initial experiments on two real-life datasets.
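The core transformation can be sketched in a few lines. The following is a hypothetical illustration, not the authors' implementation: tables are represented as plain dicts, and each main-table row becomes a "document" of attribute-value "words", extended with words from its linked secondary-table rows.

```python
def wordify(main_table, secondary_tables):
    """Turn each main-table row into a 'document' of attribute-value words.

    main_table: {row_id: {attr: value}}
    secondary_tables: {name: (table, links)} where table is
        {sec_id: {attr: value}} and links maps row_id -> [sec_id, ...].
    """
    docs = {}
    for rid, row in main_table.items():
        # One word per attribute-value pair of the main row.
        words = [f"main_{a}_{v}" for a, v in row.items()]
        # Append words from every linked secondary row (one-to-many join).
        for tname, (table, links) in secondary_tables.items():
            for sid in links.get(rid, []):
                words += [f"{tname}_{a}_{v}" for a, v in table[sid].items()]
        docs[rid] = words
    return docs
```

A row `{"color": "red"}` linked to a `part` row `{"shape": "round"}` becomes the document `["main_color_red", "part_shape_round"]`, ready for any bag-of-words learner.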
Comparison of graph-based and logic-based multi-relational data mining
ACM SIGKDD Explorations Newsletter, 2005
We perform an experimental comparison of the graph-based multi-relational data mining system, Subdue, and the inductive logic programming system, CProgol, on the Mutagenesis dataset and various artificially generated Bongard problems. Experimental results indicate that Subdue can significantly outperform CProgol while discovering structurally large multi-relational concepts. It is also observed that CProgol is better at learning semantically complicated concepts and it tends to use background knowledge more effectively than Subdue. An analysis of the results indicates that the differences in the performance of the systems are a result of the difference in the expressiveness of the logic-based and the graph-based representations. The ability of graph-based systems to learn structurally large concepts comes from the use of a weaker representation whose expressiveness is intermediate between propositional and first-order logic. The use of this weaker representation is advantageous while...
An inductive database and query language in the relational model
Proceedings of the 11th international conference on Extending database technology Advances in database technology - EDBT '08, 2008
In the demonstration, we will present the concepts and an implementation of an inductive database, as proposed by Imielinski and Mannila, in the relational model. The goal is to support all steps of the knowledge discovery process, from pre-processing via data mining to post-processing, on the basis of queries to a database system. The query language SIQL (structured inductive query language), an SQL extension, offers query primitives for feature selection, discretization, pattern mining, clustering, instance-based learning and rule induction. A prototype system processing such queries was implemented as part of the SINDBAD (structured inductive database development) project. Key concepts of this system, among others, are the closure of operators and distances between objects. To support the analysis of multi-relational data, we incorporated multi-relational distance measures based on set distances and recursive descent. The inclusion of rule-based classification models made it necessary to extend the data model and the software architecture significantly. The prototype is applied to three different applications: gene expression analysis, gene regulation prediction and structure-activity relationships (SARs) of small molecules.
FOIL-D: Efficiently Scaling FOIL for Multi-relational Data Mining of Large Datasets
Lecture Notes in Computer Science, 2004
Multi-relational rule mining is important for knowledge discovery in relational databases as it allows for discovery of patterns involving multiple relational tables. Inductive logic programming (ILP) techniques have had considerable success on a variety of multi-relational rule mining tasks, however, most ILP systems do not scale to very large datasets. In this paper we present two extensions to a popular ILP system, FOIL, that improve its scalability. (i) We show how to interface FOIL directly to a relational database management system. This enables FOIL to run on data sets that previously had been out of its scope. (ii) We describe estimation methods, based on histograms, that significantly decrease the computational cost of learning a set of rules. We present experimental results that indicate that on a set of standard ILP datasets, the rule sets learned using our extensions are equivalent to those learned with standard FOIL but at considerably less cost.
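The histogram idea can be illustrated with a small, hypothetical sketch: instead of materializing an equi-join to count satisfied tuples, FOIL-style scoring can estimate the count from per-value histograms of the join columns. Here the histograms are exact per-value counts, so the estimate is exact; FOIL-D uses coarser bucketed histograms, trading accuracy for cost.

```python
from collections import Counter

def estimate_join_size(left_vals, right_vals):
    """Estimate |{(l, r) : l == r}| for an equi-join from value histograms.

    Rather than computing the join, multiply the per-value frequencies of
    the two columns and sum over shared values. With exact histograms the
    estimate is exact; with bucketed histograms it becomes approximate.
    """
    lh, rh = Counter(left_vals), Counter(right_vals)
    return sum(lh[v] * rh[v] for v in lh if v in rh)
```

For columns `[1, 1, 2]` and `[1, 2, 2]` this yields 2*1 + 1*2 = 4 matching pairs, without ever enumerating the join.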
Probabilistic logic-based characterization of knowledge discovery in databases
2002
From the perspective of knowledge representation and reasoning as well as for the automation of the knowledge discovery process, we argue that a formal logical foundation is needed for KDD and suggest Bacchus' probability logic is a good choice. It is generally accepted that the unique and most important feature of a KDD system lies in its ability to discover previously unknown and potentially useful patterns. Therefore we give a formal definition of "pattern" as well as its determiners, which are "previously unknown" and "potentially useful", by completely staying within the expressiveness of Bacchus' probability logic language. Furthermore, based on this logic, we propose a logic induction operator that defines the process through which all the potentially useful patterns embedded in the given data can be discovered. Hence, general knowledge discovery (independent of any application) is defined to be any process functionally equivalent to the process specified by this logic induction operator with respect to given data. By customizing the parameters and/or providing more constraints, users can guide the knowledge discovery process to obtain a specific subset of the general previously unknown and potentially useful patterns in order to satisfy their current needs.
SRL2003 IJCAI 2003 Workshop on Learning Statistical Models from Relational Data
2003
This workshop is the second in a series of workshops held in conjunction with AAAI and IJCAI. The first workshop was held in July 2000 at AAAI. Notes from that workshop are available at http://robotics.stanford.edu/srl/. Since the AAAI 2000 workshop, there has been a surge of interest in this area. The efforts have been diffused across a wide collection of sub-areas in computer science including machine learning, database management and theoretical computer science.
Springer eBooks, 2011
Knowing which biological processes each gene and protein participates in is essential for designing disease treatments. Nowadays, these annotations are still unknown for many genes and proteins. Since making annotations from in-vivo experiments is costly, computational predictors are needed for different kinds of annotation such as metabolic pathway, interaction network, protein family, tissue, disease and so on. Biological data has an intrinsic relational structure, including genes and proteins, which can be grouped by many criteria. This hinders the possibility of finding good hypotheses when an attribute-value representation is used. Hence, we propose the generic Modular Multi-Relational Framework (MMRF) to predict different kinds of gene and protein annotation using Relational Data Mining (RDM). The specific MMRF application to annotating human proteins with diseases verifies that group knowledge (mainly protein-protein interaction pairs) improves prediction, notably doubling the area under the precision-recall curve.
Learning graphical models for relational data via lattice search
Machine Learning, 2012
Many machine learning applications that involve relational databases incorporate first-order logic and probability. Relational extensions of graphical models include Parametrized Bayes Net (Poole in IJCAI, pp. 985-991, 2003), Probabilistic Relational Models (Getoor et al. in Introduction to statistical relational learning, pp. 129-173, 2007), and Markov Logic Networks (MLNs) (Domingos and Richardson in Introduction to statistical relational learning, 2007). Many of the current state-of-the-art algorithms for learning MLNs have focused on relatively small datasets with few descriptive attributes, where predicates are mostly binary and the main task is usually prediction of links between entities. This paper addresses what is in a sense a complementary problem: learning the structure of a graphical model that models the distribution of discrete descriptive attributes given the links between entities in a relational database. Descriptive attributes are usually nonbinary and can be very informative, but they increase the search space of possible candidate clauses. We present an efficient new algorithm for learning a Parametrized Bayes Net that performs a level-wise search through the table join lattice for relational dependencies. From the Bayes net we obtain an MLN structure via a standard moralization procedure for converting directed models to undirected models. Learning MLN structure by moralization is 200-1000 times faster and scores substantially higher in predictive accuracy than benchmark MLN algorithms on five relational databases.
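The moralization step used to convert the learned Bayes net into an MLN structure is a standard graph operation, sketched here under a toy representation (a DAG given as a child-to-parents dict; the representation is an assumption for illustration): marry each node's parents, then drop edge directions.

```python
def moralize(parents):
    """Moralize a DAG given as {node: [parent, ...]}.

    Returns an undirected edge set (frozensets of endpoints): every
    directed edge loses its direction, and all co-parents of a common
    child are connected ('married').
    """
    edges = set()
    for child, ps in parents.items():
        for p in ps:
            edges.add(frozenset((p, child)))      # drop direction
        for a in ps:
            for b in ps:
                if a != b:
                    edges.add(frozenset((a, b)))  # marry co-parents
    return edges
```

For the collider `a -> c <- b`, moralization adds the marriage edge `a - b`, giving the three undirected edges of a triangle.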
Relational Data Mining with Inductive Logic Programming for Link Discovery
2002
Link discovery (LD) is an important task in data mining for counter-terrorism and is the focus of DARPA's Evidence Extraction and Link Discovery (EELD) research program. Link discovery concerns the identification of complex relational patterns that indicate potentially threatening activities in large amounts of relational data. Most data-mining methods assume data is in the form of a feature-vector (a single relational table) and cannot handle multi-relational data. Inductive logic programming is a form of relational data mining that discovers rules in first-order logic from multi-relational data. This paper discusses the application of ILP to learning patterns for link discovery.
Statistical relational learning: Four claims and a survey
2003
Statistical relational learning (SRL) research has made significant progress over the last 5 years. We have successfully demonstrated the feasibility of a number of probabilistic models for relational data, including probabilistic relational models, Bayesian logic programs, and relational probability trees, and the interest in SRL is growing. However, in order to sustain and nurture the growth of SRL as a subfield we need to refocus our efforts on the science of machine learning—moving from demonstrations to comparative and ablation studies.
Database Mining through Inductive Logic Programming
2007
Rapid growth in the automation of business transactions has led to an explosion in the size of databases. It has been realised for a long time that the data in these databases contains hidden information which needs to be extracted. Data mining is a step in this ...
On Multi-Relational Data Mining for Foundation of Data Mining
2007 IEEE/ACS International Conference on Computer Systems and Applications, 2007
Multi-Relational Data Mining (MRDM) deals with knowledge discovery from relational databases consisting of one or multiple tables. As a typical technique for MRDM, inductive logic programming (ILP) has the power of dealing with reasoning related to various data mining tasks in a "unified" way. Like granular computing (GrC), ILP-based MRDM models the data and the mining process on these data through the intension and extension of concepts. Unlike GrC, however, the inference ability of ILP-based MRDM lies in the powerful Prolog-like search engine. Although this important feature suggests that, through ILP, MRDM can contribute to the foundation of data mining (FDM), the interesting perspective of "ILP-based MRDM for FDM" has not been investigated in the past. In this paper, we examine this perspective. We provide justification and observations, and report results of related experiments. The primary objective of this paper is to draw the attention of FDM researchers to the ILP-based MRDM perspective.
Statistical Relational Learning: A State-Of-The-Art Review
Journal of Engineering and Technology, 2019
The objective of this paper is to review the state-of-the-art of statistical relational learning (SRL) models developed to deal with machine learning and data mining in relational domains in the presence of missing, partially observed, and/or noisy data. It starts by giving a general overview of conventional graphical models, first-order logic and inductive logic programming approaches as needed for background. The historical development of each SRL key model is critically reviewed. The study also focuses on the practical application of SRL techniques to a broad variety of areas and their limitations.
A multistrategy approach to relational knowledge discovery in databases
1997
When learning from very large databases, the reduction of complexity is of highest importance. Two extremes of making knowledge discovery in databases (KDD) feasible have been put forward. One extreme is to choose a most simple hypothesis language and so be capable of very fast learning on real-world databases. The opposite extreme is to select a small data set and be capable of learning very expressive (first-order logic) hypotheses. A multistrategy approach makes it possible to combine most of the advantages and exclude most of the disadvantages. Simpler learning algorithms detect hierarchies that are used to structure the hypothesis space for a more complex learning algorithm. The better structured the hypothesis space is, the more effectively learning can prune away uninteresting or losing hypotheses, and the faster it becomes. We have combined inductive logic programming (ILP) directly with a relational database. The ILP algorithm is controlled in a model-driven way by the user and in a data-driven way by structures that are induced by three simple learning algorithms.
Introduction to statistical relational learning
2007
Handling inherent uncertainty and exploiting compositional structure are fundamental to understanding and designing large-scale systems. Statistical relational learning builds on ideas from probability theory and statistics to address uncertainty while incorporating tools from logic, databases, and programming languages to represent structure.