Statistical relational learning for document mining
Related papers
Learning statistical models from relational data
2011
Statistical Relational Learning (SRL) is a subarea of machine learning that combines elements of statistical and probabilistic modeling with languages that support structured data representations.
A Logic-Based Approach to Mining Inductive Databases
Lecture Notes in Computer Science, 2007
In this paper, we discuss the main problems of inductive query languages and optimisation issues. We present a logic-based inductive query language, illustrate the use of aggregates, and exploit a new join operator to model specific data mining tasks. We show how a fixpoint operator works for association rule mining and a clustering method. Preliminary experimental results show that the fixpoint operator outperforms SQL- and Apriori-based methods. The results of our framework could be useful for inductive query language design in the development of inductive database systems.
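The abstract does not reproduce the query language itself, but the fixpoint view of association rule mining can be illustrated with a small sketch: repeatedly apply an operator that extends the current set of frequent itemsets until nothing new is added. The function names, data layout, and minimum-support threshold below are illustrative assumptions, not the authors' notation.

```python
def frequent_itemsets_fixpoint(transactions, min_support):
    """Level-wise frequent itemset mining phrased as a fixpoint computation:
    repeatedly apply an 'extend with a frequent item' operator until no new
    frequent itemsets are produced."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    singletons = {frozenset([i]) for t in transactions for i in t}
    frequent_items = {s for s in singletons if support(s) >= min_support}

    result = set(frequent_items)
    level = frequent_items
    while level:  # fixpoint reached when the operator adds nothing new
        k = len(next(iter(level))) + 1
        candidates = {a | b for a in level for b in frequent_items if len(a | b) == k}
        level = {c for c in candidates
                 if support(c) >= min_support and c not in result}
        result |= level
    return result


# Toy transactions: each frozenset is the set of items in one "row"
txns = [frozenset("abc"), frozenset("abd"), frozenset("bcd"), frozenset("ab")]
print(frequent_itemsets_fixpoint(txns, min_support=0.5))
```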
A Wordification Approach to Relational Data Mining
Lecture Notes in Computer Science, 2013
This paper describes a propositionalization technique called wordification. Wordification is inspired by text mining and can be seen as a transformation of a relational database into a corpus of documents. As in previous propositionalization methods, any propositional data mining algorithm can be applied after the wordification step. The most notable advantage of the presented technique is greater scalability: for one-to-many databases, the propositionalization step runs in time linear in the number of attributes times the number of examples. Furthermore, wordification yields an easily understandable propositional feature representation. We present our initial experiments on two real-life datasets.
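As a rough illustration of the wordification idea (the table and column names below are toy assumptions, not the datasets used in the paper): each row of the target table becomes a "document" whose "words" encode attribute values of that row and of the rows linked to it in secondary tables, after which any bag-of-words learner can be applied.

```python
from collections import Counter

# Toy relational data: a target table (customers) and a one-to-many table (orders).
customers = {1: {"age": "young"}, 2: {"age": "old"}}
orders = [
    {"customer_id": 1, "product": "book"},
    {"customer_id": 1, "product": "pen"},
    {"customer_id": 2, "product": "book"},
]

def wordify(customer_id):
    """Turn one target-table row into a bag of 'words' of the form
    table__attribute__value, drawn from the row itself and its linked rows."""
    words = [f"customer__age__{customers[customer_id]['age']}"]
    for o in orders:
        if o["customer_id"] == customer_id:
            words.append(f"order__product__{o['product']}")
    return Counter(words)

for cid in customers:
    print(cid, dict(wordify(cid)))
```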
Research on statistical relational learning at the University of Washington
2003
This paper presents an overview of the research on learning statistical models from relational data being carried out at the University of Washington. Our work falls into five main directions: learning models of social networks; learning models of sequential relational processes; scaling up statistical relational learning to massive data sources; learning for knowledge integration; and learning programs in procedural languages. We describe some of the common themes and research issues arising from this work.
An inductive database and query language in the relational model
Proceedings of the 11th International Conference on Extending Database Technology (EDBT '08), 2008
In the demonstration, we will present the concepts and an implementation of an inductive database, as proposed by Imielinski and Mannila, in the relational model. The goal is to support all steps of the knowledge discovery process, from pre-processing via data mining to post-processing, on the basis of queries to a database system. The query language SIQL (structured inductive query language), an SQL extension, offers query primitives for feature selection, discretization, pattern mining, clustering, instance-based learning and rule induction. A prototype system processing such queries was implemented as part of the SINDBAD (structured inductive database development) project. Key concepts of this system are, among others, the closure of operators and distances between objects. To support the analysis of multi-relational data, we incorporated multi-relational distance measures based on set distances and recursive descent. The inclusion of rule-based classification models made it necessary to extend the data model and the software architecture significantly. The prototype is applied to three different applications: gene expression analysis, gene regulation prediction and structure-activity relationships (SARs) of small molecules.
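The abstract mentions multi-relational distance measures based on set distances; one common family compares two objects through the sets of tuples linked to them, for example by averaging nearest-neighbour distances in both directions. The sketch below illustrates that general idea with a simple tuple distance; it is not SINDBAD's actual implementation, and all names are placeholders.

```python
def tuple_distance(a, b):
    """Fraction of positions on which two equal-length tuples differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def set_distance(A, B, d=tuple_distance):
    """Average-minimum set distance: map each element of one set to its nearest
    element of the other set, average those distances, and symmetrise."""
    if not A or not B:
        return 0.0 if not A and not B else 1.0
    left = sum(min(d(a, b) for b in B) for a in A) / len(A)
    right = sum(min(d(a, b) for a in A) for b in B) / len(B)
    return (left + right) / 2

# Two objects described by the sets of tuples related to them (e.g. their orders)
x = {("book", "paid"), ("pen", "paid")}
y = {("book", "refunded")}
print(set_distance(x, y))  # 0.625 with this toy tuple distance
```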
FOIL-D: Efficiently Scaling FOIL for Multi-relational Data Mining of Large Datasets
Lecture Notes in Computer Science, 2004
Multi-relational rule mining is important for knowledge discovery in relational databases, as it allows for the discovery of patterns involving multiple relational tables. Inductive logic programming (ILP) techniques have had considerable success on a variety of multi-relational rule mining tasks; however, most ILP systems do not scale to very large datasets. In this paper we present two extensions to a popular ILP system, FOIL, that improve its scalability. (i) We show how to interface FOIL directly to a relational database management system, which enables FOIL to run on datasets that had previously been out of its scope. (ii) We describe estimation methods, based on histograms, that significantly decrease the computational cost of learning a set of rules. We present experimental results indicating that, on a set of standard ILP datasets, the rule sets learned using our extensions are equivalent to those learned with standard FOIL but at considerably less cost.
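A generic way to picture the histogram idea (this is a sketch of standard selectivity estimation, not the exact estimators used in FOIL-D): keep a value-count histogram per column and estimate how many tuple pairs an equality join would produce from the histograms alone, instead of executing the join.

```python
from collections import Counter

def histogram(rows, col):
    """Value-count histogram for one column of a relation."""
    return Counter(r[col] for r in rows)

def estimate_equijoin_size(h_r, h_s):
    """Estimate |R JOIN S| on an equality condition from the two column
    histograms alone (exact with full value counts; an approximation once
    the histograms are bucketed)."""
    return sum(count * h_s.get(value, 0) for value, count in h_r.items())

R = [{"k": 1}, {"k": 1}, {"k": 2}]
S = [{"k": 1}, {"k": 2}, {"k": 2}, {"k": 3}]
print(estimate_equijoin_size(histogram(R, "k"), histogram(S, "k")))  # 2*1 + 1*2 = 4
```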
Probabilistic logic-based characterization of knowledge discovery in databases
2002
From the perspective of knowledge representation and reasoning, as well as for the automation of the knowledge discovery process, we argue that a formal logical foundation is needed for KDD, and we suggest that Bacchus' probability logic is a good choice. It is generally accepted that the unique and most important feature of a KDD system lies in its ability to discover previously unknown and potentially useful patterns. We therefore give a formal definition of "pattern", as well as of its determiners "previously unknown" and "potentially useful", while staying entirely within the expressiveness of Bacchus' probability logic language. Furthermore, based on this logic, we propose a logic induction operator that defines the process through which all the potentially useful patterns embedded in the given data can be discovered. Hence, general knowledge discovery (independent of any application) is defined to be any process functionally equivalent to the process specified by this logic induction operator with respect to given data. By customizing the parameters and/or providing more constraints, users can guide the knowledge discovery process to obtain a specific subset of the general previously unknown and potentially useful patterns in order to satisfy their current needs.
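As a hedged illustration of what such a formalization might look like (generic proportion-term notation, not the paper's exact definition or Bacchus' precise syntax): a candidate pattern can be rendered as a thresholded conditional proportion statement over the data.

```latex
% Illustrative only: a "pattern" sketched as a thresholded conditional
% proportion statement in a Bacchus-style statistical probability logic.
% \varphi is the condition, \psi the conclusion, c a user-chosen confidence.
\[
  [\, \psi(\vec{x}) \mid \varphi(\vec{x}) \,]_{\vec{x}} \;\ge\; c
\]
% Read: among the domain tuples satisfying \varphi, at least a fraction c
% also satisfy \psi; "potentially useful" could then be expressed via further
% threshold constraints, e.g. on the support term [\varphi(\vec{x})]_{\vec{x}}.
```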
SRL2003 IJCAI 2003 Workshop on Learning Statistical Models from Relational Data
2003
This workshop is the second in a series of workshops held in conjunction with AAAI and IJCAI. The first workshop was held in July 2000 at AAAI. Notes from that workshop are available at http://robotics.stanford.edu/srl/. Since the AAAI 2000 workshop, there has been a surge of interest in this area. The efforts have been diffused across a wide collection of sub-areas in computer science, including machine learning, database management and theoretical computer science.