Approximate Reasoning for Efficient Anytime Induction from Relational Knowledge Bases

Approximate Relational Reasoning by Stochastic Propositionalization

Studies in Computational Intelligence, 2010

For many real-world applications it is important to choose the right representation language. While the setting of First Order Logic (FOL) is the most suitable one for modelling the multi-relational data of real, complex domains, it also raises the question of the computational complexity of the knowledge induction process. A way of tackling the complexity of such domains, in which many relationships are required to model the objects involved, is to use a method that reformulates a multi-relational learning task into an attribute-value one. In this chapter we present an approximate reasoning method that keeps the complexity of a relational problem low by using a stochastic inference procedure. The complexity of the relational language is decreased by means of a propositionalization technique, while the NP-completeness of deduction is tackled using an approximate query evaluation. The proposed approximate reasoning technique has been used to solve the problem of relational rule induction as well as the task of relational clustering. An anytime algorithm, implemented by a population-based method, has been used for the induction, able to efficiently extract knowledge from relational data, while the clustering task, both unsupervised and supervised, has been solved using a Partitioning Around Medoids (PAM) clustering algorithm. The validity of the proposed techniques has been demonstrated through an empirical evaluation on real-world datasets.
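To make the approximate query evaluation concrete, here is a minimal sketch of a stochastic coverage test: instead of an exhaustive (NP-complete) subsumption check, it samples candidate substitutions and reports whether any sampled grounding satisfies the clause body. All names (`facts`, `body`, `n_samples`) and the toy knowledge base are illustrative assumptions, not taken from the chapter.

```python
import random

# A stochastic coverage test: instead of an exhaustive subsumption check,
# sample candidate substitutions and report whether any sampled grounding
# satisfies the clause body. A negative answer may be a false negative.

def satisfied(literal, subst, facts):
    """Check one literal (pred, args) under a substitution against ground facts."""
    pred, args = literal
    grounded = tuple(subst.get(a, a) for a in args)
    return (pred, grounded) in facts

def stochastic_covers(body, constants, facts, n_samples=500, seed=0):
    """Estimate whether some grounding of `body` holds, by random sampling."""
    rng = random.Random(seed)
    variables = sorted({a for _, args in body for a in args if a.isupper()})
    for _ in range(n_samples):
        subst = {v: rng.choice(constants) for v in variables}
        if all(satisfied(lit, subst, body_facts := facts) for lit in body):
            return True   # a witness grounding was found
    return False          # none found within the sample budget

# Toy knowledge base: part of a molecule's bond structure (illustrative only).
facts = {("bond", ("a1", "a2")), ("bond", ("a2", "a3")), ("atom_c", ("a1",))}
body = [("atom_c", ("X",)), ("bond", ("X", "Y"))]   # a carbon atom with a bond
print(stochastic_covers(body, ["a1", "a2", "a3"], facts))   # True (w.h.p.)
```

The sample budget trades the soundness of a negative answer for speed, which is what makes such a test usable inside an anytime induction loop: a larger budget yields a more reliable coverage estimate.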

Stochastic Propositionalization for Efficient Multi-relational Learning

Lecture Notes in Computer Science, 2008

The efficiency of multi-relational data mining algorithms, which address the problem of learning First Order Logic (FOL) theories, strongly depends on the search method used for exploring the hypothesis space and on the coverage test assessing the validity of the learned theory against the training examples. A way of tackling the complexity of this kind of learning system is to use a propositional method that reformulates a multi-relational learning problem into an attribute-value one. We propose a population-based algorithm that efficiently learns complete FOL definitions using a stochastic propositionalization method, as sketched below.
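As a hedged illustration of the reformulation step, the sketch below samples short conjunctions of literals as boolean features and recodes each relational example as a fixed-length attribute-value vector; the paper's actual feature language and population dynamics are richer, and every name here is an assumption.

```python
import random

# Stochastic propositionalization: sample short conjunctions of literals as
# boolean features, then recode each relational example (a set of ground
# facts) as a fixed-length attribute-value vector.

def sample_feature(predicates, max_len, rng):
    """Draw a random conjunction over variables X and Y (a candidate feature)."""
    return [(rng.choice(predicates), ("X", "Y"))
            for _ in range(rng.randint(1, max_len))]

def holds(feature, example_facts, constants):
    """Exhaustive check, fine for tiny examples; a stochastic test scales better."""
    for x in constants:
        for y in constants:
            subst = {"X": x, "Y": y}
            if all((p, tuple(subst[a] for a in args)) in example_facts
                   for p, args in feature):
                return True
    return False

rng = random.Random(1)
features = [sample_feature(["bond", "near"], 2, rng) for _ in range(5)]

# Two toy relational examples, each a set of ground facts.
examples = [{("bond", ("a", "b")), ("near", ("a", "b"))},
            {("near", ("b", "c"))}]
table = [[holds(f, ex, ["a", "b", "c"]) for f in features] for ex in examples]
print(table)   # boolean attribute-value table, one row per relational example
```

Once the table exists, any standard attribute-value learner can be run on it, which is the efficiency argument the abstract makes.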

On Multi-Relational Data Mining for Foundation of Data Mining

2007 IEEE/ACS International Conference on Computer Systems and Applications, 2007

Multi-Relational Data Mining (MRDM) deals with knowledge discovery from relational databases consisting of one or multiple tables. As a typical technique for MRDM, inductive logic programming (ILP) has the power of dealing with reasoning related to various data mining tasks in a "unified" way. Like granular computing (GrC), ILP-based MRDM models the data and the mining process on these data through the intension and extension of concepts. Unlike GrC, however, the inference ability of ILP-based MRDM lies in its powerful Prolog-like search engine. Although this important feature suggests that, through ILP, MRDM can contribute to the foundation of data mining (FDM), the interesting perspective of "ILP-based MRDM for FDM" has not been investigated in the past. In this paper, we examine this perspective. We provide justification and observations, and report results of related experiments. The primary objective of this paper is to draw the attention of FDM researchers to the ILP-based MRDM perspective.

Inductive Logic Programming

Springer eBooks, 2011

Knowledge of the biological processes in which each gene and protein participates is essential for designing disease treatments. Nowadays, these annotations are still unknown for many genes and proteins. Since making annotations from in-vivo experiments is costly, computational predictors are needed for different kinds of annotation, such as metabolic pathway, interaction network, protein family, tissue, disease and so on. Biological data, including genes and proteins, has an intrinsic relational structure and can be grouped by many criteria. This hinders the possibility of finding good hypotheses when an attribute-value representation is used. Hence, we propose the generic Modular Multi-Relational Framework (MMRF) to predict different kinds of gene and protein annotation using Relational Data Mining (RDM). The specific MMRF application to annotating human proteins with diseases verifies that group knowledge (mainly protein-protein interaction pairs) improves the prediction, in particular doubling the area under the precision-recall curve.

A first-order representation for knowledge discovery and Bayesian classification on relational data

2000

In this paper we consider different representations for relational learning problems, with the aim of making ILP methods more applicable to real-world problems. In the past, ILP tended to concentrate on the term representation, with the flattened Datalog representation as a 'poor man's version'. There has been relatively little emphasis on database-oriented representations using, e.g., the relational data model or the Entity-Relationship model. On the other hand, much of the available data is stored in multi-relational databases. Even if we don't actually interface our ILP systems with a DBMS, we need to understand the database representation sufficiently in order to convert it to an ILP representation. Such conversions and the relations between different representations are the subject of this paper. We consider four different representations: the Entity-Relationship model, the relational model, a flattened individual-centred representation based on the so-called ISP declarations we use for our ILP systems Tertius and 1BC, and the term-based representation. We argue that the term-based representation does not have all the flexibility and expressiveness provided by the other representations. For instance, there is no way to deal with graphs without partly flattening the data (i.e., introducing identifiers). Furthermore, there is no easy way of switching to another individual without converting the data, let alone learning with different individual types. The flattened representation has clear advantages in these respects.
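The contrast between the term-based and flattened representations can be made concrete with a small example. The sketch below is ours, with hypothetical names, and only illustrates the graph and individual-switching arguments made above.

```python
# Term-based view: an individual is one nested ground term. Trees fit
# naturally, but a graph with a cycle cannot be written as a finite nested
# term without repeating subterms, so identifiers must be introduced,
# i.e., the data ends up partly flattened anyway.
term_molecule = ("molecule",
                 [("atom", "c", [("bond", ("atom", "o", []))])])   # tree-shaped

# Flattened (Datalog-style) view: individuals get identifiers and structure
# lives in relations, so shared parts and cycles pose no problem.
atom = {("m1", "a1", "c"), ("m1", "a2", "o")}     # atom(Mol, AtomId, Elem)
bond = {("m1", "a1", "a2"), ("m1", "a2", "a1")}   # bond(Mol, From, To): a cycle

# Switching the individual type (e.g., from molecules to atoms) only changes
# which identifier we group by; the term view would need a full re-nesting.
atoms_of_m1 = sorted(a for (m, a, _) in atom if m == "m1")
print(atoms_of_m1)   # ['a1', 'a2']
```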

Relational Data Mining with Inductive Logic Programming for Link Discovery

2002

Link discovery (LD) is an important task in data mining for counter-terrorism and is the focus of DARPA's Evidence Extraction and Link Discovery (EELD) research program. Link discovery concerns the identification of complex relational patterns that indicate potentially threatening activities in large amounts of relational data. Most data-mining methods assume the data is in the form of a feature vector (a single relational table) and cannot handle multi-relational data. Inductive logic programming is a form of relational data mining that discovers rules in first-order logic from multi-relational data. This paper discusses the application of ILP to learning patterns for link discovery.

Filtering Multi-Instance Problems to Reduce Dimensionality in Relational Learning

Journal of Intelligent Information Systems, 2004

Attribute-value based representations, standard in today's data mining systems, have a limited expressiveness. Inductive Logic Programming provides an interesting alternative, particularly for learning from structured examples whose parts, each with its own attributes, are related to each other by means of first-order predicates. Several subsets of first-order logic (FOL) with different expressive power have been proposed in Inductive Logic Programming (ILP). The challenge lies in the fact that the more expressive the subset of FOL the learner works with, the more critical the dimensionality of the learning task becomes. The Datalog language is expressive enough to represent realistic learning problems when data is given directly in a relational database, making it a suitable tool for data mining. Consequently, it is important to elaborate techniques that dynamically decrease the dimensionality of learning tasks expressed in Datalog, just as Feature Subset Selection (FSS) techniques do in attribute-value learning. The idea of re-using these techniques in ILP immediately runs into a problem, as ILP examples have variable size and do not share the same set of literals. We propose here the first paradigm that brings Feature Subset Selection to the level of ILP, in languages at least as expressive as Datalog. The main idea is to first perform a change of representation, which approximates the original relational problem by a multi-instance problem. The resulting representation is suitable for FSS techniques, which we adapted from attribute-value learning by taking into account some of the characteristics the data acquires through the change of representation. We present the simple FSS algorithm proposed for the task, the requisite change of representation, and the entire method combining these two algorithms. The method acts as a filter that preprocesses the relational data prior to model building and outputs relational examples with the empirically relevant literals. We discuss experiments in which the method was successfully applied to two real-world domains.
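A minimal sketch of the two-step scheme described above, under assumed data shapes: (1) approximate each relational example by a multi-instance bag of fixed-length tuples; (2) run a simple relevance filter over the resulting attribute positions. The scoring rule used here (a class-wise difference of value sets) is our placeholder, not the paper's actual FSS criterion.

```python
# Step 1: multi-instance view -- one instance per ground literal of a given
# predicate, so every example becomes a bag of fixed-length tuples.
def to_bag(example_facts, pred, arity):
    return [args for p, args in example_facts if p == pred and len(args) == arity]

# Step 2: a simple relevance filter over attribute positions. The score
# (size of the symmetric difference of value sets across classes) is a
# placeholder for the paper's actual FSS criterion.
def attribute_relevance(bags_pos, bags_neg, attr_index):
    def values(bags):
        return {inst[attr_index] for bag in bags for inst in bag}
    return len(values(bags_pos) ^ values(bags_neg))

pos = [{("bond", ("c", "o")), ("bond", ("c", "h"))}]   # positive example(s)
neg = [{("bond", ("h", "h"))}]                         # negative example(s)
bags_pos = [to_bag(e, "bond", 2) for e in pos]
bags_neg = [to_bag(e, "bond", 2) for e in neg]
print([attribute_relevance(bags_pos, bags_neg, i) for i in (0, 1)])   # [2, 1]
```

Attributes scoring below a threshold would then be dropped, and the corresponding literals filtered out of the relational examples before model building.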

Stochastic propositionalization of relational data using aggregates

The fact that data is already stored in relational databases causes many problems in the practice of data mining. To deal with this problem, one either constructs a single table by hand, or one uses a Multi-Relational Data Mining algorithm. In this paper, we propose an approach in which the single table is constructed automatically using aggregate functions, which repeatedly summarize information from different tables over associations in the relational database. Following the construction of the single table, we recommend applying traditional data mining algorithms. In addition to an in-depth discussion of our approach, the paper reports the testing of our algorithm on two well-known data sets.
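The aggregate construction can be illustrated with a short sketch; the tables, column names, and choice of aggregate functions below are our assumptions, not the paper's data sets.

```python
import pandas as pd

# Summarize a one-to-many association (orders per customer) into columns of a
# single table via aggregate functions, then hand the result to a standard
# propositional learner. Table and column names are illustrative.

customers = pd.DataFrame({"cust_id": [1, 2], "region": ["north", "south"]})
orders = pd.DataFrame({"cust_id": [1, 1, 2], "amount": [10.0, 25.0, 7.5]})

# One new column per (attribute, aggregate function) pair over the association.
summary = (orders.groupby("cust_id")["amount"]
                 .agg(["count", "sum", "mean", "max"])
                 .reset_index())
single_table = customers.merge(summary, on="cust_id", how="left")
print(single_table)
# The same construction can be applied repeatedly over chains of associations,
# nesting aggregates (e.g., a max over per-order sums of line items).
```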