Approximate Relational Reasoning by Stochastic Propositionalization
Related papers
Approximate Reasoning for Efficient Anytime Induction from Relational Knowledge Bases
Lecture Notes in Computer Science, 2008
In most real-world applications, the choice of the right representation language is a fundamental issue, since it affects the opportunities for generalization and can make inductive reasoning computationally easier or harder. While First Order Logic (FOL) is the most suitable setting for modeling the multi-relational data of real, complex domains, it raises the question of the computational complexity of knowledge induction, which is a challenge for multi-relational data mining algorithms. Indeed, the complexity of most real domains, in which many relationships are needed to model the objects involved, calls for both an efficient and effective search method for exploring the space of candidate solutions and a deduction procedure for assessing the validity of the discovered knowledge. One way of tackling the complexity of such domains is to reformulate the multi-relational learning task into an attribute-value one. In this paper we propose an approximate reasoning technique that decreases the complexity of a relational problem by changing both the language and the inference operation used for deduction. The complexity of the FOL language is reduced by means of a stochastic propositionalization method, while the NP-completeness of deduction is tackled using an approximate query evaluation. Induction is performed with an anytime algorithm, implemented by a population-based method, able to efficiently extract knowledge from structured data in the form of complete FOL definitions. The validity of the proposed technique is demonstrated by an empirical evaluation on a real-world dataset.
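A minimal sketch of the two ingredients the abstract combines, stochastic propositionalization plus an approximate (sampled) coverage test, assuming examples are sets of ground atoms such as ("bond", "a1", "a2"); all names and parameters are illustrative, not the authors' implementation:

```python
import random

def sample_feature(predicates, max_literals=3, variables=("X", "Y", "Z")):
    """Stochastic step: randomly assemble a conjunctive feature (clause body)."""
    n = random.randint(1, max_literals)
    return [(random.choice(predicates), random.choice(variables), random.choice(variables))
            for _ in range(n)]

def approx_covers(feature, example, n_samples=50):
    """Approximate deduction: try randomly sampled substitutions instead of a
    full (NP-complete) subsumption test; may miss a covering substitution."""
    consts = sorted({c for atom in example for c in atom[1:]})
    if not consts:
        return False
    variables = {v for _, a, b in feature for v in (a, b)}
    for _ in range(n_samples):
        theta = {v: random.choice(consts) for v in variables}
        if all((p, theta[a], theta[b]) in example for p, a, b in feature):
            return True
    return False

def propositionalize(examples, predicates, n_features=20):
    """Reformulate relational examples as boolean attribute-value vectors."""
    features = [sample_feature(predicates) for _ in range(n_features)]
    return features, [[approx_covers(f, ex) for f in features] for ex in examples]

# Toy usage: two 'molecules' described by bond/2 facts.
ex1 = {("bond", "a1", "a2"), ("bond", "a2", "a3")}
ex2 = {("bond", "b1", "b1")}
features, table = propositionalize([ex1, ex2], ["bond"])
```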
Stochastic Propositionalization for Efficient Multi-relational Learning
Lecture Notes in Computer Science, 2008
The efficiency of multi-relational data mining algorithms that learn First Order Logic (FOL) theories strongly depends on the search method used for exploring the hypothesis space and on the coverage test assessing the validity of the learned theory against the training examples. One way of tackling the complexity of this kind of learning system is to use a propositional method that reformulates a multi-relational learning problem into an attribute-value one. We propose a population-based algorithm that uses a stochastic propositionalization method to efficiently learn complete FOL definitions.
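A minimal sketch of the kind of population-based search such an algorithm could run over a propositionalized table, assuming hypotheses are bit-masks selecting boolean features and fitness is training accuracy of a deliberately crude rule; everything here is illustrative:

```python
import random

def fitness(mask, table, labels):
    """Score a feature subset: predict positive when an example satisfies
    every selected feature (a crude stand-in for a real propositional learner)."""
    preds = [all(row[i] for i, keep in enumerate(mask) if keep) for row in table]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def evolve(table, labels, pop_size=20, generations=30, mut_rate=0.1):
    """Generational loop: keep the fitter half, refill with mutated copies."""
    n = len(table[0])
    pop = [[random.random() < 0.5 for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda m: fitness(m, table, labels), reverse=True)
        survivors = pop[: pop_size // 2]
        pop = survivors + [[bit ^ (random.random() < mut_rate) for bit in m]
                           for m in survivors]
    return max(pop, key=lambda m: fitness(m, table, labels))

# Toy usage on a hand-made boolean table.
table = [[True, False, True], [True, True, False], [False, True, True]]
labels = [True, True, False]
best = evolve(table, labels)
```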
Stochastic propositionalization of relational data using aggregates
The fact that data is often already stored in relational databases causes many problems in the practice of data mining. To deal with this, one either constructs a single table by hand, or one uses a multi-relational data mining algorithm. In this paper, we propose an approach in which the single table is constructed automatically using aggregate functions, which repeatedly summarize information from different tables over associations in the relational database. Following the construction of the single table, we recommend applying traditional data mining algorithms. Alongside an in-depth discussion of our approach, the paper reports experiments with our algorithm on two well-known data sets.
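A minimal sketch of the idea in pandas, assuming a one-to-many association from a target table (customers) to a detail table (orders); table and column names are invented for illustration:

```python
import pandas as pd

customers = pd.DataFrame({"cust_id": [1, 2], "region": ["north", "south"]})
orders = pd.DataFrame({"cust_id": [1, 1, 2], "amount": [10.0, 25.0, 7.5]})

# Summarize the one-to-many association with aggregate functions, then join
# the summaries back so a traditional single-table learner can be applied.
summary = orders.groupby("cust_id")["amount"].agg(["count", "mean", "max"])
single_table = customers.join(summary, on="cust_id").fillna(0)
```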
Springer eBooks, 2011
Knowledge of the biological processes in which each gene and protein participates is essential for designing disease treatments. Nowadays, these annotations are still unknown for many genes and proteins. Since making annotations from in-vivo experiments is costly, computational predictors are needed for different kinds of annotation, such as metabolic pathway, interaction network, protein family, tissue, disease and so on. Biological data has an intrinsic relational structure, including genes and proteins that can be grouped by many criteria. This hinders the discovery of good hypotheses when an attribute-value representation is used. Hence, we propose the generic Modular Multi-Relational Framework (MMRF) to predict different kinds of gene and protein annotation using Relational Data Mining (RDM). The specific MMRF application to annotating human proteins with diseases verifies that group knowledge (mainly protein-protein interaction pairs) improves the prediction, notably doubling the area under the precision-recall curve.
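The reported improvement is measured by the area under the precision-recall curve; a minimal sketch of computing that metric with scikit-learn on invented scores:

```python
from sklearn.metrics import average_precision_score

y_true = [1, 0, 1, 1, 0, 0]                       # illustrative annotations
y_score = [0.9, 0.4, 0.75, 0.6, 0.3, 0.55]        # illustrative predictor scores
auprc = average_precision_score(y_true, y_score)  # area under the PR curve
```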
Probabilistic Classification and Clustering in Relational Data
2001
Supervised and unsupervised learning methods have traditionally focused on data consisting of independent instances of a single type. However, many real-world domains are best described by relational models in which instances of multiple types are related to each other in complex ways. For example, in a scientific paper domain, papers are related to each other via citation, and are also related to their authors. In this case, the label of one entity (e.g., the topic of the paper) is often correlated with the labels of related entities. We propose a general class of models for classification and clustering in relational domains that capture probabilistic dependencies between related instances. We show how to learn such models efficiently from data. We present empirical results on two real-world data sets. Our experiments in a transductive classification setting indicate that accuracy can be significantly improved by modeling relational dependencies. Our algorithm automatically induces a very natural behavior, where our knowledge about one instance helps us classify related ones, which in turn help us classify others. In an unsupervised setting, our models produced coherent clusters with a very natural interpretation, even for instance types that do not have any attributes.
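A minimal sketch of the collective behavior described, where beliefs about labeled papers propagate to related ones over a citation graph; the simple relaxation scheme below is a generic stand-in, not the authors' probabilistic relational model:

```python
# Undirected citation links and prior beliefs (probability of topic A).
edges = {"p1": ["p2"], "p2": ["p1", "p3"], "p3": ["p2", "p4"], "p4": ["p3"]}
belief = {"p1": 1.0, "p4": 0.0}            # observed labels
belief.update({"p2": 0.5, "p3": 0.5})      # unlabeled papers start uninformed

for _ in range(20):                        # iterate until beliefs settle
    for paper in ("p2", "p3"):             # update only unlabeled papers
        neighbors = edges[paper]
        belief[paper] = sum(belief[n] for n in neighbors) / len(neighbors)
# p2 converges toward p1's label and p3 toward p4's: knowledge about one
# instance helps classify related ones, which in turn help classify others.
```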
Stochastic propositionalization of non-determinate background knowledge
Both propositional and relational learning algorithms require a good representation to perform well in practice. Usually such a representation is either engineered manually by domain experts or derived automatically by means of so-called constructive induction. Inductive Logic Programming (ILP) algorithms place somewhat less of a burden on the data engineering effort, as they allow for a structured, relational representation of background knowledge. In chemical and engineering domains, a common representational device for graph-like structures is the so-called non-determinate relation. Manually engineered features in such domains typically test for, or count occurrences of, specific substructures having specific properties. However, representations containing non-determinate relations pose a serious efficiency problem for most standard ILP algorithms. We have therefore devised a stochastic algorithm to automatically derive features from non-determinate background knowledge. The algorithm conducts a top-down search for first-order clauses, where each clause represents a binary feature. These features are used instead of the non-determinate relations in a subsequent induction step. In contrast to comparable algorithms, the search is not class-blind and no arbitrary size restrictions are imposed on candidate clauses. An empirical investigation in three chemical domains supports the validity and usefulness of the proposed algorithm.
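A minimal sketch of a class-sensitive, stochastic top-down clause search, assuming examples are sets of ground atoms like ("bond", "a1", "a2") and a clause is refined by appending literals; the sampling, beam, and scoring are illustrative, not the paper's exact procedure:

```python
import random
from itertools import product

def refinements(clause, predicates, variables=("X", "Y", "Z")):
    """One top-down refinement step: append a single literal."""
    return [clause + [(p, a, b)]
            for p in predicates for a in variables for b in variables]

def covers(clause, example):
    """Exhaustive subsumption check (acceptable at toy scale)."""
    consts = sorted({c for atom in example for c in atom[1:]})
    vars_ = sorted({v for _, a, b in clause for v in (a, b)})
    for combo in product(consts, repeat=len(vars_)):
        theta = dict(zip(vars_, combo))
        if all((p, theta[a], theta[b]) in example for p, a, b in clause):
            return True
    return False

def score(clause, pos, neg):
    """Class-sensitive score, so the search is not class-blind."""
    return (sum(covers(clause, e) for e in pos)
            - sum(covers(clause, e) for e in neg))

def search_feature(pos, neg, predicates, depth=2, beam=3, sample=20):
    frontier = [[]]  # start from the most general (empty) clause
    for _ in range(depth):
        pool = [r for c in frontier for r in refinements(c, predicates)]
        pool = random.sample(pool, min(sample, len(pool)))  # stochastic step
        pool.sort(key=lambda c: score(c, pos, neg), reverse=True)
        frontier = pool[:beam]
    return frontier[0]

# Toy usage: prints the best clause found and its score.
pos = [{("bond", "a1", "a2"), ("bond", "a2", "a3")}]
neg = [{("bond", "b1", "b2")}]
clause = search_feature(pos, neg, ["bond"])
print(clause, score(clause, pos, neg))
```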
A first-order representation for knowledge discovery and Bayesian classification on relational data
2000
In this paper we consider different representations for relational learning problems, with the aim of making ILP methods more applicable to real-world problems. In the past, ILP tended to concentrate on the term representation, with the flattened Datalog representation as a 'poor man's version'. There has been relatively little emphasis on database-oriented representations using, e.g., the relational data model or the Entity-Relationship model. On the other hand, much of the available data is stored in multi-relational databases. Even if we don't actually interface our ILP systems with a DBMS, we need to understand the database representation sufficiently well to convert it to an ILP representation. Such conversions, and the relations between different representations, are the subject of this paper. We consider four different representations: the Entity-Relationship model, the relational model, a flattened individual-centred representation based on so-called ISP declarations used by our ILP systems Tertius and 1BC, and the term-based representation. We argue that the term-based representation does not have all the flexibility and expressiveness provided by the other representations. For instance, there is no way to deal with graphs without partly flattening the data (i.e., introducing identifiers). Furthermore, there is no easy way of switching to another individual without converting the data, let alone learning with different individual types. The flattened representation has clear advantages in these respects.
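A minimal sketch of the contrast, with an invented molecule: a nested term cannot represent sharing or cycles, while flattened, identifier-based relations can:

```python
# Term-based view: the individual is a single nested value; edges that
# revisit an atom cannot be expressed without duplicating subterms.
molecule_term = ("molecule", [("atom", "c", [("atom", "o", []),
                                             ("atom", "h", [])])])

# Flattened (Datalog-style) view: identifiers make arbitrary graphs,
# including cycles, representable as plain relations.
atoms = [("m1", "a1", "c"), ("m1", "a2", "o"), ("m1", "a3", "h")]
bonds = [("a1", "a2"), ("a2", "a3"), ("a3", "a1")]  # a cycle: not expressible as a pure term
```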
Ensemble Relational Learning based on Selective Propositionalization
Dealing with structured data requires expressive representation formalisms, which, however, raise the problem of the computational complexity of the machine learning process. Furthermore, real-world domains require tools able to manage their typical uncertainty. Many statistical relational learning approaches try to deal with these problems by combining the construction of relevant relational features with a probabilistic tool. When the combination is static (static propositionalization), the constructed features are treated as boolean features and used offline as input to a statistical learner; when the combination is dynamic (dynamic propositionalization), the feature construction and the probabilistic tool are combined into a single process. In this paper we propose a selective propositionalization method that searches for the optimal set of relational features to be used by a probabilistic learner in order to minimize a loss function. The new propositionalization approach has been combined with the random subspace ensemble method. Experiments on real-world datasets show the validity of the proposed method.
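A minimal sketch of the random subspace step around a probabilistic learner, taking already-propositionalized boolean features as input; BernoulliNB stands in for the probabilistic tool and the selective search itself is omitted:

```python
import random
from sklearn.naive_bayes import BernoulliNB

def train_subspace_ensemble(X, y, n_models=10, subspace=0.5, seed=0):
    """Fit one probabilistic learner per random feature subspace."""
    rng = random.Random(seed)
    n_feats = len(X[0])
    k = max(1, int(subspace * n_feats))
    models = []
    for _ in range(n_models):
        idx = rng.sample(range(n_feats), k)          # random feature subspace
        Xs = [[row[i] for i in idx] for row in X]
        models.append((idx, BernoulliNB().fit(Xs, y)))
    return models

def predict(models, x):
    """Majority vote over the ensemble members."""
    votes = [m.predict([[x[i] for i in idx]])[0] for idx, m in models]
    return max(set(votes), key=votes.count)

# Toy usage with boolean vectors produced by propositionalization.
X = [[1, 0, 1, 1], [0, 1, 0, 1], [1, 1, 1, 0], [0, 0, 0, 1]]
y = [1, 0, 1, 0]
models = train_subspace_ensemble(X, y, n_models=5)
print(predict(models, [1, 0, 1, 0]))
```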
Research on Statistical Relational Learning
2007
This paper presents an overview of the research on learning statistical models of relational data being carried out at the University of Washington. Our work proceeds in five main directions: learning models of social networks; learning models of sequential relational processes; scaling up statistical relational learning to massive data sources; learning for knowledge integration; and learning programs in procedural languages. We describe some of the common themes and research issues arising from this work.