Logic Programming and ILP - Laboratory 6
Basic Ideas Relevant to ILP
Ashwin Srinivasan
Oxford University Computing Laboratory
Last modified: September 12, 1999.
Contents

    Clausal form
    Negating Clauses and Skolemisation
    Redundancy
    A Note on Generalisation
    Subsumption
    Generalisation in ILP
    Hypothesis spaces
    Subsumption Lattice and Search
    Most-Specific Single Clause Hypothesis
    Justification
    Evaluation
    Comparison
Clausal form
Usually logic programs are written as definite clauses that look like:
head_literal:- body_literal_1, body_literal_2, ... .
An example is:
grandparent(X,Z):- parent(X,Y), parent(Y,Z).
Quantification of variables is an issue that is usually left implicit in this notation.
What the clause above really means is:
(Forall X,Z) (grandparent(X,Z):- (Exists Y) (parent(X,Y), parent(Y,Z)))
Moving the existential quantifier outwards (a variable that appears only in the body of a clause can equivalently be universally quantified over the whole clause) and writing the implication as a disjunction gives:

(Forall X,Y,Z) grandparent(X,Z) or ~parent(X,Y) or ~parent(Y,Z)

This form of the grandparent definition is called its "clausal form": all variables are universally quantified, and the literals are joined together by "or".
There is a methodical way to convert any logical sentence to clausal form.
Negating Clauses and Skolemisation
Consider the clause C from the previous section (in clausal form) and its negation ~C:

C : (Forall X,Y,Z) grandparent(X,Z) or ~parent(X,Y) or ~parent(Y,Z)
~C: (Exists X,Y,Z) ~grandparent(X,Z) and parent(X,Y) and parent(Y,Z)

You can replace the existentially quantified variables in ~C with Skolem constants. Thus, we can take ~C as:

~C: ~grandparent($X,$Z) and parent($X,$Y) and parent($Y,$Z)
where $X, $Y, and $Z are Skolem constants chosen to reflect the existential quantification on the variables.
Redundancy
ILP programs typically look to remove redundancy in two ways:
1. When constructing a clause, some literals in the clause may be unnecessary, and can therefore be removed safely (that is, the things the clause can prove remain unaffected by removing the literals). This makes the clause more compact.

2. A clause constructed may make other clauses that are already present unnecessary. These can then be safely removed (that is, the things that the program can prove remain unaffected if the new clause is added and these other clauses are removed).
We will first look at redundant literals in a clause.
Consider (in clausal form):
C : (Forall X,Y,Z,...) L_1 or L_2 or ... or L_l or ... or L_n
C1: (Forall X,Y,Z,...) L_1 or L_2 or ... or L_(l+1) or ... or L_n
Here X,Y,Z,... are the variables in C and C1, and C1 is C with the literal L_l removed. Remember that, for definite clauses, exactly one of these literals will be positive; all the others will be negated. With some background knowledge B, L_l is redundant in C iff:
B and C is logically equivalent to B and C1
That is:
B and C |= B and C1
B and C1 |= B and C
It is straightforward to show that L_l is redundant in C iff:
B and C and ~C1 |= []
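A small worked example (the background clause relating father and parent is invented for this illustration):

B : (Forall X,Y) parent(X,Y) or ~father(X,Y)
C : (Forall X,Y,Z) grandparent(X,Z) or ~father(X,Y) or ~parent(X,Y) or ~parent(Y,Z)
C1: (Forall X,Y,Z) grandparent(X,Z) or ~father(X,Y) or ~parent(Y,Z)

Here the literal ~parent(X,Y) is redundant in C given B. Negating C1 and Skolemising gives ~C1: ~grandparent($X,$Z) and father($X,$Y) and parent($Y,$Z). From father($X,$Y) and B we obtain parent($X,$Y); these literals together with C give grandparent($X,$Z), which contradicts ~grandparent($X,$Z). So B and C and ~C1 |= [], and since C1's literals are a subset of C's, the other direction holds trivially.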
Besides useless literals in a clause, you can imagine cases where a clause may be useless in a program (i.e. a set of clauses). To find such a clause, the procedure is similar to that used above for dud literals. Thus, consider the following sets of clauses:
S : C_1 and C_2 and ... and C_l and ... and C_n
S1: C_1 and C_2 and ... and C_(l+1) and ... and C_n
Notice first that a set of clauses means a conjunction of clauses (each clause in turn being a disjunction of literals). In the above, S1 consists of all clauses in S except C_l. That is S = S1 and C_l
The clause C_l is said to be logically redundant in S iff:
S is logically equivalent to S1
That is:
S1 and C_l |= S1
S1 |= S1 and C_l
It is straightforward to show that C_l is redundant in S iff:
S1 and ~C_l |= []
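A small worked example (the facts are invented for this illustration). Suppose S1 consists of the grandparent clause from the first section together with the facts parent(ann,bob) and parent(bob,cid), and C_l is the unit clause grandparent(ann,cid). Then C_l is redundant in S = S1 and C_l: here ~C_l is ~grandparent(ann,cid), and resolving it against the clauses of S1 derives the empty clause, so S1 and ~C_l |= [].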
A Note on Generalisation
You can see from the previous sections that definitions like:
mem(A,[B,A|C]).
make those like:
mem(0,[0,0]). mem(1,[0,1]). mem(0,[1,0]). mem(0,[2,0]). mem(1,[1,1]).
and even:
mem(A,[1,A|B]).
redundant. This is not surprising, when you consider the universal quantification on the variables of the clause mem(A,[B,A|C]). In some sense, all the clauses made redundant by this clause are special cases of this clause.
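For instance, if just the single clause mem(A,[B,A|C]). is loaded into Prolog, queries corresponding to each of the more specific clauses above succeed:

    ?- mem(0,[0,0]).
    true.
    ?- mem(1,[0,1]).
    true.
    ?- mem(X,[1,X|Y]).
    true.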
The process of constructing some general hypothesis and then providing a justification for why it should be true can be viewed (controversially) as a form of "induction". Thus:
Induction = Hypothesis Formation + Justification
ILP systems form hypotheses in first-order logic. For this, obtaining hypotheses from specific cases is tied in with: logical consequence and logical subsumption. The other side of induction, namely justification, is concerned with selecting the best hypothesis amongst all those that satisfy the logical constraints imposed. For this, ILP draws on ideas from probability or information theory.
Subsumption
Recall that a term is recursively defined as being either:

1. A constant (written in lower case); or
2. A variable (written in upper case); or
3. A function symbol applied to a number of terms as arguments (such as f(a,g(X)) etc.)
Also recall that substitutions are unique mappings from variables to terms and are usually written as sets of variable/term pairs, for example {X/a, Y/f(b)}. In this, X and Y are called the "domain" of the substitution. When a substitution theta is applied to a term t, each variable X in t that is an element of the domain of theta is replaced by the term that X maps to in theta.
Term Subsumption
Given two terms t1 and t2, t1 is said to subsume t2 iff there exists a substitution theta s.t. t1.theta = t2. Thus, in the following, t1.theta = t2 (take theta = {A/h(a), B/i(b)}):

t1 = f(g(A),B)
t2 = f(g(h(a)),i(b))
Note that of the two terms t1 "looks" more general.
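In SWI-Prolog this one-way matching test is available as the built-in subsumes_term(General, Specific), which succeeds iff Specific is an instance of General, without binding any variables. A minimal check on the two terms above, in both directions:

    ?- subsumes_term(f(g(A),B), f(g(h(a)),i(b))).
    true.

    ?- subsumes_term(f(g(h(a)),i(b)), f(g(A),B)).
    false.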
Clause Subsumption
Recall that a clause like:
C: grandparent(X,Z):- parent(X,Y), parent(Y,Z).
is really:
C: (Forall X,Y,Z) grandparent(X,Z) or ~parent(X,Y) or ~parent(Y,Z)
Yet another way that this is written is as a set of literals like so:
C: {grandparent(X,Z), ~parent(X,Y), ~parent(Y,Z)}
Given two clauses C1 and C2, C1 is said to subsume C2 iff there exists a substitution theta s.t. C1.theta ⊆ C2 (where ⊆ denotes the improper subset relation).
Thus for the following pair of clauses:
C1: mem(A,[B|C]):- mem(A,C).
C2: mem(0,[1,0]):- nat(0), nat(1), mem(0,[0]).
It is the case that C1 subsumes C2: take theta = {A/0, B/1, C/[0]}.
Note that of the two clauses C1 "looks" more general.
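A minimal Prolog sketch of this test (the predicate names here are our own, not from any standard library). Clauses are represented as lists of literals, with a negative literal ~L written as not(L); numbervars/3 freezes the variables of the subsumed clause so that they behave as constants, and the double negation ensures no bindings are left behind:

    theta_subsumes(C1, C2) :-
        \+ \+ ( copy_term(C1-C2, D1-D2),   % work on fresh copies
                numbervars(D2, 0, _),      % treat C2's variables as constants
                subset_unify(D1, D2) ).

    subset_unify([], _).
    subset_unify([L|Ls], C2) :-
        member(L, C2),                     % one substitution shared by all literals
        subset_unify(Ls, C2).

For the pair of clauses above:

    ?- theta_subsumes([mem(A,[B|C]), not(mem(A,C))],
                      [mem(0,[1,0]), not(nat(0)), not(nat(1)), not(mem(0,[0]))]).
    true.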
Relation to Implication
It is the case that if a clause C1 subsumes clause C2, then C1 logically implies C2. That is:
If C1 subsumes C2 then C1 |= C2
However, the converse does not hold.
Consider the following clauses:
C1: mem(A,[B|C]):- mem(A,C).
C2: mem(X,[H1,H2|T]):- mem(X,T).

C1 logically implies C2 (C2 follows from two applications of C1, i.e. by resolving C1 with itself), but C1 does not subsume C2: any theta mapping mem(A,[B|C]) onto mem(X,[H1,H2|T]) must map C to [H2|T], so the body literal becomes mem(X,[H2|T]), which is not a literal of C2.
Generalisation in ILP
For ILP systems, the word "generalisation" has a particular logical meaning. A sentence F is said to be more general than G iff:
F |= G
The hypothesis formation problem in this setting is then to find F, given G. Most ILP systems do this by finding Fs that subsume G.
ILP systems for prediction are given a set of clauses as background B, positive examples E+ and negative examples E-. With these the abductive step is to find a set of clauses H, s.t. the following constraints hold:
B and E- is satisfiable            i.e. B and E- |/= []
B and H and E- is satisfiable      i.e. B and H and E- |/= []
B does not logically imply E+      i.e. B |/= E+
B and H logically implies E+       i.e. B and H |= E+
Since, for any H that satisfies these constraints:
B and H |= B and E+
B and H are said to be "more general" than B and E+. That is, the hypothesis H is said to be a "generalisation" of E+.
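As a small illustration (the particular facts and examples are invented), suppose:

B : parent(ann,bob). parent(bob,cid).
E+: grandparent(ann,cid).
E-: ~grandparent(bob,ann)

Then the single clause H: grandparent(X,Z):- parent(X,Y), parent(Y,Z). satisfies all four constraints: B alone does not imply E+, B and H does imply E+, and adding H to B and E- does not produce a contradiction. So H is a generalisation of E+.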
Typically, there are many (possibly infinitely many) hypotheses that satisfy these constraints.
Hypothesis spaces
Consider constructing a set of clauses T within the following language constraints:
P: a set of predicate symbols
C: a set of constant symbols
V: a set of variables
a: the maximum arity of any predicate in P
c: the maximum number of literals allowed in the body of any clause in T
An upper bound on the number of atoms A that can be constructed given P, C, V and a is:

A = |P|.(|C|+|V|)^a

An upper bound on the number of definite clauses that can be constructed with atoms from A, with at most c literals in the body, is:

S = A^(c+1)

An upper bound on the number of clause sets (possible hypotheses) that can be constructed is:

2^S
Note that this is without any function symbols allowed.
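As a small illustration (the numbers are chosen purely for the example), take |P| = 2, |C| = 2, |V| = 2, a = 2 and c = 2. Then A = 2.(2+2)^2 = 32 atoms, S = 32^3 = 32768 definite clauses, and 2^32768 candidate clause sets: a number with nearly 10,000 decimal digits, even in this tiny function-free language.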
Given some positive and negative examples, not all such clause sets will satisfy the logical constraints to qualify them as a hypothesis for the examples.
Rather than searching this space of clause sets in an arbitrary fashion, we can choose to look at clauses in some order. The ordering that almost all ILP systems use is based not on implication, but on the weaker relation of subsumption, which imposes a lattice on the clauses that can be part of a hypothesis.
To really understand the subsumption lattice, you must first get to grips with the notion of least general generalisations (lggs) of terms, literals and clauses.
Given a pair of terms, literals or clauses u,v:
L is the lgg of u and v iff:

1) L is a common generalisation of u and v; and
2) Every other common generalisation of u and v is also a generalisation of L.

F is a common generalisation of G and H iff:

1) F is a generalisation of G; and
2) F is a generalisation of H.
F is a generalisation of G iff F subsumes G
Term Lgg
In an important thesis, Gordon Plotkin provides a recursive method for constructing lggs of two terms u,v (do not get confused by the lower case letters: u and v here do not have to be constants):
1) If u = v then lgg(u,v) = u; otherwise
2) If u is f(s1,...,sn) and v is f(t1,...,tn) then lgg(u,v) = f(lgg(s1,t1),...,lgg(sn,tn)); otherwise
3) lgg(u,v) is the variable V that represents this particular pair of terms throughout.
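A minimal Prolog sketch of this method (the predicate names and the pair table are our own, not Plotkin's notation). Rule 3 is implemented by threading a table of the pairs of subterms already replaced, so that the same pair always maps to the same variable:

    lgg_term(U, V, L) :-
        lgg_term(U, V, L, [], _).

    lgg_term(U, V, U, Tab, Tab) :-            % rule 1: identical terms
        U == V, !.
    lgg_term(U, V, L, Tab0, Tab) :-           % rule 2: same function symbol
        nonvar(U), nonvar(V),
        U =.. [F|Us], V =.. [F|Vs],
        same_length(Us, Vs), !,
        lgg_args(Us, Vs, Ls, Tab0, Tab),
        L =.. [F|Ls].
    lgg_term(U, V, X, Tab, Tab) :-            % rule 3: pair seen before, reuse its variable
        member(map(U0,V0,X), Tab),
        U0 == U, V0 == V, !.
    lgg_term(U, V, X, Tab, [map(U,V,X)|Tab]). % rule 3: introduce a new variable

    lgg_args([], [], [], Tab, Tab).
    lgg_args([U|Us], [V|Vs], [L|Ls], Tab0, Tab) :-
        lgg_term(U, V, L, Tab0, Tab1),
        lgg_args(Us, Vs, Ls, Tab1, Tab).

For example, ?- lgg_term(f(a,g(a)), f(b,g(b)), L). gives L = f(X,g(X)) (up to renaming), since the pair of terms (a,b) is represented by the same variable at both occurrences.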
Literal Lgg
In addition to the rules for term lggs, the lgg of two literals is only defined when the sign and predicate symbol are the same, that is, when both are positive or both are negative literals with the same name and arity. Such literals are said to be "compatible literals". With this restriction, the lgg of literals p(u1,...,un) and p(v1,...,vn) is simply p(lgg(u1,v1),...,lgg(un,vn)).
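For example (the constants are invented for the illustration), lgg(parent(ann,bob), parent(ann,cid)) = parent(ann,X), whereas lgg(parent(ann,bob), ~parent(ann,cid)) and lgg(parent(ann,bob), father(ann,bob)) are undefined, since in each case the literals are not compatible.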
Clause Lgg
We are now in a position to define the Lgg of two clauses. Recall a clause can be thought of as a set of literals. The lgg of two clauses C1 and C2 is:
lgg(C1,C2) = {l: li in C1, and lj in C2 s.t. li, lj are compatible and l = lgg(li,lj)}
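As an example, take two of the unit clauses from the section on generalisation:

C1: {mem(0,[0,0])}
C2: {mem(1,[1,1])}

lgg(C1,C2) = {mem(A,[A,A])}

since the pair of terms (0,1) is mapped to the same variable A wherever it occurs.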
Subsumption Lattice and Search
It can be shown that subsumption imposes a lattice over equivalence classes of clauses. Here, it is sufficient to understand the broad meaning of this statement.
For representative clauses C and D in two equivalence classes there is a partial ordering ">=" based on whether one clause subsumes the other. Second, there is a least upper bound for C and D, which is simply the lgg of the two clauses. Third, there is a greatest lower bound for C and D, which is simply their union.
The equivalence class of C contains all clauses that are subsumption-equivalent to C. This is denoted by:
[C] = {C' | C' subsumes C and C subsumes C'}
This lattice forms the basis of a clause-by-clause search in ILP.
A straightforward procedure is:
1. Pick a positive example e
2. Construct the most-specific single-clause hypothesis (called ⊥)
3. Find the best clause C above ⊥ in the subsumption lattice
4. Remove all clauses made redundant by C and repeat the search
Not all programs explicitly perform Step 2.
Most-Specific Single Clause Hypothesis
Given B and some positive example e, the most specific single clause hypothesis can be derived from the following steps:
We want the most specific clause ⊥ s.t.

B and ⊥ |= e
Using the Deduction Theorem:
B and ~e |= ~⊥
or:
⊥ |= ~(B and ~e)
Thus, as a procedure:
1. Calculate ~e
2. Find the conjunction of literals derivable from B and ~e
3. The most specific single clause hypothesis is the negation of the conjunction in Step 2.
Since Step 2 produces a conjunction of literals, Step 3, which negates this conjunction, produces a disjunction of literals, which is a clause.
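A small worked example (the background facts are invented for the illustration). Suppose:

B: parent(ann,bob). parent(bob,cid).
e: grandparent(ann,cid).

Then ~e is ~grandparent(ann,cid), and the literals derivable from B and ~e are parent(ann,bob), parent(bob,cid) and ~grandparent(ann,cid). Negating their conjunction gives the most specific single clause hypothesis:

⊥: grandparent(ann,cid):- parent(ann,bob), parent(bob,cid).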
The most-specific clause ⊥ may be infinitely long. In practice the best that programs can construct is some finite clause that implies this infinite clause. This is done using constructs like a depth-bounded mode language. The (finite) most-specific clause in this language implies the true ⊥ and forms the basis for the clause-by-clause search.
Justification
There has to be some reason for preferring one hypothesis over another. A probabilistic setting for providing justification for the hypothesis selected is that afforded by Bayes Rule:
Pr(Hyp|Examples) = PriorPr(Hyp).Pr(Examples|Hyp)/PriorPr(Examples)
An information-theoretic version of this follows from taking negative logs of both sides of this equation:
Info(Hyp|Examples) = Info(Hyp) + Info(Examples|Hyp) - Info(Examples)
where Info(X) is the number of bits required to encode X.
This forms the basis of the Minimum Description Length (MDL) principle that selects hypotheses minimising Info(Hyp|Examples). This is equivalent to maximisation of the Bayesian posterior probability.
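As a small illustration (the numbers are invented): if hypothesis H1 costs Info(H1) = 20 bits and the examples cost a further Info(Examples|H1) = 5 bits once H1 is known, while H2 costs only 8 bits but leaves 25 bits of the examples to encode, then, since Info(Examples) is common to both, MDL prefers H1 (20 + 5 = 25 bits) to H2 (8 + 25 = 33 bits); equivalently, H1 has the higher posterior probability.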
Evaluation
Evaluation is concerned with estimating the accuracy of a hypothesis for a particular target concept and example set.
Accuracy is measured according to some probability distribution D over the examples: it is simply the probability that an example drawn according to D is correctly classified by the hypothesis.
The simplest way of estimating this probability is to get more labelled examples (a ``test'' set). If m of n new examples are correctly classified by the hypothesis, then an unbiased estimate of its accuracy is m/n. In fact if the true accuracy is p, then the number of correct classifications is binomially distributed with mean np and variance np(1-p).
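A minimal Prolog sketch (the predicate name is our own) of the test-set estimate and its approximate standard error, which follows from the binomial variance np(1-p) quoted above:

    accuracy_estimate(M, N, Acc, StdErr) :-
        Acc is M / N,                           % proportion correct on the test set
        StdErr is sqrt(Acc * (1 - Acc) / N).    % approximate standard error of Acc

For example, ?- accuracy_estimate(80, 100, Acc, SE). gives Acc = 0.8 with SE of about 0.04.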
For many practical problems, getting a new set of labelled examples is not a viable option. This motivates the use of resampling methods. Here the available data of n examples are partitioned into k sets. A hypothesis is obtained by ignoring, in turn, the data in each of these sets (this is k-fold cross-validation). For each hypothesis the classification on the set ignored is then obtained. The accuracy is then estimated by the proportion m/n of correct classifications made on these ``left-out'' datasets. Strictly speaking, the variance expression above does not hold for this technique.
Comparison
Comparing the performance of a pair of algorithms usually involves testing the null hypothesis that there is no difference in their accuracies.
The simplest way of comparing a pair of algorithms is to record their performance on n independently drawn ``test'' examples. The comparison is then analogous to a coin-tossing experiment where a ``head'' occurs if Algorithm 1 does better than Algorithm 2 on an example. If there is no difference between the algorithms then we would expect n/2 heads. Since the number of heads with a fair coin is governed by the binomial distribution with probability of success equal to 0.5, we can calculate the probability of obtaining at least the number of heads observed. This can be compared to an acceptable significance level.
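A minimal Prolog sketch of this calculation (the predicate names are our own): the probability of at least Heads heads in N tosses of a fair coin, which is the value compared against the chosen significance level.

    binomial_tail(N, Heads, P) :-
        tail_sum(Heads, N, 0.0, P).

    tail_sum(K, N, Acc, Acc) :-
        K > N, !.                      % no terms left to add
    tail_sum(K, N, Acc0, P) :-
        choose(N, K, C),
        Acc is Acc0 + C * 0.5**N,      % add Pr(exactly K heads)
        K1 is K + 1,
        tail_sum(K1, N, Acc, P).

    choose(_, 0, 1) :- !.              % C(n,0) = 1
    choose(N, K, C) :-
        K > 0,
        N1 is N - 1, K1 is K - 1,
        choose(N1, K1, C1),
        C is C1 * N / K.               % C(n,k) = (n/k).C(n-1,k-1)

For example, if Algorithm 1 does better on 15 of 20 test examples, ?- binomial_tail(20, 15, P). gives P of about 0.021, which would be judged significant at a 0.05 level under this (one-sided) test.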
Care must be taken with repeated testing of pairs of algorithms. With sufficient comparisons, it is possible, simply by chance, to obtain one in which Algorithm 1 looks better than its adversary. This has to be accounted for by statistical corrections to the acceptable significance level (for example, a Bonferroni correction).