Logic Programming and ILP - Laboratory 6

Basic Ideas Relevant to ILP

Ashwin Srinivasan

(ashwin@comlab.ox.ac.uk)

Oxford University Computing Laboratory

Last modified: September 12, 1999.


Contents

  Clausal Form
  Negating Clauses and Skolemisation
  Redundancy
  A Note on Generalisation
  Subsumption
    Term Subsumption
    Clause Subsumption
    Relation to Implication
  Generalisation in ILP
    Hypothesis Spaces
    Term Lgg
    Literal Lgg
    Clause Lgg
    Subsumption Lattice
    Most-Specific Single Clause Hypothesis
  Justification
  Evaluation
  Comparison


Clausal form

Usually logic programs are written as definite clauses that look like:

head_literal:- body_literal_1, body_literal_2, ... .

An example is:

grandparent(X,Z):- parent(X,Y), parent(Y,Z).
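
For concreteness, a minimal runnable version of this example (SWI-Prolog; the parent/2 facts are assumed for illustration and are not part of the notes):

parent(ann, bob).
parent(bob, cat).

grandparent(X,Z):- parent(X,Y), parent(Y,Z).

% ?- grandparent(ann, Who).
% Who = cat.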

The quantification of the variables is usually left implicit.

What  the clause above really means is:

(Forall X,Z) (grandparent(X,Z):- (Exists Y) (parent(X,Y), parent(Y,Z)))

Moving the quantifiers to the front (the existential quantifier on Y turns into a universal one when the body is negated) and rewriting the implication as a disjunction gives:

C:  (Forall X,Y,Z) grandparent(X,Z) or ~parent(X,Y) or ~parent(Y,Z)

This form of the grandparent definition is called its "clausal form": all variables are universally quantified, and the literals are joined together by "or".

There is a methodical way to convert any logical sentence to clausal form.

Negating Clauses and Skolemisation

Negating C turns the universal quantifiers into existential ones and the disjunction into a conjunction:

C:  (Forall X,Y,Z) grandparent(X,Z) or ~parent(X,Y) or ~parent(Y,Z)
~C: (Exists X,Y,Z) ~grandparent(X,Z) and parent(X,Y) and parent(Y,Z)

You can replace the existentially quantified variables in ~C with Skolem constants. Thus we can take ~C as:

~C: ~grandparent($X,$Z) and parent($X,$Y) and parent($Y,$Z)

where $X, $Y, and $Z are Skolem constants chosen to  reflect the existential quantification on the variables.
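
A minimal sketch of this step in SWI-Prolog (not part of the original notes): each variable of a copy of the term is bound to a term of the form '$sk'(N), which plays the role of a Skolem constant here.

% skolemise(+Term, -Skolemised): bind each distinct variable of a copy of
% Term to a fresh Skolem constant '$sk'(N).
skolemise(Term, Skolemised) :-
    copy_term(Term, Skolemised),
    term_variables(Skolemised, Vars),
    bind_skolem(Vars, 0).

bind_skolem([], _).
bind_skolem([V|Vs], N) :-
    V = '$sk'(N),
    N1 is N + 1,
    bind_skolem(Vs, N1).

% For example:
% ?- skolemise(p(X,Y,X), S).
% S = p('$sk'(0), '$sk'(1), '$sk'(0)).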

Redundancy

ILP programs typically look to remove  redundancy  in two ways:

1. When constructing a clause, some literals in the clause may be unnecessary, and can therefore be removed safely (that is, the things the clause can prove remain unaffected by removing the literals). This makes the clause more compact.

2. A newly constructed clause may make other clauses that are already present unnecessary. These can then be safely removed (that is, the things that the program can prove remain unaffected if the new clause is added and these other clauses are removed).

We will first look at redundant literals in a clause.

Consider (in clausal form):

C : (Forall X,Y,Z,...) L_1 or L_2 or ... or L_l or ... or L_n
C1: (Forall X,Y,Z,...) L_1 or L_2 or ... or L_(l+1) or ... or L_n

Here X,Y,Z,... are the variables in C and C1, and C1 is C with L_l removed. Remember, for definite clauses, only one of these literals will be positive; all others will be negated. With some background knowledge B, L_l is redundant in C iff:

        B and C is logically equivalent to B and C1

That is:

     B and C |= B and C1    and    B and C1 |= B and C

It is straightforward to show that L_l is redundant in C iff:

        B and C and ~C1 |= []
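
As a hedged illustration (not in the original notes): suppose B contains the clause q(X):- p(X), and consider

  C : r(X):- p(X), q(X).
  C1: r(X):- p(X).

The literal q(X) is redundant in C. To check this with the test above, Skolemise ~C1 to obtain the literals ~r($x) and p($x). From p($x) and B we derive q($x); using C we then derive r($x), which contradicts ~r($x). Hence B and C and ~C1 |= []. (The other direction, B and C1 |= B and C, is immediate because C1 subsumes C.)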

Besides useless literals in a clause, you can imagine cases where a clause may be useless in a program (i.e. a set of clauses). To find such a clause, the procedure is similar to that used above for dud literals. Thus, consider the following sets of clauses:

S : C_1 and C_2 and ... and C_l and ... and C_n
S1: C_1 and C_2 and ... and C_(l+1) and ... and C_n

Notice first that a set of clauses means a conjunction of clauses (each clause in turn being a disjunction of literals). In the above, S1 consists of all clauses in S except C_l. That is, S = S1 and C_l.

The clause C_l is said to be logically redundant in S iff:

              S is logically equivalent to S1

That is:

         S1 and C_l |= S1   and   S1 |= S1 and C_l

It is straightforward to show that C_l is redundant in S iff:

                     S1 and ~C_l |= []

A Note on Generalisation

You can see from the previous sections that definitions like:

  mem(A,[B,A|C]).

make those like:

mem(0,[0,0]).
mem(1,[0,1]).
mem(0,[1,0]).
mem(0,[2,0]).
mem(1,[1,1]).

and even:

      mem(A,[1,A|B]).

redundant. This is not surprising, when you consider the universal quantification on the variables of the clause mem(A,[B,A|C]). In some sense, all the clauses made redundant by this clause are special cases of this clause.
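
A quick hedged check of this in SWI-Prolog (not part of the original notes), writing the unused variables of the general clause anonymously to avoid singleton warnings:

mem(A,[_,A|_]).     % mem(A,[B,A|C])

% ?- mem(0,[0,0]), mem(1,[0,1]), mem(0,[1,0]), mem(0,[2,0]), mem(1,[1,1]).
% true.

Each of the ground clauses above is just an instance of this single clause.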

The process of constructing  some  general  hypothesis and then providing a justification for why it should be true can be viewed (controversially) as a form of "induction". Thus:

           Induction = Hypothesis Formation + Justification

ILP systems form hypotheses in first-order logic. For this, obtaining hypotheses from specific cases is tied in with logical consequence and logical subsumption. The other side of induction, namely justification, is concerned with selecting the best hypothesis amongst all those that satisfy the logical constraints imposed. For this, ILP draws on ideas from probability or information theory.

Subsumption

Recall that a term is recursively defined as being either:

        1. A constant (written in lower case); or
        2. A variable (written in upper case); or
        3. A function symbol which takes a number of terms
           as arguments (such as f(a,g(X)) etc.)

Also recall that  substitutions  are  unique  mappings  from variables  to  terms  and  are  usually  written  as sets of variable/term pairs, for example {X/a, Y/f(b)}. In  this,  X and  Y  are  called the "domain" of the substitution. When a substitution theta is applied to a term t, each  variable  X in  t  that is an element of the domain of theta is replaced by the term that X maps to in theta.

Term Subsumption

Given two terms t1 and t2, t1 is said to subsume t2 iff there exists a substitution theta s.t. t1.theta = t2. Thus, in the following, t1.theta = t2:

  t1 = f(g(A),B)
  t2 = f(g(h(a)),i(b))

Note that of the two terms t1 "looks" more general.
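
A hedged check of this example in SWI-Prolog (not part of the original notes), using the built-in subsumes_term/2, which succeeds when its first argument subsumes its second without binding the second argument's variables:

% ?- subsumes_term(f(g(A),B), f(g(h(a)),i(b))).
% true.

The witnessing substitution is theta = {A/h(a), B/i(b)}.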

Clause Subsumption

Recall that a clause like:

C: grandparent(X,Z):- parent(X,Y), parent(Y,Z).

is really:

C:  (Forall X,Y,Z) grandparent(X,Z) or ~parent(X,Y) or ~parent(Y,Z)

Yet another way that this is written is as a set of literals like so:

C:  {grandparent(X,Z), ~parent(X,Y), ~parent(Y,Z)}

Given two clauses C1 and C2, C1 is said to subsume C2 iff there exists a substitution theta s.t. C1.theta is a subset of C2 (possibly C2 itself).

Thus for the following pair of clauses:

C1:  mem(A,[B|C]):- mem(A,C).
C2:  mem(0,[1,0]):- nat(0), nat(1), mem(0,[0]).

It is the case that C1 subsumes C2.

Note that of the two clauses C1 "looks" more general.
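
A hedged sketch of this test in SWI-Prolog (not part of the original notes). Clauses are given as lists of literals, with negative literals wrapped in neg/1 (a representation chosen here, not notation from the notes).

% theta_subsumes(+C1, +C2): there is a theta s.t. C1.theta is a subset of C2.
theta_subsumes(C1, C2) :-
    \+ \+ ( copy_term(C2, C2G),
            numbervars(C2G, 0, _),     % freeze C2's variables as constants
            copy_term(C1, C1C),
            match_literals(C1C, C2G) ).

match_literals([], _).
match_literals([L|Ls], C2) :-
    member(L, C2),                     % unify L with some literal of C2
    match_literals(Ls, C2).

% With C1 and C2 as in the text:
% ?- theta_subsumes([mem(A,[B|C]), neg(mem(A,C))],
%                   [mem(0,[1,0]), neg(nat(0)), neg(nat(1)), neg(mem(0,[0]))]).
% true.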

Relation to Implication

It is the case that if a clause C1 subsumes clause C2,  then C1 logically implies C2. That is:

            If C1 subsumes C2   then   C1 |=  C2

However the reverse does not hold.

Consider the following clauses:

C1:  mem(A,[B|C]):- mem(A,C).
C2:  mem(X,[H1,H2|T]):- mem(X,T).

C1 logically implies C2 (resolving C1 with a copy of itself yields C2), but it does not subsume it: matching the heads forces C to be [H2|T], while matching the body literals forces C to be T, so no single substitution works.

Generalisation in ILP

For ILP systems, the word "generalisation" has a  particular logical  meaning.  A sentence F is said to be more general than G iff:

                           F |= G

The hypothesis formation problem in this setting is then to find F, given  G. Most ILP systems do this by finding Fs that subsume G.

ILP systems for prediction are given a set of clauses as background B, positive examples E+ and negative examples E-. With these the abductive step is to find a set of clauses H, s.t. the following constraints hold:

        B and E- is satisfiable            i.e.  B and E- |/= []
        B and H and E- is satisfiable      i.e.  B and H and E- |/= []
        B does not logically imply E+      i.e.  B |/= E+
        B and H logically implies E+       i.e.  B and H |= E+

Since, for any H that satisfies these constraints:

        B and H |= B and E+

B and H are said to be "more general" than B  and  E+.  That is, the hypothesis H is said to be a "generalisation" of E+.

Typically, there are  many  (possibly  infinite)  hypotheses that would satisfy these constraints.
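
As a hedged illustration (not from the original notes): suppose B is empty, the positive examples E+ are mem(0,[0]) and mem(1,[0,1]), and E- states that mem(2,[0,1]) is false. Then the hypothesis

  H:  mem(A,[A|B]).
      mem(A,[B|C]):- mem(A,C).

satisfies all four constraints: B alone proves neither positive example, B and H prove both of them, and B and H and E- remains satisfiable because H cannot prove mem(2,[0,1]). Many other hypotheses (for example, the two positive examples themselves taken as unit clauses) also satisfy the constraints.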


Hypothesis spaces

Consider constructing a set of clauses T within the following language constraints:

        P: a set of predicate symbols
        C: a set of constant symbols
        V: a set of variables
        a: the maximum arity of any predicate in P
        c: the maximum number of literals allowed in the body
           of any clause in T

An upper bound on the number of atoms A that can be constructed given P, C, V and a is: B = |P|.(|C|+|V|)^a

An upper bound on the number of definite clauses that can be chosen with atoms from A, with at most c literals in the body, is: S = B^(c+1)

An upper bound on the number of clause sets (possible hypotheses) that can be constructed is: 2^S

Note that this is without any function symbols allowed.
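
A hedged helper (SWI-Prolog, not part of the original notes) that evaluates these bounds for concrete sizes of P, C and V:

% space_bounds(+NumPreds, +NumConsts, +NumVars, +Arity, +MaxBody,
%              -Atoms, -Clauses, -ClauseSets)
space_bounds(NumPreds, NumConsts, NumVars, Arity, MaxBody, Atoms, Clauses, ClauseSets) :-
    Atoms is NumPreds * (NumConsts + NumVars) ^ Arity,      % B
    Clauses is Atoms ^ (MaxBody + 1),                       % S
    ClauseSets is 2 ^ Clauses.                              % 2^S

% For example, with 2 predicates, 2 constants, 3 variables, arity 2 and at
% most 2 body literals:
% ?- space_bounds(2, 2, 3, 2, 2, A, C, S).
% A = 50, C = 125000, S = 2^125000 (a number with over 37,000 digits).

Even for these tiny values the space of clause sets is astronomically large, which is why the ordering described next matters.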

Given some positive and  negative  examples,  not  all  such clause  sets will satisfy the logical constraints to qualify them as a hypothesis for the examples.

Rather than search this space of clause sets in an arbitrary fashion, we can choose to look at them in some order. The ordering that almost all ILP systems use is based not on implication, but on the weaker relation of subsumption, which imposes a lattice on clauses that can be part of a hypothesis.

To really understand the subsumption lattice, you must first get to grips with the notion of least general generalisations of terms, literals and clauses.

Given a pair of terms, literals or clauses u,v:

L is lgg of u and v iff

1) L is a common generalisation of u and v; and
2) Every other common generalisation of u and v is also a generalisation of L.

F is a common generalisation of G and H iff
1) F is a generalisation of G; and
2) F is a generalisation of H.

F is a generalisation of G iff F subsumes G


Term Lgg

In an important thesis, Gordon Plotkin provides a  recursive method  for  constructing  lggs of two terms u,v (do not get confused by the lower case letters: u and v here do not have to be constants):

1) If u=v then lgg(u,v) = u; otherwise
2) If u is f(s1,...,sn) and v is f(t1,...,tn) then
   lgg(u,v) = f(lgg(s1,t1),...,lgg(sn,tn)); otherwise
3) lgg(u,v) is the variable V that represents this pair of
   terms throughout.
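
A hedged implementation of these rules in SWI-Prolog (not part of the original notes). The accumulator Map records which pairs of subterms have already been replaced by which variable, so that the same pair is always mapped to the same variable, as rule 3 requires.

% lgg_term(+T1, +T2, -G): G is the lgg of terms T1 and T2.
lgg_term(T1, T2, G) :-
    lgg_term(T1, T2, G, [], _).

lgg_term(T1, T2, G, Map0, Map) :-
    (   T1 == T2                                   % rule 1: identical terms
    ->  G = T1, Map = Map0
    ;   nonvar(T1), nonvar(T2),
        T1 =.. [F|Args1], T2 =.. [F|Args2],
        length(Args1, N), length(Args2, N)         % rule 2: same functor and arity
    ->  lgg_args(Args1, Args2, GArgs, Map0, Map),
        G =.. [F|GArgs]
    ;   seen_pair(T1, T2, Map0, Var)               % rule 3: reuse the variable
    ->  G = Var, Map = Map0                        %   already chosen for this pair
    ;   Map = [pair(T1,T2,G)|Map0]                 % rule 3: introduce a new variable
    ).

lgg_args([], [], [], Map, Map).
lgg_args([A1|As1], [A2|As2], [G|Gs], Map0, Map) :-
    lgg_term(A1, A2, G, Map0, Map1),
    lgg_args(As1, As2, Gs, Map1, Map).

seen_pair(T1, T2, [pair(U,V,Var)|_], Var) :-
    U == T1, V == T2, !.
seen_pair(T1, T2, [_|Rest], Var) :-
    seen_pair(T1, T2, Rest, Var).

% For example (the repeated pair a/b gets the same variable):
% ?- lgg_term(f(a,a,b), f(b,b,a), G).
% G = f(_A, _A, _B).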

Literal Lgg

In addition to the rules for term lggs, the lgg of two literals is only defined when the sign and predicate symbol are the same. That is, they are both positive or both negative literals with the same name and arity. Such literals are said to be "compatible literals". With this restriction, the lgg of literals p(u1,...,un) and p(v1,...,vn) is simply p(lgg(u1,v1),...,lgg(un,vn)).

Clause Lgg

We are now in a position to define the Lgg of  two  clauses. Recall  a  clause  can be thought of as a set of literals. The lgg of two clauses C1 and C2 is:

lgg(C1,C2) = {l: li in C1, and lj in C2 s.t. li, lj are compatible and l = lgg(li,lj)}
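
A hedged sketch of this in SWI-Prolog (not part of the original notes), building on the lgg_term/5 helper given in the Term Lgg section. Clauses are again lists of literals with negative literals wrapped in neg/1; crucially, one variable map is threaded through all literal pairs, so the same pair of subterms gets the same variable across the whole clause.

% compatible(+L1, +L2): L1 and L2 have the same sign, name and arity.
compatible(neg(A), neg(B)) :- !,
    functor(A, F, N), functor(B, F, N).
compatible(A, B) :-
    A \= neg(_), B \= neg(_),
    functor(A, F, N), functor(B, F, N).

lgg_clause(C1, C2, Lgg) :-
    compatible_pairs(C1, C2, Pairs),
    lgg_pairs(Pairs, [], Lgg).

compatible_pairs([], _, []).
compatible_pairs([L1|Ls], C2, Pairs) :-
    pairs_with(L1, C2, P1),
    compatible_pairs(Ls, C2, P2),
    append(P1, P2, Pairs).

pairs_with(_, [], []).
pairs_with(L1, [L2|Ls], Pairs) :-
    (   compatible(L1, L2)
    ->  Pairs = [L1-L2|Rest]
    ;   Pairs = Rest
    ),
    pairs_with(L1, Ls, Rest).

lgg_pairs([], _, []).
lgg_pairs([L1-L2|Ps], Map0, [G|Gs]) :-
    lgg_term(L1, L2, G, Map0, Map1),
    lgg_pairs(Ps, Map1, Gs).

% For example, with two of the ground mem clauses from earlier:
% ?- lgg_clause([mem(0,[0,0])], [mem(1,[0,1])], G).
% G = [mem(_A, [0, _A])].

The result mem(A,[0,A]) subsumes both input clauses; the shared variable shows why the variable map must be global to the clause.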

Subsumption Lattice

It can be shown that subsumption imposes a lattice over equivalence classes of clauses. Here, it is sufficient to understand the broad meaning of this statement.

First, for representative clauses C and D in two equivalence classes there is a partial ordering ">=" based on whether one clause subsumes the other. Second, there is a least upper bound for C and D, which is simply the lgg of the two clauses. Third, there is a greatest lower bound for C and D, which is simply their union.

The equivalence class of C contains all clauses that are subsumption-equivalent to C. This is denoted by:

          [C] = {C' | C' subsumes C and C subsumes C'}

This lattice forms the basis of a clause-by-clause search in ILP.

A straightforward  procedure is:

  1. Pick a positive example e

  2. Construct the most-specific single clause hypothesis (called |)

  3. Find the best clause C above | in the subsumption lattice

  4. Remove all clauses made redundant by C and repeat search

Not all programs explicitly perform Step 2.
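
A hedged, high-level sketch of this loop (SWI-Prolog syntax, not part of the original notes). The helpers saturate/2, best_generalisation/2 and covered_by/2 are hypothetical place-holders for Steps 2-4; they are not defined in these notes.

% induce(+PositiveExamples, +ClausesSoFar, -Clauses)
induce([], Clauses, Clauses).
induce([E|Pos], Acc, Clauses) :-
    saturate(E, Bottom),                 % Step 2: most-specific clause for E
    best_generalisation(Bottom, C),      % Step 3: best clause above Bottom
    exclude(covered_by(C), Pos, Rest),   % Step 4: drop examples C now covers
    induce(Rest, [C|Acc], Clauses).

Step 1 corresponds to taking the example at the head of the list.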


Most-Specific Single Clause Hypothesis

Given B and some positive example  e,  the  most  specific single  clause  hypothesis can be derived from the following steps:

We want the most specific clause | s.t.

B and | |= e

Using the Deduction Theorem:

B and ~e |= ~|

or:

| |= ~(B and ~e)

Thus, as a procedure:

  1. Calculate ~e
  2. Find the conjunction of literals derivable from B and ~e
  3. The most specific single clause hypothesis is the negation
     of the conjunction in Step 2.

Since Step 2  produces a conjunction of  literals,  Step  3, which  negates  this  conjunction, produces a disjunction of literals, which is a clause.
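
As a hedged illustration (not in the original notes): suppose B contains the facts parent(ann,bob) and parent(bob,cat), and e is grandparent(ann,cat). Since e is ground, ~e is simply ~grandparent(ann,cat) and no Skolem constants are needed. The literals derivable from B and ~e are:

  parent(ann,bob), parent(bob,cat), ~grandparent(ann,cat)

Negating this conjunction gives the clause:

  grandparent(ann,cat):- parent(ann,bob), parent(bob,cat).

The grandparent/2 clause given at the start of these notes subsumes this most-specific clause, and so lies above it in the subsumption lattice.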

The most-specific clause may be infinitely long. In practice the best that programs can construct is some finite clause that implies this infinite clause. This is done using constructs like a depth-bounded mode language. The (finite) most-specific clause in this language implies the true | and forms the basis for the clause-by-clause search.


Justification

There has to be some reason for preferring one hypothesis over another. A probabilistic setting for providing justification for the hypothesis selected is that afforded by Bayes Rule:

Pr(Hyp|Examples) = PriorPr(Hyp).Pr(Examples|Hyp)/PriorPr(Examples)

An information-theoretic version of this follows from taking negative logs of both sides of this equation:

Info(Hyp|Examples) = Info(Hyp) + Info(Examples|Hyp) - Info(Examples)

where Info(X) is the number of bits required to encode X.

This forms the basis of the Minimum Description Length (MDL) principle, which selects hypotheses that minimise Info(Hyp|Examples). This is equivalent to maximising the Bayesian posterior probability.


Evaluation

Evaluation is concerned with estimating the accuracy of a hypothesis for a particular target concept and example set.

Accuracy is measured according to some probability distribution D over the examples. It is simply the probability of drawing an example, under D, that the hypothesis classifies correctly.

The simplest way of estimating this probability is to get more labelled examples (a ``test'' set). If m of n new examples are correctly classified by the hypothesis, then an unbiased estimate of its accuracy is m/n. In fact if the true accuracy is p, then the number of correct classifications is  binomially distributed with mean np and variance np(1-p).
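
A hedged helper in SWI-Prolog (not part of the original notes): the point estimate of accuracy from a test set, together with the standard error sqrt(p(1-p)/n) that follows from the binomial variance quoted above.

% accuracy_estimate(+Correct, +Total, -Acc, -StdErr)
accuracy_estimate(Correct, Total, Acc, StdErr) :-
    Acc is Correct / Total,
    StdErr is sqrt(Acc * (1 - Acc) / Total).

% ?- accuracy_estimate(90, 100, Acc, SE).
% Acc = 0.9, SE is approximately 0.03.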

For many practical problems, getting a new set of labelled examples is not a viable option. This motivates the use of resampling methods. Here the available data of n examples are partitioned into k sets. A hypothesis is obtained by ignoring, in turn, the data in each of these sets. For each hypothesis the classification on the set ignored is then obtained. The accuracy of the hypothesis is then the proportion m/n of correct classifications made on these ``left-out'' datasets. Strictly speaking, the variance expression above does not hold for this technique.


Comparison

Comparing the performance of a pair of algorithms involves testing the null hypothesis that there is no difference in their accuracies.

The simplest way of comparing a pair of algorithms is to record their performance on n independently drawn ``test'' examples. The comparison is then analogous to a coin-tossing experiment where a ``head'' occurs if Algorithm 1 does better than Algorithm 2. If there is no difference in the algorithms then we would expect n/2 heads. Since the number of heads with a fair coin is governed by the binomial distribution with probability of success equal to 0.5, we can calculate the probability of obtaining at least the number of heads observed. This can be compared to an acceptable significance level.
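
A hedged sketch of this calculation in SWI-Prolog (not part of the original notes): the probability of seeing at least Heads wins for Algorithm 1 out of N trials if the two algorithms were really equivalent (a fair coin).

% sign_test(+N, +Heads, -PValue): PValue is Pr(at least Heads heads in N fair tosses).
sign_test(N, Heads, PValue) :-
    tail_prob(Heads, N, 0.0, PValue).

tail_prob(K, N, Acc, P) :-
    (   K > N
    ->  P = Acc
    ;   choose(N, K, C),
        Acc1 is Acc + C * 0.5 ** N,
        K1 is K + 1,
        tail_prob(K1, N, Acc1, P)
    ).

% choose(+N, +K, -C): C is the binomial coefficient N-choose-K.
choose(_, 0, 1) :- !.
choose(N, K, C) :-
    K > 0, N1 is N - 1, K1 is K - 1,
    choose(N1, K1, C1),
    C is C1 * N / K.

% ?- sign_test(20, 15, P).
% P is approximately 0.021, so at the 0.05 level "no difference" would be rejected.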

Care must be taken with repeated testing of pairs of algorithms. With sufficiently many comparisons it is possible, simply by chance, to obtain one in which Algorithm 1 looks better than its adversary. This has to be accounted for by statistical corrections to the acceptable significance level.