StringKernel
Implementation of the subsequence kernel (SSK) as described in [1] and of the subsequence kernel with lambda pruning (SSK-LP) as described in [2].
For more information, see
Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini, Christopher J. C. H. Watkins (2002). Text Classification using String Kernels. Journal of Machine Learning Research. 2:419-444.
F. Kleedorfer, A. Seewald (2005). Implementation of a String Kernel for WEKA. Wien, Austria.
BibTeX:
@article{Lodhi2002,
  author  = {Huma Lodhi and Craig Saunders and John Shawe-Taylor and Nello Cristianini and Christopher J. C. H. Watkins},
  journal = {Journal of Machine Learning Research},
  pages   = {419-444},
  title   = {Text Classification using String Kernels},
  volume  = {2},
  year    = {2002},
  HTTP    = {http://www.jmlr.org/papers/v2/lodhi02a.html}
}

@techreport{Kleedorfer2005,
  address     = {Wien, Austria},
  author      = {F. Kleedorfer and A. Seewald},
  institution = {Oesterreichisches Forschungsinstitut fuer Artificial Intelligence},
  number      = {TR-2005-13},
  title       = {Implementation of a String Kernel for WEKA},
  year        = {2005}
}
Valid options are:
-D Enables debugging output (if available) to be printed. (default: off)
-no-checks Turns off all checks - use with caution! (default: checks on)
-P <0|1> The pruning method to use: 0 = No pruning 1 = Lambda pruning (default: 0)
-C The size of the cache (a prime number). (default: 250007)
-IC The size of the internal cache (a prime number). (default: 200003)
-L The lambda constant. Penalizes non-continuous subsequence matches. Must be in (0,1). (default: 0.5)
-ssl The length of the subsequence. (default: 3)
-ssl-max The maximum length of the subsequence. (default: 9)
-N Use normalization. (default: no)
Theory
Overview
The algorithm computes a measure of similarity between two texts based on the number and form of their common subsequences, which need not be contiguous. The method is parametrized by the subsequence length k, the penalty factor lambda, which penalizes non-contiguous matches, and optional 'lambda pruning', which takes the maximum lambda exponent, m, as a parameter. Lambda pruning causes very 'stretched' subsequence matches not to be counted, thus speeding up the computation. The functionality of SSK and SSK-LP is explained below using simple examples.
Explanation & Examples
For all of the following examples, we assume these parameter values:
k = 2
lambda = 0.5
m = 8 (for SSK-LP examples)
SSK
Example 1
SSK(2,"ab","axb") = 0.5^5 = 0.03125
There is one subsequence of length 2 that both strings have in common: "ab". The result of SSK is computed by raising lambda to the power of L, where L is the length of the subsequence match in the one string plus the length of the match in the other. In our case, the match spans 2 characters in "ab" and 3 characters in "axb":
ab      axb
--      - -     L = 2 + 3 = 5
Hence the kernel yields 0.5^5 = 0.03125.
Example 2
SSK(2,"ab","abb") = 0.5^4 + 0.5^5 = 0.09375
Here, too, there is one subsequence of length 2 that both strings have in common: "ab". However, the result of SSK is computed by summing the values obtained for every occurrence of a common subsequence match. In this example there are two such occurrences:
ab      abb
--      --      L = 2 + 2 = 4
--      -  -    L = 2 + 3 = 5
We have two matches, one of length 2+2=4 and one of length 2+3=5, so the result is 0.5^4 + 0.5^5 = 0.09375.
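The two examples above can be reproduced with the recursive formulation of the kernel given in [1]. The following sketch is illustrative only: class and method names are ours, not part of the Weka API, and it omits the caching that the actual implementation uses.

```java
// Illustrative, uncached implementation of the SSK recursion from [1].
// Class and method names are ours, not part of the Weka API.
public class SskRecursive {

    static final double LAMBDA = 0.5;

    // K'_i(s, t): auxiliary kernel over matches anchored at the end of s
    static double kPrime(int i, String s, String t) {
        if (i == 0) return 1.0;
        if (Math.min(s.length(), t.length()) < i) return 0.0;
        char x = s.charAt(s.length() - 1);
        String head = s.substring(0, s.length() - 1);
        double sum = LAMBDA * kPrime(i, head, t);
        for (int j = 0; j < t.length(); j++) {
            if (t.charAt(j) == x) {
                // gap penalty lambda^(|t| - j + 1) for the stretch in t
                sum += kPrime(i - 1, head, t.substring(0, j))
                        * Math.pow(LAMBDA, t.length() - j + 1);
            }
        }
        return sum;
    }

    // K_k(s, t): the subsequence kernel for subsequence length k
    static double ssk(int k, String s, String t) {
        if (Math.min(s.length(), t.length()) < k) return 0.0;
        char x = s.charAt(s.length() - 1);
        String head = s.substring(0, s.length() - 1);
        double sum = ssk(k, head, t);
        for (int j = 0; j < t.length(); j++) {
            if (t.charAt(j) == x) {
                sum += kPrime(k - 1, head, t.substring(0, j)) * LAMBDA * LAMBDA;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(ssk(2, "ab", "axb")); // Example 1: 0.03125
        System.out.println(ssk(2, "ab", "abb")); // Example 2: 0.09375
    }
}
```

Run against the two examples, this reproduces 0.5^5 = 0.03125 and 0.5^4 + 0.5^5 = 0.09375.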
SSK-LP
Without lambda pruning, the string kernel finds *all* common subsequences of the given length, whereas with lambda pruning, common subsequence matches that are stretched too far in the two strings are not taken into account. The rationale is that the value contributed by such a match, lambda^(length[match in s] + length[match in t]), is too small to matter. Tests have shown that a tremendous speedup can be achieved using this technique at very little loss of quality.
Lambda pruning is parametrized by the maximum lambda exponent, m. As a rule of thumb, it is a good idea to choose m to be about 3 or 4 times the subsequence length. YMMV.
Example 3
With k=2 and without lambda pruning, one common subsequence, "AB", would be found in the following two strings:
SSK(2,"AxxxxxxxxxB","AyB") = 0.5^14 = 0.00006103515625
Lambda pruning allows for controlling the match length. So if m (the maximum lambda exponent) is, e.g., 8, these two strings yield a kernel value of 0:
with lambda pruning:    SSK-LP(2,8,"AxxxxxxxxxB","AyB") = 0
without lambda pruning: SSK(2,"AxxxxxxxxxB","AyB")      = 0.5^14 = 0.00006103515625
This is because the exponent for lambda (= the length of the subsequence match) would be 14, which is > 8. In contrast, the next result is > 0:
SSK-LP(2,8,"AxxB","AyyB") = 0.5^8 = 0.00390625
because here the lambda exponent is 8, which is just accepted by lambda pruning.
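The pruning rule can also be made concrete by brute force: enumerate every pair of common k-subsequence occurrences, compute the lambda exponent L for each, and drop pairs with L > m. This sketch is for intuition only (it is exponential in the string lengths, and the names are ours, not Weka's); passing a very large m gives the plain, unpruned SSK.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative brute-force SSK-LP: sum lambda^L over all matching pairs of
// k-subsequence occurrences whose total exponent L does not exceed m.
// Names are ours, not Weka's; exponential cost, for intuition only.
public class SskBruteForce {

    static final double LAMBDA = 0.5;

    // all increasing index tuples of length k into s
    static List<int[]> tuples(String s, int k) {
        List<int[]> out = new ArrayList<>();
        collect(s, k, 0, new int[k], 0, out);
        return out;
    }

    static void collect(String s, int k, int depth, int[] idx, int from,
                        List<int[]> out) {
        if (depth == k) { out.add(idx.clone()); return; }
        for (int i = from; i < s.length(); i++) {
            idx[depth] = i;
            collect(s, k, depth + 1, idx, i + 1, out);
        }
    }

    // SSK-LP; pass m = Integer.MAX_VALUE for the unpruned SSK
    static double sskLp(int k, int m, String s, String t) {
        double sum = 0.0;
        for (int[] i : tuples(s, k)) {
            for (int[] j : tuples(t, k)) {
                boolean match = true;
                for (int p = 0; p < k; p++) {
                    if (s.charAt(i[p]) != t.charAt(j[p])) { match = false; break; }
                }
                if (!match) continue;
                // L = span of the match in s plus span of the match in t
                int L = (i[k - 1] - i[0] + 1) + (j[k - 1] - j[0] + 1);
                if (L <= m) sum += Math.pow(LAMBDA, L);
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sskLp(2, 8, "AxxxxxxxxxB", "AyB")); // pruned: 0.0
        System.out.println(sskLp(2, 8, "AxxB", "AyyB"));       // 0.5^8 = 0.00390625
    }
}
```

This reproduces Example 3: the "AB" match spanning 11+3=14 characters is pruned away with m=8, while the match spanning 4+4=8 characters is just accepted.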
Normalization
When the string kernel is used for its main purpose, as the kernel of a support vector machine, it is not normalized. The normalized kernel can be switched on with -N (feature space normalization), but it is much slower. As with most unnormalized kernels, K(x,x) is not a fixed value; see the next example.
Example 4
SSK(2,"ab","ab") = 0.5^4 = 0.0625
SSK(2,"AxxxxxxxxxB","AxxxxxxxxxB") = 12.761724710464478
SSK is evaluated twice, each time on two identical strings. A good measure of similarity should produce the same value in both cases, indicating the same (maximal) level of similarity, yet the unnormalized values differ widely. The normalized SSK yields 1.0 in both cases, so for the purpose of computing string similarity the normalized kernel should be used. For SVMs, the unnormalized kernel is usually sufficient.
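Feature space normalization is the standard kernel normalization K^(s,t) = K(s,t) / sqrt(K(s,s) * K(t,t)). A minimal sketch, reusing the uncached SSK recursion from [1] (names are ours, not Weka's):

```java
// Illustrative feature-space normalization of the SSK:
// K^(s,t) = K(s,t) / sqrt(K(s,s) * K(t,t)); names are ours, not Weka's.
public class SskNormalized {

    static final double LAMBDA = 0.5;

    // auxiliary kernel K'_i from the SSK recursion in [1]
    static double kPrime(int i, String s, String t) {
        if (i == 0) return 1.0;
        if (Math.min(s.length(), t.length()) < i) return 0.0;
        char x = s.charAt(s.length() - 1);
        String head = s.substring(0, s.length() - 1);
        double sum = LAMBDA * kPrime(i, head, t);
        for (int j = 0; j < t.length(); j++) {
            if (t.charAt(j) == x) {
                sum += kPrime(i - 1, head, t.substring(0, j))
                        * Math.pow(LAMBDA, t.length() - j + 1);
            }
        }
        return sum;
    }

    // unnormalized subsequence kernel K_k(s, t)
    static double ssk(int k, String s, String t) {
        if (Math.min(s.length(), t.length()) < k) return 0.0;
        char x = s.charAt(s.length() - 1);
        String head = s.substring(0, s.length() - 1);
        double sum = ssk(k, head, t);
        for (int j = 0; j < t.length(); j++) {
            if (t.charAt(j) == x) {
                sum += kPrime(k - 1, head, t.substring(0, j)) * LAMBDA * LAMBDA;
            }
        }
        return sum;
    }

    // normalized kernel: 1.0 whenever s equals t
    static double normalized(int k, String s, String t) {
        double denom = Math.sqrt(ssk(k, s, s) * ssk(k, t, t));
        return denom == 0.0 ? 0.0 : ssk(k, s, t) / denom;
    }

    public static void main(String[] args) {
        System.out.println(ssk(2, "ab", "ab"));                        // 0.0625
        System.out.println(normalized(2, "ab", "ab"));                 // 1.0
        System.out.println(normalized(2, "AxxxxxxxxxB", "AxxxxxxxxxB")); // 1.0
    }
}
```

By construction, the normalized kernel returns 1.0 for identical strings, which is why it is the right choice when the kernel value is interpreted as a similarity score.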
Complexity of SSK and SSK-LP
The time complexity of this method (without lambda pruning and with an infinitely large cache) is
O(k * |s| * |t|)
Lambda pruning has a complexity (without caching) of
O(m * binom(m,k)^2 * (|s|+n) * |t|)
where
k ............ subsequence length (ssl)
s, t ......... the input strings
|s| .......... length of string s
binom(x,y) ... binomial coefficient x!/[(x-y)!*y!]
m ............ maxLambdaExponent (ssl-max)
Keep in mind that execution time can increase quickly for long strings and large values of k, especially if you don't use lambda pruning. With lambda pruning, computation is usually so fast that enabling the cache actually slows it down because of setup costs; therefore caching is switched off when lambda pruning is used.
For details and qualitative experiments about SSK, see [1].
For details about lambda pruning and a performance comparison of SSK and SSK-LP (SSK with lambda pruning), see [2]. Note that the complexity estimate in [2] assumes no caching of intermediate results; caching has since been implemented and greatly improves the speed of the SSK without lambda pruning.
Notes for usage within Weka
Only instances of the following form can be processed using string kernels:
+----------+-------------+---------------+
|attribute#|      0      |       1       |
+----------+-------------+---------------+
| content  | [text data] | [class label] |
+----------------------------------------+

... or ...

+----------+---------------+-------------+
|attribute#|       0       |      1      |
+----------+---------------+-------------+
| content  | [class label] | [text data] |
+----------------------------------------+