SVM-div

Author:
Yisong Yue <yisongyue@gmail.com>

Version: 1.02
Date: 1/07/2011

Overview
SVMdiv is a Support Vector Machine (SVM) algorithm for predicting diverse subsets (of documents). It is a supervised learning approach to selecting for diversity (in information retrieval). Rather than predicting rankings (as is commonly done in information retrieval), SVMdiv learns to predict (i.e. retrieve) a subset of the candidate documents. Given training data with labeled subtopics, SVMdiv learns model parameters with the goal of minimizing subtopic loss in the predicted subsets. It is an implementation of the SVMdiv method described in [1] (with a specific joint feature map formulation).

Predicting subsets can be thought of as a type of structured prediction. SVMdiv is implemented using SVMpython, which exposes a Python interface to SVMstruct. SVMstruct is a general SVM framework for learning structured prediction tasks and was developed by Thorsten Joachims. For more algorithmic details, refer to [1] for SVMdiv and [2] for SVMstruct.
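
To make the idea of subtopic loss more concrete, here is a minimal sketch in Python 3. It is not taken from the SVM-div source. It assumes a simplified, unweighted loss: the fraction of a query's subtopics that no selected document covers. The loss actually used in [1] may weight subtopics differently, and the function name subtopic_loss is hypothetical.

```python
def subtopic_loss(labels, selected):
    """Unweighted subtopic loss (an illustrative assumption, not SVM-div's code).

    labels   -- list of binary label strings, one per candidate document
    selected -- indices of the documents in the predicted subset
    Returns the fraction of the query's subtopics left uncovered.
    """
    num_subtopics = len(labels[0])
    # Subtopics covered by at least one candidate document.
    all_topics = {t for lab in labels
                  for t in range(num_subtopics) if lab[t] == "1"}
    # Subtopics covered by the selected subset.
    covered = {t for i in selected
               for t in range(num_subtopics) if labels[i][t] == "1"}
    if not all_topics:
        return 0.0
    return len(all_topics - covered) / len(all_topics)


labels = ["110", "010", "001"]
loss_one_doc = subtopic_loss(labels, [0])      # covers subtopics 1 and 2 only
loss_two_docs = subtopic_loss(labels, [0, 2])  # covers all three subtopics
```

Under this simplified definition, a subset that covers every subtopic has loss 0, which is why selecting redundant documents does not help.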

Source Code & Data Files
You can download the source code of SVMdiv from the following location:

http://projects.yisongyue.com/svmdiv/downloads/svm-div-v1.02.tar.gz

The package above also contains a dataset based on the TREC 6-8 Interactive Track.

Compiling
Currently, SVM-div only works in a Linux environment. Windows users can run SVM-div within Cygwin. To compile, simply run 'make' in the svm-div directory.

**NOTE** - SVM-div requires Python version 2.4 or newer in order to run properly. The Makefile contains the following line:

PYTHON = python

This line indicates the location of the Python program to use. Currently, it is set to the default Python program. If your default Python program is older than version 2.4, then you will need to change this line to the location of a newer version of Python. For example:

PYTHON = /opt/bin/python2.4

You can download the latest version of Python at http://www.python.org/download/

Input Data Format
The input file that SVM-div reads is an index file listing the path and filename of every data file. Each data file should contain all the documents (aka examples) for a single query (aka set of examples).
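
As an illustration of the index file idea, the following Python 3 sketch collects data file paths from an index file. It is not part of SVM-div. It assumes the index file lists one path per line and that blank lines can be skipped; check train_index_file in the package for the actual layout. The function names are hypothetical.

```python
def parse_index_lines(lines):
    """Return the data-file paths found in an iterable of index-file lines.

    Assumes one path per line (an assumption, not confirmed by the README);
    surrounding whitespace is stripped and blank lines are ignored.
    """
    paths = []
    for line in lines:
        path = line.strip()
        if path:
            paths.append(path)
    return paths


def read_index_file(index_path):
    """Open an index file and return the data-file paths it lists."""
    with open(index_path) as f:
        return parse_index_lines(f)
```

Separating the line handling from the file handling keeps the parsing logic easy to check without touching the disk.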

Within a data file, each line contains the information for a single document. SVM-div assumes all documents are represented using word frequency counts obeying the following format:

[label] [word_id]:[doc_word_freq] [word_id]:[doc_word_freq] ...

Labels are represented as binary strings (e.g., '10010'), where each digit corresponds to a subtopic for that query, and 1/0 indicates relevance/non-relevance. All documents should be relevant to at least one subtopic. Subtopics are allowed to change from query to query.

In addition to document word frequencies, SVM-div also uses title words. A word is denoted as being in the title by prepending the word entry with 'T'.

Features are represented sparsely. For each document, only non-zero word frequencies need to be stored in the data file. For example, a document could be represented as:

01100 T1:0.25 2:0.25 5:0.5

Here, the document is relevant to subtopics 2 & 3 for this particular query. Word 1 has frequency 0.25, word 2 has frequency 0.25, and word 5 has frequency 0.5. Notice the 'T' designation in front of the entry for word 1. This indicates that word 1 is also present in the title of the document. SVM-div ignores the frequency of words in the title, and only considers whether a word is present in the title.

Word IDs must be consistent across all documents within a single query. Word IDs can change from query to query.
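
The line format described above can be parsed with a few lines of Python 3. This is a minimal sketch, not SVM-div's own reader. It assumes that a 'T' entry contributes both a document word frequency and title membership, and that word IDs are integers. The function name parse_document_line is hypothetical.

```python
def parse_document_line(line):
    """Parse one data-file line into (label, word_freqs, title_words).

    label       -- list of 0/1 ints, one per subtopic
    word_freqs  -- dict mapping word ID to document word frequency
    title_words -- set of word IDs whose entry carried the 'T' prefix
    """
    tokens = line.split()
    if not tokens:
        raise ValueError("empty line")
    label_str, entries = tokens[0], tokens[1:]
    if set(label_str) - {"0", "1"}:
        raise ValueError("label must be a binary string: %r" % label_str)
    label = [int(ch) for ch in label_str]

    word_freqs = {}
    title_words = set()
    for entry in entries:
        in_title = entry.startswith("T")
        if in_title:
            entry = entry[1:]  # drop the title marker
        word_id_str, freq_str = entry.split(":", 1)
        word_id = int(word_id_str)
        # Assumption: the frequency on a 'T' entry is the body frequency.
        word_freqs[word_id] = float(freq_str)
        if in_title:
            title_words.add(word_id)
    return label, word_freqs, title_words


label, word_freqs, title_words = parse_document_line("01100 T1:0.25 2:0.25 5:0.5")
# 1-based subtopic numbers for which the document is relevant
relevant_subtopics = [i + 1 for i, bit in enumerate(label) if bit == 1]
```

Storing frequencies in a dict mirrors the sparse representation: only the words that appear on the line take up space.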

Refer to train_index_file and test_index_file for example index files. The data files are in the folder TREC_Interactive_Subtopic, and are the data files used for the experiments in [1].

Model Config File & Feature Implementation
The file 'config.txt' contains configuration settings for SVM-div.