Ted Pedersen - Free Software for Natural Language Processing from the NLP group at UMD
[This page is out of date. Please contact me for more current info.]
This is a directory of software developed by the Natural Language Processing Group at the University of Minnesota, Duluth. It is mostly in Perl, and always freely available under the terms of the GNU General Public License (GPL). Many of these projects are available via CPAN and SourceForge.
Unsupervised Corpus Based Clustering of Similar Contexts
SenseClusters is a package of Perl programs that allows a user to cluster similar contexts together using unsupervised knowledge-lean methods. These techniques have been applied to word sense discrimination, email categorization, and name discrimination.
Collocation Identification
NSP allows you to identify word n-grams in large corpora using standard tests of association such as Fisher's exact test, the log likelihood ratio, Pearson's chi-squared test, and the Dice Coefficient.
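To make two of the association tests named above concrete, here is a minimal Python sketch that scores a single bigram from its counts. It is only an illustration of the textbook formulas, not NSP's own code or interface; the function names and the argument names (n11 for the joint count, n1p and np1 for the marginal counts, npp for the total number of bigrams) are assumptions made for this example.

```python
import math


def contingency(n11, n1p, np1, npp):
    """Fill in the 2x2 table for a bigram (w1, w2).

    n11: times w1 is followed by w2
    n1p: times w1 occurs in the first position
    np1: times w2 occurs in the second position
    npp: total number of bigrams in the corpus
    """
    n12 = n1p - n11              # w1 followed by something other than w2
    n21 = np1 - n11              # w2 preceded by something other than w1
    n22 = npp - n1p - np1 + n11  # neither w1 first nor w2 second
    return n11, n12, n21, n22


def dice(n11, n1p, np1):
    """Dice coefficient: twice the joint count over the sum of the marginals."""
    return 2.0 * n11 / (n1p + np1)


def log_likelihood(n11, n1p, np1, npp):
    """Log likelihood ratio (G^2) of observed counts against independence."""
    observed = contingency(n11, n1p, np1, npp)
    rows = (n1p, npp - n1p)
    cols = (np1, npp - np1)
    # Expected counts under independence, in the same cell order as observed.
    expected = [rows[i] * cols[j] / npp for i in range(2) for j in range(2)]
    # Cells with a zero observed count contribute nothing to the sum.
    return 2.0 * sum(o * math.log(o / e)
                     for o, e in zip(observed, expected) if o > 0)
```

A bigram whose words occur together exactly as often as chance predicts gets a log likelihood score of zero, and the score grows as the observed counts move away from independence.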
WordNet Resources
WordNet::Similarity allows you to measure the similarity and relatedness of two concepts in the WordNet lexical database using a variety of measures of semantic similarity and relatedness.
WordNet::SenseRelate allows you to assign meanings to each content word in a text. It does this by determining which sense of a word is most related to its neighbors.
A few miscellaneous programs that help us work with WordNet.
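The idea behind WordNet::SenseRelate, choosing for each word the sense most related to its neighbors, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the package's algorithm or API: the sense inventory, the sense labels, the toy relatedness function, and the window parameter are all hypothetical stand-ins for WordNet and a real relatedness measure.

```python
def disambiguate(words, senses, relatedness, window=1):
    """For each word, pick the sense with the highest total relatedness
    to the best-matching sense of each neighbor within the window."""
    chosen = []
    for i, word in enumerate(words):
        lo = max(0, i - window)
        hi = min(len(words), i + window + 1)
        neighbors = [words[j] for j in range(lo, hi) if j != i]
        best_sense, best_score = None, float("-inf")
        for sense in senses[word]:
            # Each neighbor contributes the score of its most related sense.
            score = sum(max(relatedness(sense, other) for other in senses[n])
                        for n in neighbors)
            if score > best_score:
                best_sense, best_score = sense, score
        chosen.append(best_sense)
    return chosen


# Hypothetical toy sense inventory; the label after '#' names a topic.
SENSES = {
    "deposit": ["deposit#money", "deposit#sediment"],
    "bank": ["bank#money", "bank#river"],
    "loan": ["loan#money"],
}


def toy_relatedness(a, b):
    """Toy measure: 1.0 if two senses share a topic label, else 0.0."""
    return 1.0 if a.split("#")[1] == b.split("#")[1] else 0.0


result = disambiguate(["deposit", "bank", "loan"], SENSES, toy_relatedness)
print(result)
```

In this toy setting every word is assigned its money sense, because that is the only topic shared with its neighbors. A real system would replace toy_relatedness with a measure computed over WordNet.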
UMLS Resources
UMLS::Similarity allows you to measure the similarity and relatedness of two concepts in the Unified Medical Language System (UMLS) using a variety of measures of semantic similarity and relatedness.
UMLS::Interface provides a Perl interface to the Unified Medical Language System (UMLS) and provides much of the functionality that enables UMLS::Similarity.
Supervised Methods of Word Sense Disambiguation
This is a suite of tools that allows for easy creation of supervised word sense disambiguation experiments.
This is a greatly improved version of the Duluth-Shell as used in the DuluthX Senseval-2 systems. It makes it easier to run large numbers of experiments, and provides many detailed reporting options.
This extends the Duluth Senseval-2 systems with part of speech and syntactic features. This system participated in Senseval-3 (2004).
Complete source code and documentation for the Duluth systems that participated in the Senseval-3 (2004) comparative exercise among word sense disambiguation systems. This includes supervised lexical sample systems based on the Duluth Senseval-2 systems, and a new unsupervised lexical sample system.
Complete source code and documentation for the Duluth systems that participated in the lexical sample tasks of the Senseval-2 (2001) comparative exercise among word sense disambiguation systems. These systems rely on lexical features like unigrams, bigrams, and co-occurrences.
This is a complete word sense disambiguation system that integrates NSP and Weka into the GATE environment.
This is a complete word sense disambiguation system that assigns senses to biomedical text based on the UMLS.
Data and Data Creation Tools
We support conversion of data from a number of formats into the Senseval-2 format for lexical sample word sense disambiguation. You can find those tools here!
We have converted a variety of sense-tagged text into the Senseval-2 format. We provide both copies of the converted data as well as the source code used to create it.
Process Senseval-2 formatted data using the Brill POS Tagger and the Collins Parser.
Tools for automatic and manual alignment of parallel text.
Web Mining
GoogleHack finds sets of related words using the Google search engine.
By: Ted Pedersen - tpederse AT d umn edu