The Stanford NLP Group

Stanford Biomedical Event Parser (SBEP)

Event Extraction for the BioNLP 2009/2011 shared task

About

This software is the event parser component from the Stanford and FAUST submissions to the BioNLP shared task. It does not currently include the event reranker component; the performance of the parser alone is generally around 0.5-1% lower than that of the reranked parser. The event parser system is described in:

David McClosky, Mihai Surdeanu, and Christopher D. Manning. 2011. Event Extraction as Dependency Parsing. In Proceedings of the Association for Computational Linguistics - Human Language Technologies 2011 Conference (ACL-HLT 2011), Main Conference.

The underlying parser is MSTParser, created by Ryan McDonald and Jason Baldridge. Thanks to them for making their code available! The Stanford Event Parser code is licensed under the full GPL, which allows its use for research purposes, free software projects, software services, etc., but not in distributed proprietary software. The download requires Java 1.6.

Downloads

Download Stanford Biomedical Event Parser code (version 1.0, 1.9MB)

Download our parses, tokenizations, and triggers (version 1.0, 13MB)

Download our PubMed distributional similarity word clusters (version 1.0, 2MB)

Download Stanford CoreNLP (version 1.1.0, no models, 31MB)

Usage

Obtaining BioNLP shared task data
The BioNLP shared task data are part of the training data for building event parser models. They can be found here. If you use our parses, tokenizations, and triggers archive, you do not need anything from the supporting downloads page.
Obtaining libraries
To run the event parser, you will need the jar files from Stanford CoreNLP (version 1.1.0; if you require the Event Parser to work against a different version of CoreNLP, please let us know) as well as GNU Trove (version 2.0.4 works for us, version 3 currently does not). You will need to add libraries from these distributions to your _classpath_. From Stanford CoreNLP, add stanford-corenlp-_version_.jar (where _version_ is the version of the Stanford CoreNLP distribution), jgrapht.jar, jgraph.jar and xom.jar. From GNU Trove, add the appropriate trove-_versionnumber_.jar file.
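As a rough illustration, the classpath described above could be assembled as follows. This is only a sketch for a Unix-like shell, where `:` separates classpath entries. The helper name `build_classpath` and all directory arguments are hypothetical, and it assumes the four CoreNLP jars sit together in one directory; only the jar file names come from the text above.

```shell
#!/bin/sh
# Sketch only: join the jars named above into a single classpath string.
# All directory arguments are placeholders for wherever you unpacked each download.
build_classpath() {
  # $1 = location of the event parser classes or jar (assumed; not named in the text)
  # $2 = Stanford CoreNLP directory, $3 = CoreNLP version string
  # $4 = GNU Trove directory,        $5 = Trove version (2.0.4 is reported to work)
  printf '%s:%s:%s:%s:%s:%s\n' \
    "$1" \
    "$2/stanford-corenlp-$3.jar" \
    "$2/jgrapht.jar" \
    "$2/jgraph.jar" \
    "$2/xom.jar" \
    "$4/trove-$5.jar"
}

# Example with hypothetical paths (substitute your own):
# CP=$(build_classpath ./sbep ./corenlp "$CORENLP_VERSION" ./trove 2.0.4)
# java -cp "$CP" edu.stanford.nlp.ie.machinereading.domains.bionlp.EventParser -props my.props
```

The resulting string is what the commands in the rest of this page refer to as _classpath_.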

Filesystem setup
The event parser expects files to follow a specific directory structure, rooted at the directory named by the base.directory property. Be sure to set this property in all of your properties files. For the purposes of this documentation, this location will be referred to as _base_. Inside _base_, you should have a subdirectory for each dataset that you want to use. Each dataset is identified by a "shortName": GENIA is genia, Epigenetics is epi, and Infectious Diseases is infect. If you're using the combined GENIA and Infectious Diseases data (per our experiments on Infectious Diseases), the shortName is infect++5x. You should extract the parses, tokenizations, and triggers archive inside the _base_ directory.

Directory	Contents
base/shortName/stanford-tokenizations	Should contain a sentence-segmented and tokenized version of each document (regardless of whether the document is part of training, testing, etc.). Each file should be of the form docID.tok (e.g. PMID-9878621.tok). In each file, sentences should be newline separated and words should be separated by spaces.
base/shortName/stanford-mccc-parses	Should contain a parse for each document (regardless of whether the document is part of training, testing, etc.) from the biomedical McClosky-Charniak-Johnson parser. Each file should be of the form docID.ptb (e.g. PMID-9878621.ptb). (These can be found in the distributed parses and tokenizations file.)
base/shortName/umass-tokenizations (optional)	Same format as stanford-tokenizations but used when dataset.tokenizer=umass. These tokenizations were made by Sebastian Riedel.
base/shortName/umass-mccc-parses (optional)	Same format as stanford-mccc-parses but used when dataset.parser=umass-mccc. These should be the result of parsing umass-tokenizations with the McClosky-Charniak-Johnson parser.
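If you are preparing your own data rather than extracting the distributed archive, the layout in the table above could be created and checked with something like the following. This is a sketch under assumptions: the helper names `make_layout` and `missing_parses` are hypothetical, and it covers only the two required directories, not the optional umass ones.

```shell
#!/bin/sh
# Sketch only: create the required per-dataset directories from the table above.
make_layout() {
  # $1 = base directory, remaining arguments = dataset shortNames (e.g. genia epi infect)
  base=$1
  shift
  for short_name in "$@"; do
    mkdir -p "$base/$short_name/stanford-tokenizations" \
             "$base/$short_name/stanford-mccc-parses"
  done
}

# Print the docID of every document that has a .tok file but no matching .ptb parse.
missing_parses() {
  # $1 = base directory, $2 = dataset shortName
  for tok in "$1/$2/stanford-tokenizations"/*.tok; do
    [ -e "$tok" ] || continue          # the glob matched nothing
    doc_id=$(basename "$tok" .tok)
    [ -e "$1/$2/stanford-mccc-parses/$doc_id.ptb" ] || printf '%s\n' "$doc_id"
  done
}

# Example with a hypothetical base directory:
# make_layout "$HOME/bionlp" genia epi infect
# missing_parses "$HOME/bionlp" genia
```

An empty result from `missing_parses` means every tokenized document has a parse with the same docID, which is what the table requires.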

Configuration
See the files trigger_classifier_defaults.props and event_parser_defaults.props from the code distribution as a basis. Each file tells you which settings should be adjusted to fit your system and which ones can likely be left alone.
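Since base.directory must be set in every properties file, one possible way to derive your own file from a defaults file is sketched below. The helper name `set_base_directory` and the output file names are hypothetical, and the sketch assumes the files use ordinary `key = value` lines and a Unix-style path; check the comments in the defaults files for the other settings you may need to adjust.

```shell
#!/bin/sh
# Sketch only: copy a defaults file and point base.directory at your data.
set_base_directory() {
  # $1 = defaults file, $2 = output properties file (must differ from $1), $3 = base directory
  # Drop any existing base.directory line, then append the new value.
  grep -v '^[[:space:]]*base\.directory[[:space:]]*[=:]' "$1" > "$2"
  printf 'base.directory = %s\n' "$3" >> "$2"
}

# Example with hypothetical output names and base directory:
# set_base_directory trigger_classifier_defaults.props my_trigger.props "$HOME/bionlp"
# set_base_directory event_parser_defaults.props my_event_parser.props "$HOME/bionlp"
```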
Running the tokenizer (optional)
Our code sometimes doesn't work well with the default BioNLP 2011 task tokenizations. Our own tokenizations/segmentations can be found in the distributed parses and tokenizations file, or you can produce them by running the class RunBioNLPTokenizer over your base directory:
java -cp _classpath_ edu.stanford.nlp.ie.machinereading.domains.bionlp.RunBioNLPTokenizer -base.directory _base_
Running the trigger classifier (optional)
This step is optional if you've downloaded the parses, tokenizations, and triggers archive and you're working off BioNLP 2011 data. If you're working on other (non-shared-task) data, you'll need to run the trigger classifier over it. The trigger classifier can be run with the following command:
java -cp _classpath_ edu.stanford.nlp.ie.machinereading.domains.bionlp.TriggerClassifier -props _properties_
where _properties_ is the filename for your trigger classifier properties from the Configuration step.
Running the event parser
The event parser can be run with the following command:
java -cp _classpath_ edu.stanford.nlp.ie.machinereading.domains.bionlp.EventParser -props _properties_
where _properties_ is the filename for your event parser properties from the Configuration step.
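The three commands above can be chained in order, stopping at the first failure. The wrapper below is a minimal sketch, not part of the distribution: `run_step`, `run_pipeline`, and the `DRY_RUN` switch are hypothetical names, while the class names and flags are the ones shown above. Drop the first two steps if you are using the distributed tokenizations and triggers.

```shell
#!/bin/sh
# Sketch only: run tokenizer, trigger classifier, and event parser in sequence.
PKG=edu.stanford.nlp.ie.machinereading.domains.bionlp

# Execute a command, or just print it when DRY_RUN=1.
run_step() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

run_pipeline() {
  # $1 = classpath, $2 = base directory
  # $3 = trigger classifier properties file, $4 = event parser properties file
  run_step java -cp "$1" "$PKG.RunBioNLPTokenizer" -base.directory "$2" &&
  run_step java -cp "$1" "$PKG.TriggerClassifier" -props "$3" &&
  run_step java -cp "$1" "$PKG.EventParser" -props "$4"
}

# Example with hypothetical file names; set DRY_RUN=1 first to preview the commands:
# run_pipeline "$CP" "$HOME/bionlp" my_trigger.props my_event_parser.props
```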

Frequently Asked Questions

**What is sanity check 1? Does it matter that it's failing frequently?** Sanity check 1 tests whether events and their arguments are in the same sentence. Since there are quite a large number of cases in the BioNLP corpora where events and their arguments are not in the same sentence, frequent failures should not really be a concern. However, edges connecting events and arguments across sentences are dropped, so these cases do matter if you're working on improving the event parser.

**Other questions** Please email David McClosky and Mihai Surdeanu if you have other questions. The distribution is still in beta and likely in need of more testing, so feel free to ask.

Release History

Version 1.0	August 2, 2011	Initial release