Coreference Resolution

Description

The CorefAnnotator finds mentions of the same entity in a text, such as when “Theresa May” and “she” refer to the same person. The annotator implements both pronominal and nominal coreference resolution. The entire coreference graph (with head words of mentions as nodes) is saved as a CorefChainAnnotation.

Overview

There are three different coreference systems available in CoreNLP: deterministic, statistical, and neural.

(We briefly also had a fourth hybrid or hcoref system, but it is no longer supported and models are no longer provided in current releases.)

The following table gives an overview of each system's speed and accuracy.

System         Language  Preprocessing Time  Coref Time  Total Time  F1 Score
Deterministic  English   3.87s               0.11s       3.98s       49.5
Statistical    English   0.48s               1.23s       1.71s       56.2
Neural         English   3.22s               4.96s       8.18s       60.0
Deterministic  Chinese   0.39s               0.16s       0.55s       47.5
Neural         Chinese   0.42s               7.02s       7.44s       53.9

Command Line Usage

There are example properties files for using the coreference systems in edu/stanford/nlp/coref/properties. The properties are named [system]-[language].properties. For example, to run the deterministic system on Chinese:

java -cp stanford-corenlp-4.0.0.jar:stanford-chinese-corenlp-models-4.0.0.jar:* edu.stanford.nlp.pipeline.StanfordCoreNLP -props edu/stanford/nlp/coref/properties/deterministic-chinese.properties -file example_file.txt
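
The naming convention above can be captured in a few lines of Java. This is only a sketch: the class CorefPropertiesPath and its method forSystem are hypothetical helpers, not part of CoreNLP, and whether a properties file actually exists for a given system and language depends on the release you are using.

```java
import java.util.Locale;

// Hypothetical helper (not part of CoreNLP): builds the resource path of a
// coref properties file from the [system]-[language].properties convention.
final class CorefPropertiesPath {
  // Directory that holds the example properties files, as named in the text.
  static final String BASE = "edu/stanford/nlp/coref/properties";

  // Lower-cases both parts and joins them as BASE/system-language.properties.
  static String forSystem(String system, String language) {
    return BASE + "/"
        + system.toLowerCase(Locale.ROOT) + "-"
        + language.toLowerCase(Locale.ROOT) + ".properties";
  }

  public static void main(String[] args) {
    // Value to pass after -props for the deterministic Chinese system.
    System.out.println(forSystem("deterministic", "chinese"));
  }
}
```

The returned string is what would follow -props on the command line; the helper does not check that the file is on the classpath.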

Alternatively, the properties can be set manually. For example, to run the neural system on English:

java -cp stanford-corenlp-4.0.0.jar:stanford-corenlp-4.0.0-models.jar:* edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,pos,lemma,ner,parse,coref -coref.algorithm neural -file example_file.txt

See below for further options.
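
When the pipeline is built from Java rather than from the command line, the same settings can be expressed as a java.util.Properties object. The sketch below mirrors the two flags from the command above; the class CorefProps and its method forAlgorithm are hypothetical names used only for illustration, and "neural" is the only algorithm value taken from this page.

```java
import java.util.Properties;

// Hypothetical helper (not part of CoreNLP): collects the property keys that
// correspond to the -annotators and -coref.algorithm command-line flags.
final class CorefProps {
  static Properties forAlgorithm(String algorithm) {
    Properties props = new Properties();
    // Same annotator list as in the command-line example.
    props.setProperty("annotators", "tokenize,pos,lemma,ner,parse,coref");
    // Selects which coreference system the coref annotator runs.
    props.setProperty("coref.algorithm", algorithm);
    return props;
  }

  public static void main(String[] args) {
    Properties props = forAlgorithm("neural");
    System.out.println(props.getProperty("annotators"));
    System.out.println(props.getProperty("coref.algorithm"));
  }
}
```

A Properties object built this way can be handed to the StanfordCoreNLP constructor in the same way as in the API example on this page.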

API

The following example shows how to access coref and mention information from an Annotation:

import java.util.Properties;

import edu.stanford.nlp.coref.CorefCoreAnnotations;
import edu.stanford.nlp.coref.data.CorefChain;
import edu.stanford.nlp.coref.data.Mention;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class CorefExample {
  public static void main(String[] args) throws Exception {
    // Text to annotate.
    Annotation document = new Annotation("Barack Obama was born in Hawaii.  He is the president. Obama was elected in 2008.");
    // Build a pipeline whose last annotator is coref.
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,pos,lemma,ner,parse,coref");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    pipeline.annotate(document);
    // Print every coreference chain found in the document.
    System.out.println("---");
    System.out.println("coref chains");
    for (CorefChain cc : document.get(CorefCoreAnnotations.CorefChainAnnotation.class).values()) {
      System.out.println("\t" + cc);
    }
    // Print the mentions detected in each sentence.
    for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
      System.out.println("---");
      System.out.println("mentions");
      for (Mention m : sentence.get(CorefCoreAnnotations.CorefMentionsAnnotation.class)) {
        System.out.println("\t" + m);
      }
    }
  }
}

More Details

Deterministic System

This is a multi-pass sieve rule-based coreference system. See the Stanford Deterministic Coreference Resolution System page for usage and more details.

Statistical System

This is a mention-ranking model using a large set of features. It operates by iterating through each mention in the document, possibly adding a coreference link between the current one and a preceding mention at each step. Some relevant options:
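
The iteration described above (visit each mention, optionally link it to one preceding mention) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the class MentionRankingSketch, the string-overlap scorer, the threshold, and the maxDistance parameter are hypothetical and do not reflect the features or settings of the CoreNLP model.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Toy mention-ranking loop; NOT the CoreNLP implementation.
final class MentionRankingSketch {
  // Crude stand-in for a head word: the last whitespace-separated token.
  static String lastToken(String mention) {
    String[] tokens = mention.trim().toLowerCase(Locale.ROOT).split("\\s+");
    return tokens[tokens.length - 1];
  }

  // Hypothetical scorer: 1.0 for identical strings, 0.6 for a shared last token.
  static double score(String antecedent, String mention) {
    if (antecedent.equalsIgnoreCase(mention)) return 1.0;
    return lastToken(antecedent).equals(lastToken(mention)) ? 0.6 : 0.0;
  }

  // For each mention, returns the index of the best-scoring preceding mention
  // within maxDistance mentions, or -1 if no score exceeds the threshold.
  static int[] link(List<String> mentions, int maxDistance, double threshold) {
    int[] antecedent = new int[mentions.size()];
    for (int i = 0; i < mentions.size(); i++) {
      antecedent[i] = -1;
      double best = threshold;
      for (int j = i - 1; j >= Math.max(0, i - maxDistance); j--) {
        double s = score(mentions.get(j), mentions.get(i));
        if (s > best) {
          best = s;
          antecedent[i] = j;
        }
      }
    }
    return antecedent;
  }

  public static void main(String[] args) {
    List<String> mentions =
        Arrays.asList("Barack Obama", "Hawaii", "He", "the president", "Obama");
    // The toy scorer cannot resolve the pronoun "He"; only "Obama" gets a link.
    System.out.println(Arrays.toString(link(mentions, 50, 0.5)));
  }
}
```

Shrinking maxDistance makes the inner loop shorter but can drop links to distant antecedents: with a distance of 2, "Obama" no longer reaches "Barack Obama" in this example.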

Neural System

This is a neural-network-based mention-ranking model. Some relevant options:

Running on CoNLL 2012

Deterministic System

If you’d like to benchmark our deterministic system on the CoNLL 2011/2012 shared tasks, see the Usage section for the Stanford Deterministic Coreference Resolution System.

Usage Example

To use the English deterministic system, you need to use the dcoref annotator.

java edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,pos,lemma,ner,parse,dcoref -file example.txt

Statistical and Neural Systems

If you would like to run our statistical or neural systems on the CoNLL 2012 eval data:

  1. Get the CoNLL scoring script from here
  2. Get the CoNLL 2012 eval data from here
  3. Run the CorefSystem main method. For example, for the English neural system:
java -cp stanford-corenlp-4.0.0.jar:stanford-corenlp-4.0.0-models.jar:* edu.stanford.nlp.coref.CorefSystem -props edu/stanford/nlp/coref/properties/neural-english-conll.properties -coref.data <path-to-conll-data> -coref.conllOutputPath <where-to-save-system-output> -coref.scorer <path-to-scoring-script>
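
If the evaluation is launched from a Java program or build script, the command in step 3 can be assembled as an argument list. This is a minimal sketch: the class ConllEvalCommand and its build method are hypothetical, all paths are supplied by the caller, and only the main class and flag names are taken from the command above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of CoreNLP): assembles the CorefSystem
// command line from caller-supplied paths without running it.
final class ConllEvalCommand {
  static List<String> build(String classpath, String propsFile, String dataPath,
                            String outputPath, String scorerPath) {
    List<String> cmd = new ArrayList<>();
    cmd.add("java");
    cmd.add("-cp");
    cmd.add(classpath);
    cmd.add("edu.stanford.nlp.coref.CorefSystem");
    cmd.add("-props");
    cmd.add(propsFile);
    cmd.add("-coref.data");              // location of the CoNLL data
    cmd.add(dataPath);
    cmd.add("-coref.conllOutputPath");   // where system output is written
    cmd.add(outputPath);
    cmd.add("-coref.scorer");            // path to the CoNLL scoring script
    cmd.add(scorerPath);
    return cmd;
  }

  public static void main(String[] args) {
    // Expects five arguments: classpath, props file, data, output, scorer.
    if (args.length != 5) {
      System.err.println("usage: ConllEvalCommand <cp> <props> <data> <out> <scorer>");
      return;
    }
    System.out.println(String.join(" ", build(args[0], args[1], args[2], args[3], args[4])));
  }
}
```

The resulting list can be passed to java.lang.ProcessBuilder to start the run.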

The CoNLL 2012 coreference data differs from the normal coreference use case in a few ways:

Because of this, we train models with a few extra features for running on this dataset. We configure these models for accuracy over speed (e.g., by not having a maximum mention distance for the mention-ranking models). These models can be run using the -conll properties files (e.g., neural-english-conll.properties). Note that the CoNLL-specific models for English are in the English models jar, not the default CoreNLP models jar.

Training New Models

Deterministic System

As a rule-based system, there is nothing to train, but there are various data files for demonyms and to indicate noun gender, animacy, and plurality, which can be edited. See the Stanford Deterministic Coreference Resolution System page.

Statistical System

Training a statistical model on the CoNLL data can be done with the following command:

java -cp stanford-corenlp-4.0.0.jar:stanford-corenlp-4.0.0-models.jar:* edu.stanford.nlp.coref.statistical.StatisticalCorefTrainer -props <properties-file>

See here for an example properties file. Training over the full CoNLL 2012 training set requires a large amount of memory. To reduce the memory footprint and runtime of training, the following options can be added to the properties file:

Neural System

The code for training the neural coreference system is implemented in Python. It is available on GitHub here.

Citing Stanford Coreference

The deterministic coreference system for English

Marta Recasens, Marie-Catherine de Marneffe, and Christopher Potts. 2013. The Life and Death of Discourse Entities: Identifying Singleton Mentions. In Proceedings of the NAACL.

Heeyoung Lee, Yves Peirsman, Angel Chang, Nathanael Chambers, Mihai Surdeanu, and Dan Jurafsky. 2011. Stanford’s Multi-Pass Sieve Coreference Resolution System at the CoNLL-2011 Shared Task. In Proceedings of the CoNLL-2011 Shared Task.

Karthik Raghunathan, Heeyoung Lee, Sudarshan Rangarajan, Nathanael Chambers, Mihai Surdeanu, Dan Jurafsky, and Christopher Manning. 2010. A Multi-Pass Sieve for Coreference Resolution. Empirical Methods in Natural Language Processing (EMNLP).

The deterministic coreference system for Chinese and English

Heeyoung Lee, Angel Chang, Yves Peirsman, Nathanael Chambers, Mihai Surdeanu, and Dan Jurafsky. 2013. Deterministic coreference resolution based on entity-centric, precision-ranked rules. In Computational Linguistics 39(4).

The statistical coreference system

Kevin Clark and Christopher D. Manning. 2015. Entity-Centric Coreference Resolution with Model Stacking. In Proceedings of the ACL.

The neural coreference system

Kevin Clark and Christopher D. Manning. 2016. Deep Reinforcement Learning for Mention-Ranking Coreference Models. In Proceedings of EMNLP.

Kevin Clark and Christopher D. Manning. 2016. Improving Coreference Resolution by Learning Entity-Level Distributed Representations. In Proceedings of the ACL.