The Phrase Detective Multilingual Corpus, Release 0.1
Related papers
Anaphoric annotation in the ARRAU corpus
Proceedings of the …, 2008
ARRAU is a new corpus annotated for anaphoric relations, with information about agreement and explicit representation of multiple antecedents for ambiguous anaphoric expressions and discourse antecedents for expressions which refer to abstract entities such as events, actions and plans. The corpus contains texts from different genres: task-oriented dialogues from the Trains-91 and Trains-93 corpora, narratives from the English Pear Stories corpus, newspaper articles from the Wall Street Journal portion of the Penn Treebank, and mixed text from the Gnome corpus.
ARRAU: Linguistically-Motivated Annotation of Anaphoric Descriptions
2016
This paper presents a second release of the ARRAU dataset: a multi-domain corpus with thorough linguistically motivated annotation of anaphora and related phenomena. Building upon the first release almost a decade ago, considerable effort has been invested in improving the data both quantitatively and qualitatively. We have doubled the corpus size, expanded the selection of covered phenomena to include referentiality and genericity, and designed and implemented a methodology for enforcing the consistency of the manual annotation. We believe that the new release of ARRAU provides valuable material for ongoing research on complex cases of coreference as well as for a variety of related tasks. The corpus is publicly available through LDC.
Annotating a broad range of anaphoric phenomena, in a variety of genres: the ARRAU Corpus
Natural Language Engineering
This paper presents the second release of ARRAU, a multigenre corpus of anaphoric information created over 10 years to provide data for the next generation of coreference/anaphora resolution systems combining different types of linguistic and world knowledge with advanced discourse modeling supporting rich linguistic annotations. The distinguishing features of ARRAU include the following: treating all NPs as markables, including non-referring NPs, and annotating their (non-)referentiality status; distinguishing between several categories of non-referentiality and annotating non-anaphoric mentions; thorough annotation of markable boundaries (minimal/maximal spans, discontinuous markables); annotating a variety of mention attributes, ranging from morphosyntactic parameters to semantic category; annotating the genericity status of mentions; annotating a wide range of anaphoric relations, including bridging relations and discourse deixis; and, finally, annotating anaphoric ambiguity. T...
Annotation of anaphoric expressions in an aligned bilingual corpus
This paper discusses a French-English corpus annotated and aligned at the anaphoric level. It also presents an annotation scheme based on the study of a detailed corpus featuring different types of correspondences and mismatches. The scheme, which is adapted from the EAGLES recommendations, supports alignment at the anaphoric level and caters for the different kinds of mismatches.
Constructing an anaphorically annotated corpus with non-experts
Proceedings of the 2009 Workshop on The People's Web Meets NLP Collaboratively Constructed Semantic Resources - People's Web '09, 2009
This paper reports on the ongoing work of Phrase Detectives, an attempt to create a very large anaphorically annotated text corpus. Annotated corpora of the size needed for modern computational linguistics research cannot be created by small groups of hand-annotators; however, the ESP game and similar games with a purpose have demonstrated how it might be possible to do this through Web collaboration. We show that this approach could be used to create large, high-quality natural language resources.
Annotating a large corpus with anaphoric links
2000
This paper presents a one-million-word French corpus annotated with anaphoric links. The anaphoric expressions selected are mainly grammatical discourse phenomena for which a reliable annotation could be provided. The annotation scheme, defined in XML, encodes the orientation of the anaphoric relation by using a specific element for relating the anaphoric expression to its antecedent(s). A set of five semantic relations is used to type the anaphoric relation.
Towards the Automatic Resolution of Anaphora with Non-nominal Antecedents: Insights from Annotation
ISBN, 2018
This paper deals with a particular form of anaphora in which the anaphors refer to non-nominal antecedents. We investigate two existing datasets, annotated with pronominal and nominal anaphors (shell nouns) respectively, and attempt to determine to what degree the different types of anaphors provide useful hints as to the form and location of their antecedents. To this end, we look at the distribution of the antecedents, their syntactic form, and their semantic content. In particular, as the difficulty of annotating the phenomenon constitutes a major hurdle to the development of larger datasets, we take a close look at the agreement between annotators and relate this to the different types of anaphors.
Proceedings of the Third Linguistic Annotation Workshop on - ACL-IJCNLP '09, 2009
In this paper, we present preliminary work on corpus-based anaphora resolution of discourse deixis in German. Our annotation guidelines provide linguistic tests for locating the antecedent, and for determining the semantic types of both the antecedent and the anaphor. The corpus consists of selected speaker turns from the Europarl corpus.
PHORA: A system to solve the Anaphora in Spanish
Proceedings of …, 2000
In this paper we present a complete Natural Language Processing (NLP) system for Spanish. The core of this system is the parser, which uses the Lexical-Functional Grammar (LFG) formalism. Another important component is the anaphora resolution module. To resolve anaphora, this module uses a method based on linguistic information (lexical, morphological, syntactic and semantic), structural information (the anaphoric accessibility space in which the anaphor finds its antecedent) and statistical information. The method is based on constraints and preferences and resolves pronouns and definite descriptions. Moreover, the system handles the features of both dialogue and non-dialogue discourse. The anaphora resolution module uses several resources, such as a lexical database (Spanish WordNet) that provides semantic information and a POS tagger that provides the part of speech and root of each word, to make the resolution process easier.