TIGER Corpus | Institute for Natural Language Processing | University of Stuttgart

The TIGER Corpus (versions 2.1 and 2.2) consists of approx. 900,000 tokens (50,000 sentences) of German newspaper text taken from the Frankfurter Rundschau. The corpus was semi-automatically POS-tagged and annotated with syntactic structure. It also contains morphological and lemma information for terminal nodes. For details, see the annotation section. Version 2.2 is a cleaned-up version of release 2.1.
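To make the annotation layers concrete, the following is a minimal sketch of how one sentence with terminal-level annotation (word, lemma, POS tag, morphology) and a syntactic node could be read in Python. The XML fragment is a hypothetical example written for this illustration, not an excerpt from the corpus, and the element and attribute names (`t`, `nt`, `edge`, `word`, `lemma`, `pos`, `morph`, `cat`, `idref`) as well as the helper names `read_terminals` and `read_constituents` are assumptions; check them against the actual corpus files before relying on them.

```python
import xml.etree.ElementTree as ET

# Hypothetical sentence fragment. Element names, attribute names, and the
# annotation values are assumptions made for illustration only.
SAMPLE = """
<s id="s1">
  <graph root="s1_500">
    <terminals>
      <t id="s1_1" word="Die" lemma="der" pos="ART" morph="Nom.Sg.Fem"/>
      <t id="s1_2" word="Tagung" lemma="Tagung" pos="NN" morph="Nom.Sg.Fem"/>
    </terminals>
    <nonterminals>
      <nt id="s1_500" cat="NP">
        <edge label="NK" idref="s1_1"/>
        <edge label="NK" idref="s1_2"/>
      </nt>
    </nonterminals>
  </graph>
</s>
"""


def read_terminals(xml_text):
    """Return one dict per terminal node, holding its annotation attributes."""
    root = ET.fromstring(xml_text)
    return [dict(t.attrib) for t in root.iter("t")]


def read_constituents(xml_text):
    """Map each nonterminal id to (category, list of child node ids)."""
    root = ET.fromstring(xml_text)
    return {
        nt.get("id"): (nt.get("cat"), [e.get("idref") for e in nt.iter("edge")])
        for nt in root.iter("nt")
    }


terminals = read_terminals(SAMPLE)
constituents = read_constituents(SAMPLE)

for t in terminals:
    print(t["word"], t["lemma"], t["pos"], t["morph"])
```

Keeping terminals and nonterminals in separate structures mirrors the split between token-level annotation (POS, morphology, lemma) and the syntactic structure built on top of it.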

The TIGER Corpus is delivered in two treebank formats:

Both versions of the corpus can be processed by the treebank query tool TIGERSearch, which has also been developed within the TIGER project.

Version 1 of the TIGER Corpus is still available as well. It consists of approx. 700,000 tokens (40,000 sentences). Compared with version 2, it lacks the morphological and lemma information.

In addition to the TIGER Corpus proper, several resources derived from it are available. These are: