An information-theoretic measure to evaluate parsing difficulty across treebanks
Related papers
Measuring Parsing Difficulty Across Treebanks
2008
One of the main difficulties in statistical parsing is associated with the task of choosing the correct parse tree for the input sentence, among all possible parse trees allowed by the adopted grammar model. While this difficulty is usually evaluated by means of empirical performance measures, such as labeled precision and recall, several theoretical measures have also been proposed in the literature, mostly based on the notion of cross-entropy of a treebank.
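For concreteness, a common information-theoretic formulation along these lines (stated here as an illustrative assumption, not necessarily the exact measure developed in this paper) is the per-word cross-entropy of a treebank T under a probabilistic grammar G:

\[
H(T, G) \;=\; -\frac{1}{N} \sum_{t \in T} \log_2 P_G(t),
\]

where N is the number of word tokens in T and P_G(t) is the probability the model assigns to tree t. Under this reading, a treebank with higher cross-entropy is harder to parse with the given grammar model.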
Using the Penn Treebank to evaluate non-treebank parsers
2004
This paper describes a method for conducting evaluations of Treebank and non-Treebank parsers alike against the English language U. Penn Treebank (Marcus et al., 1993) using a metric that focuses on the accuracy of relatively non-controversial aspects of parse structure. Our conjecture is that if we focus on maximal projections of heads (MPH), we are likely to find much broader agreement than if we try to evaluate based on order of attachment.
2005
In the last decade, the Penn Treebank has become the standard data set for evaluating parsers. The fact that most parsers are evaluated solely on this specific data set leaves unanswered the question of how much these results depend on the annotation scheme of the treebank. In this paper, we investigate the influence that different decisions in the annotation schemes of treebanks have on parsing. The investigation compares similar German treebanks, NEGRA and TüBa-D/Z, which are subsequently modified to allow a comparison of the differences. The results show that deleted unary nodes and a flat phrase structure have a negative influence on parsing quality, while a flat clause structure has a positive influence.
Cross Parser Evaluation and Tagset Variation: a French Treebank Study
2009
This paper presents preliminary investigations into the statistical parsing of French, providing a complete evaluation on French data of the main probabilistic lexicalized and unlexicalized parsers originally designed for the Penn Treebank. We adapted the parsers to the two existing French treebanks. To our knowledge, almost all of the results reported here are state-of-the-art for the constituent parsing of French on every available treebank. Regarding the algorithms, the comparisons show that lexicalized parsing models are outperformed by the unlexicalized Berkeley parser. Regarding the treebanks, we observe that, depending on the parsing model, a tagset with specific features has a direct influence on evaluation results. We show that the adapted lexicalized parsers do not share the same sensitivity towards the amount of lexical material used for training, thus questioning the relevance of using only one lexicalized model to study the usefulness of lexicalization for the parsing of French.
Comparing the Influence of Different Treebank Annotations on Dependency Parsing Performance
Language Resources and Evaluation, 2010
As interest grows in the NLP community in developing treebanks for languages other than English, we observe efforts towards evaluating the impact of the different annotation strategies used to represent particular languages or to address particular tasks. This paper contributes to the debate on the influence of the resources used for training and development on the performance of parsing systems. It presents a comparative analysis of the results achieved by three different dependency parsers developed and tested on two treebanks for the Italian language, namely TUT and ISST-TANL, which differ significantly at the level of both corpus composition and adopted dependency representations.
A comparison of evaluation metrics for a broad-coverage stochastic parser
Proceedings of the LREC …, 2002
This paper reports on the use of two distinct evaluation metrics for assessing a stochastic parsing model consisting of a broad-coverage Lexical-Functional Grammar (LFG), an efficient constraint-based parser and a stochastic disambiguation model. The first evaluation metric measures matches of predicate-argument relations in LFG f-structures (henceforth the LFG annotation scheme) against a gold standard of manually annotated f-structures for a subset of the UPenn Wall Street Journal treebank. The other metric maps predicate-argument relations in LFG f-structures to dependency relations (henceforth DR annotations) as proposed by Carroll et al. For evaluation, these relations are matched against Carroll et al.'s gold standard, which was manually annotated on a subset of the Brown corpus. The parser plus stochastic disambiguator gives an F-measure of 79% (LFG) or 73% (DR) on the WSJ test set. This shows that the two evaluation schemes are similar in spirit, although accuracy is impaired systematically by mapping one annotation scheme to the other. A systematic loss of accuracy is also incurred by corpus variation: training the stochastic disambiguation model on WSJ data and testing on Carroll et al.'s Brown corpus data yields an F-score of 74% (DR) for dependency-relation match. A variant of this measure comparable to the measure reported by Carroll et al. yields an F-measure of 76%. We examine divergences between the annotation schemes, aiming at a future improvement of methods for assessing parser quality.
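The 79% and 73% figures above are F-measures over matched relations. As a reminder of how such scores are computed, here is a minimal Python sketch of relation-level precision, recall and F-measure; the (head, label, dependent) triple encoding and the toy relations are illustrative assumptions, not the paper's exact matching procedure.

```python
# Relations are treated as (head, label, dependent) triples; precision,
# recall and F-measure are computed over the sets extracted from a parse
# and from the gold standard. The triple encoding is an illustrative assumption.

def prf(predicted, gold):
    """Return (precision, recall, f_measure) for two sets of relations."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Hypothetical example relations, for illustration only.
gold = {("saw", "subj", "John"), ("saw", "obj", "Mary"), ("saw", "mod", "yesterday")}
pred = {("saw", "subj", "John"), ("saw", "obj", "Mary"), ("Mary", "mod", "yesterday")}
print(prf(pred, gold))  # (0.67, 0.67, 0.67), rounded
```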
Treebank Annotation Schemes and Parser Evaluation for German
Empirical Methods in Natural Language Processing, 2007
Recent studies have focussed on the question of whether less-configurational languages like German are harder to parse than English, or whether the lower parsing scores are an artefact of treebank encoding schemes and data structures, as claimed by Kübler et al. (2006). This claim is based on the assumption that PARSEVAL metrics fully reflect parse quality across treebank encoding schemes.
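For reference, PARSEVAL-style evaluation scores a parse by labelled precision and recall over constituent spans. The sketch below illustrates the standard labelled bracketing idea (an assumption for illustration, not the exact evaluation configuration used in this study); trees are represented as (label, children...) tuples with string leaves.

```python
# Labelled bracket scoring: each constituent is a (label, start, end) span;
# precision/recall/F are computed over the span sets of candidate and gold.

def spans(tree, start=0):
    """Return (set of (label, start, end) constituents, end position)."""
    label, children = tree[0], tree[1:]
    result, pos = set(), start
    for child in children:
        if isinstance(child, str):
            pos += 1                      # terminal: advance one word
        else:
            child_spans, pos = spans(child, pos)
            result |= child_spans
    result.add((label, start, pos))
    return result, pos

def labelled_prf(candidate, gold):
    cand, _ = spans(candidate)
    ref, _ = spans(gold)
    correct = len(cand & ref)
    p, r = correct / len(cand), correct / len(ref)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Hypothetical toy trees: the candidate attaches the verb flatly under S.
gold = ("S", ("NP", "John"), ("VP", "slept"))
cand = ("S", ("NP", "John"), "slept")
print(labelled_prf(cand, gold))  # (1.0, 0.67, 0.8), rounded
```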
The Effect of Treebank Annotation Granularity on Parsing: A Comparative Study
hpsg.fu-berlin.de
Statistical parsers need annotated data for training. Depending on the linguistic information available in the training data, parser performance varies. In this paper, we study the effect of annotation granularity on parsing from three points of view: the lexicon, part-of-speech tags, and phrase structure. The results show that changing the annotation granularity along each of these dimensions has a significant impact on parsing performance.
A test of the leaf-ancestor metric for parse accuracy
Natural Language Engineering, 2003
The GEIG metric for quantifying the accuracy of parsing became influential through the Parseval programme, but many researchers have seen it as unsatisfactory. The Leaf-Ancestor (LA) metric, first developed in the 1980s, arguably comes closer to formalizing our intuitive concept of relative parse accuracy. We support this claim via an experiment that contrasts the performance of alternative metrics on the same body of automatically-parsed examples. The LA metric has the further virtue of providing straightforward indications of the location of parsing errors.
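The LA metric scores each leaf by comparing the sequence of node labels on its path to the root (its "lineage") in the candidate parse against the gold parse. The sketch below illustrates the idea in simplified form; it omits the boundary markers of Sampson's full definition and assumes each word occurs once per sentence, so it is an approximation rather than the published algorithm.

```python
# Leaf-Ancestor scoring (simplified): per-leaf lineage similarity based on a
# normalised edit distance, averaged over the sentence. Trees are
# (label, children...) tuples with string leaves.

def lineages(tree, path=()):
    """Yield (word, lineage) pairs, where a lineage is the label path from root to leaf."""
    label, children = tree[0], tree[1:]
    for child in children:
        if isinstance(child, str):
            yield child, path + (label,)
        else:
            yield from lineages(child, path + (label,))

def edit_distance(a, b):
    """Plain Levenshtein distance between two label sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def la_score(candidate, gold):
    """Average per-leaf lineage similarity (simplification: words assumed unique)."""
    cand = dict(lineages(candidate))
    scores = []
    for word, gold_lin in lineages(gold):
        cand_lin = cand.get(word, ())
        d = edit_distance(cand_lin, gold_lin)
        scores.append(1 - d / (len(cand_lin) + len(gold_lin)))
    return sum(scores) / len(scores)

# Hypothetical toy trees: the candidate attaches the verb flatly under S.
gold = ("S", ("NP", "John"), ("VP", "slept"))
cand = ("S", ("NP", "John"), "slept")
print(round(la_score(cand, gold), 3))  # 0.833
```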
Evaluating the effects of treebank size in a practical application for parsing
Software Engineering, Testing, and Quality Assurance for Natural Language Processing (SETQA-NLP '08), 2008
Natural language processing modules such as part-of-speech taggers, named-entity recognizers and syntactic parsers are commonly evaluated in isolation, under the assumption that artificial evaluation metrics for individual parts are predictive of practical performance of more complex language technology systems that perform practical tasks. Although this is an important issue in the design and engineering of systems that use natural language input, it is often unclear how the accuracy of an end-user application is affected by parameters that affect individual NLP modules. We explore this issue in the context of a specific task by examining the relationship between the accuracy of a syntactic parser and the overall performance of an information extraction system for biomedical text that includes the parser as one of its components. We present an empirical investigation of the relationship between factors that affect the accuracy of syntactic analysis, and how the difference in parse accuracy affects the overall system.