Advances in Probabilistic and Other Parsing Technologies (original) (raw)

Statistical confidence measures for probabilistic parsing

2009

We introduce a formal framework that allows the calculation of new purely statistical confidence measures for parsing, which are estimated from posterior probability of constituents. These measures allow us to mark each constituent of a parse tree as correct or incorrect. Experimental assessment using the Penn Treebank shows favorable results for the classical confidence evaluation metrics: the CER and the ROC curve. We also present preliminar experiments on application of confidence measures to improve parse trees by automatic constituent relabeling.

A Richly Annotated Corpus for Probabilistic Parsing

1992

This paper describes the use of a small but syntactically rich parsed corpus of English in probabilistic parsing. Software has been developed to extract probabilistic systemic-f~nctional grammars (SFGs) from the Polytechnic of Wales Corpus in several formalisms, which could equally well be applied to other parsed corpora. To complement the large probabilistic grammar, we discuss progress in the provision of lexical resources, which range from corpus wordlists to a large lexical database supplemented with word frequencies and SFG categories.

Integration of Data from a Syntactic Lexicon into Generative and Discriminative Probabilistic Parsers

2011

This article evaluates the integration of data extracted from a French syntactic lexicon, the Lexicon-Grammar (Gross, 1994), into a probabilistic parser. We show that by applying clustering methods on verbs of the French Treebank (Abeill'e et al., 2003), we obtain accurate performances on French with a parser based on a Probabilistic Context-Free Grammar (Petrov et al., 2006) and a discriminative parser based on a reranking algorithm (Charniak and Johnson, 2005).

Integration of Data from a Syntactic Lexicon into a Generative and a Discriminative Probabilistic Parsers

International conference on Recent Advances in Natural Language Processing (RANLP'11), 2011

This article evaluates the integration of data extracted from a French syntactic lexicon, the Lexicon-Grammar (Gross, 1994), into a probabilistic parser. We show that by applying clustering methods on verbs of the French Treebank (Abeill'e et al., 2003), we obtain accurate performances on French with a parser based on a Probabilistic Context-Free Grammar (Petrov et al., 2006) and a discriminative parser based on a reranking algorithm (Charniak and Johnson, 2005).

Automated extraction of Tree-Adjoining Grammars from treebanks

Natural Language Engineering, 2005

There has been a contemporary surge of interest in the application of stochastic models of parsing. The use of tree-adjoining grammar (TAG) in this domain has been relatively limited due in part to the unavailability, until recently, of large-scale corpora hand-annotated with TAG structures. Our goals are to develop inexpensive means of generating such corpora and to demonstrate their applicability to stochastic modeling. We present a method for automatically extracting a linguistically plausible TAG from the Penn Treebank. Furthermore, we also introduce labor-inexpensive methods for inducing higher-level organization of TAGs. Empirically, we perform an evaluation of various automatically extracted TAGs and also demonstrate how our induced higher-level organization of TAGs can be used for smoothing stochastic TAG models.

Feature Forest Models for Probabilistic HPSG Parsing

Computational Linguistics, 2008

Probabilistic modeling of lexicalized grammars is difficult because these grammars exploit complicated data structures, such as typed feature structures. This prevents us from applying common methods of probabilistic modeling in which a complete structure is divided into substructures under the assumption of statistical independence among sub-structures. For example, part-of-speech tagging of a sentence is decomposed into tagging of each word, and CFG parsing is split into applications of CFG rules. These methods have relied on the structure of the target problem, namely lattices or trees, and cannot be applied to graph structures including typed feature structures. This article proposes the feature forest model as a solution to the problem of probabilistic modeling of complex data structures including typed feature structures. The feature forest model provides a method for probabilistic modeling without the independence assumption when probabilistic events are represented with feature forests. Feature forests are generic data structures that represent ambiguous trees in a packed forest structure. Feature forest models are maximum entropy models defined over feature forests. A dynamic programming algorithm is proposed for maximum entropy estimation without unpacking feature forests. Thus probabilistic modeling of any data structures is possible when they are represented by feature forests. This article also describes methods for representing HPSG syntactic structures and predicate-argument structures with feature forests. Hence, we describe a complete strategy for developing probabilistic models for HPSG parsing. The effectiveness of the proposed methods is empirically evaluated through parsing experiments on the Penn Treebank, and the promise of applicability to parsing of real-world sentences is discussed.

Probabilistic parsing strategies

Journal of the ACM (JACM), 2006

We present new results on the relation between purely symbolic contextfree parsing strategies and their probabilistic counter-parts. Such parsing strategies are seen as constructions of push-down devices from grammars. We show that preservation of probability distribution is possible under two conditions, viz. the correct-prefix property and the property of strong predictiveness. These results generalize existing results in the literature that were obtained by considering parsing strategies in isolation. From our general results we also derive negative results on so-called generalized LR parsing.

An efficient probabilistic context-free parsing algorithm that computes prefix probabilities

1995

We describe an extension of Earley's parser for stochastic context-free grammars that computes the following quantities given a stochastic context-free grammar and an input string: a) probabilities of successive prefixes being generated by the grammar; b) probabilities of substrings being generated by the nonterminals, including the entire string being generated by the grammar; c) most likely (Viterbi) parse of the string; d) posterior expected number of applications of each grammar production, as required for reestimating rule probabilities. Probabilities (a) and (b) are computed incrementally in a single left-to-right pass over the input. Our algorithm compares favorably to standard bottom-up parsing methods for SCFGs in that it works efficiently on sparse grammars by making use of Earley's top-down control structure. It can process any context-free rule format without conversion to some normal form, and combines computations for (a) through (d) in a single algorithm. Finally, the algorithm has simple extensions for processing partially bracketed inputs, and for finding partial parses and their likelihoods on ungrammatical inputs.

An information-theoretic measure to evaluate parsing difficulty across treebanks

ACM Transactions on Speech and Language Processing, 2013

With the growing interest in statistical parsing, special attention has recently been devoted to the problem of comparing different treebanks to assess which languages or domains are more difficult to parse relative to a given model. A common methodology for comparing parsing difficulty across treebanks is based on the use of the standard labeled precision and recall measures. As an alternative, in this article we propose an information-theoretic measure, called the expected conditional cross-entropy (ECC). One important advantage with respect to standard performance measures is that ECC can be directly expressed as a function of the parameters of the model. We evaluate ECC across several treebanks for English, French, German, and Italian, and show that ECC is an effective measure of parsing difficulty, with an increase in ECC always accompanied by a degradation in parsing accuracy.