Parse Selection on the Redwoods Corpus: 3rd Growth Results

Stochastic HPSG Parse Disambiguation using the Redwoods Corpus

Research on Language and Computation, 2005

This article details our experiments on HPSG parse disambiguation, based on the Redwoods treebank. Using existing and novel stochastic models, we evaluate the usefulness of different information sources for disambiguation – lexical, syntactic, and semantic. We perform careful comparisons of generative and discriminative models using equivalent features and show the consistent advantage of discriminatively trained models. Our best system performs at over 76% sentence exact match accuracy.
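To make the generative/discriminative contrast concrete, here is a minimal sketch of a discriminative (conditional log-linear) parse ranker of the kind such comparisons involve. The feature names and weights are invented for illustration; this is not the authors' implementation.

```python
import math
from collections import defaultdict

def score(weights, features):
    """Linear score of one candidate parse: sum of weight * feature value."""
    return sum(weights[f] * v for f, v in features.items())

def rank(weights, candidates):
    """Rank the candidate parses of one sentence by conditional log-probability.

    candidates: list of feature dicts, one per candidate parse.
    P(parse | sentence) = exp(score) / sum_over_candidates exp(score).
    """
    scores = [score(weights, feats) for feats in candidates]
    log_z = math.log(sum(math.exp(s) for s in scores))  # normalizer over this sentence only
    return sorted(((s - log_z, i) for i, s in enumerate(scores)), reverse=True)

# Invented features and weights for illustration only.
weights = defaultdict(float, {"vp_attach": 1.2, "np_attach": -0.3})
parses = [{"vp_attach": 1.0}, {"np_attach": 1.0}]
print(rank(weights, parses))  # highest-probability candidate first
```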

Using treebanking discriminants as parse disambiguation features

2009

This paper presents a novel approach that incorporates the fine-grained treebanking decisions made by human annotators as discriminative features for automatic parse disambiguation. To the best of our knowledge, this is the first work that exploits treebanking decisions for this task. The advantage of this approach is that it puts the annotators' human judgements to direct use. The paper presents comparative analyses of the performance of discriminative models built using treebanking decisions and state-of-the-art features.
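A hedged sketch of the core idea, using hypothetical discriminant names and parse representations (the paper's actual discriminant inventory comes from its treebanking environment): each fine-grained annotator decision becomes a binary feature of a candidate parse.

```python
def discriminant_features(parse, discriminants):
    """Turn treebanking-style discriminants into binary parse features.

    discriminants: hypothetical (name, predicate) pairs, one per fine-grained
    accept/reject decision an annotator could make over candidate parses.
    """
    return {"disc=" + name: 1.0 for name, holds in discriminants if holds(parse)}

# Hypothetical discriminants: does the parse use a given rule or lexical entry?
discriminants = [
    ("subj_head_rule", lambda p: "subj-head" in p["rules"]),
    ("bank_noun_entry", lambda p: "bank_n" in p["lexemes"]),
]
parse = {"rules": {"subj-head"}, "lexemes": {"bank_n"}}
print(discriminant_features(parse, discriminants))
# {'disc=subj_head_rule': 1.0, 'disc=bank_noun_entry': 1.0}
```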

Probabilistic disambiguation models for wide-coverage HPSG parsing

Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics - ACL '05, 2005

This paper reports the development of log-linear models for disambiguation in wide-coverage HPSG parsing. Estimating log-linear models carries a high computational cost, especially with wide-coverage grammars. Using techniques to reduce the estimation cost, we trained the models on 20 sections of the Penn Treebank. A series of experiments empirically evaluated the estimation techniques, and also examined the performance of the disambiguation models on the parsing of real-world sentences.
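For reference, this is the standard conditional log-linear form such models take (the general textbook setup, not equations copied from the paper). The gradient's expectation term must be computed over every parse in T(s) that the grammar licenses, which is exactly what makes estimation costly for wide-coverage grammars and motivates the cost-reduction techniques.

```latex
% Conditional log-linear model over the candidate parses T(s) of sentence s,
% and the gradient of the conditional log-likelihood L with gold parses t*.
\[
p_\lambda(t \mid s) = \frac{\exp\bigl(\sum_i \lambda_i f_i(t, s)\bigr)}
                           {\sum_{t' \in T(s)} \exp\bigl(\sum_i \lambda_i f_i(t', s)\bigr)},
\qquad
\frac{\partial L(\lambda)}{\partial \lambda_i} =
  \sum_{(s, t^{*})} \Bigl( f_i(t^{*}, s)
    - \mathbb{E}_{p_\lambda(t \mid s)}\bigl[ f_i(t, s) \bigr] \Bigr).
\]
```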

The TreeBanker: A tool for supervised training of parsed corpora

Proceedings of the Workshop on Computational …, 1997

I describe the TreeBanker, a graphical tool for the supervised training involved in domain customization of the disambiguation component of a speech- or language-understanding system. The TreeBanker presents a user, who need not be a ...

Multi-level disambiguation grammar inferred from English corpus, treebank, and dictionary

1993

In this paper we will show that Grammatical Inference is applicable to Natural Language Processing. Given the wide and complex range of structures appearing in an unrestricted Natural Language like English, full Grammatical Inference, yielding a comprehensive syntactic and semantic definition of English, is too much to hope for at present. Instead, we focus on techniques for dealing with ambiguity resolution by probabilistic ranking; this does not require a full formal Chomskyan grammar. We give a short overview of the different levels and methods being investigated at CCALAS for probabilistic ranking of candidates in ambiguous English input.

Grammatical Inference from English corpora. An earlier title for this paper was "Overview of grammar acquisition research at CCALAS, Leeds University", but this was modified to avoid the impression of an incoherent set of research strands with no integrated, focussed common techniques or applications. The researchers in our group have no detailed development plan imposed 'from above', but are working on independent PhD programmes; however, there are common theoretical tenets, ideas, and potential applications linking individual projects. In fact, preparing for the Colloquia on Grammatical Inference has helped us to appreciate these overarching, linking themes, as we realised that the definitions stated in the Programme clearly applied to our own work at CCALAS: 'Grammatical Inference ... has suffered from the lack of a focused research community ... Simply stated, the grammatical inference problem is to learn an efficient description that captures the essence of a set of data. This description may be used subsequently to classify data, or to generate further examples of similar data.'

The data in our case is unrestricted English input, as exemplified by a Corpus or large collection of text samples. This poses a much harder challenge to Grammatical Inference than artificial languages, or selected examples of well-formed English sentences. The range of lexical items and grammatical constructs appearing in an unrestricted English Corpus is very large; and the problem is not just one of scale. The Corpus-based approach carries with it a blurring of the classical Chomskyan distinction between 'grammatical' and 'ungrammatical' English sentences. Indeed, [Sampson 87] went to the extreme of positing that there is NO boundary between grammatical and ungrammatical sentences in English; this might seem to imply that it is hopeless and even invalid to attempt to infer a grammar for English. Furthermore, the Corpus-based approach eschews the use of 'intuitively constructed' examples in training: a learning algorithm should be trained with 'real' sentences from a Corpus. It would seem to follow from this that we are also proscribed from artificially constructing negative counterexamples for our learning algorithms: we cannot guarantee that such counterexamples are truly illegal.

Elementary Trees For Syntactic And Statistical Disambiguation

2000

In this paper we argue in favour of an integration between statistically and syntactically based parsing, where syntax is intended in terms of shallow parsing with elementary trees. None of the statistically based analyses produces an accuracy level comparable to the one obtained by means of linguistic rules. Of course, their data refer strictly to English, with the exception of [2, 3, 4]. As to Italian, purely statistically based approaches are inefficient, basically due to the great sparsity of the tag distribution: 50% or less of tags are unambiguous when punctuation is subtracted from the total count, as reported by . We discuss our general statistical and syntactic framework and then report on an experiment with four different setups. The first two approaches are bottom-up driven, i.e. driven by local tag combinations: A. statistics-only tag disambiguation; B. statistics plus syntactic biases. The second two approaches are top-down driven, i.e. driven by syntactic structural cues in terms of elementary trees: C. syntactic-driven disambiguation with no statistics; D. syntactic-driven disambiguation with conditional probabilities computed on syntactic constituents. In a preliminary experiment with an automatic tagger, we obtained 99% accuracy on the training set and 98% on the test set using the combined approaches; data derived from statistical tagging are well below 95% even on the training set, and the same applies to syntactic tagging.
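As a toy illustration of setup B (statistics plus syntactic biases), with invented probabilities and tags rather than the paper's Italian data: corpus-derived tag probabilities are filtered by a syntactic admissibility check before the statistical choice is made.

```python
def disambiguate(tag_probs, syntactically_licensed):
    """Combine corpus statistics with a syntactic bias when picking a tag.

    tag_probs: dict tag -> P(tag | word) from corpus counts (toy numbers).
    syntactically_licensed: tags allowed by the local syntactic context.
    Tags the syntax rules out are discarded before the statistical choice.
    """
    allowed = {t: p for t, p in tag_probs.items() if t in syntactically_licensed}
    pool = allowed or tag_probs  # fall back to statistics alone if syntax over-filters
    return max(pool, key=pool.get)

# Toy ambiguity in the style of Italian "la": determiner vs. clitic pronoun.
print(disambiguate({"DET": 0.7, "PRON": 0.3}, {"PRON"}))  # -> PRON
```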

A Model of Syntactic Disambiguation Based on Lexicalized Grammars

This paper presents a new approach to syntactic disambiguation based on lexicalized grammars. While existing disambiguation models decompose the probability of parsing results into that of primitive dependencies of two words, our model selects the most probable parsing result from a set of candidates allowed by a lexicalized grammar. Since parsing results given by the lexicalized grammar cannot be decomposed into independent sub-events, we apply a maximum entropy model for feature forests, which allows probabilistic modeling without the independence assumption. Our approach provides a general method of producing a consistent probabilistic model of parsing results given by lexicalized grammars.
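A compact sketch of the computation that feature forests make tractable: the partition function over a packed forest of parses, computed by an inside pass instead of enumerating complete parses. The forest encoding and feature names here are hypothetical, not the paper's; a practical version would also memoize shared nodes.

```python
import math

def inside(node, weights, forest):
    """Inside (sum) weight of an OR-node in a packed feature forest.

    forest maps a node id to its alternatives; each alternative is a pair
    (local_features, child_node_ids).  The partition function is the inside
    weight of the root, so no complete parse is ever enumerated explicitly.
    """
    total = 0.0
    for local_feats, children in forest[node]:
        w = math.exp(sum(weights.get(f, 0.0) * v for f, v in local_feats.items()))
        for child in children:
            w *= inside(child, weights, forest)
        total += w
    return total

# Hypothetical two-way ambiguity packed into a tiny forest.
forest = {
    "S":  [({"rule=subj-head": 1.0}, ["VP"]), ({"rule=filler-head": 1.0}, ["VP"])],
    "VP": [({"rule=head-comp": 1.0}, [])],
}
print(inside("S", {"rule=subj-head": 0.5}, forest))  # partition function Z
```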

Effective statistical models for syntactic and semantic disambiguation

2005

My thesis work would not have been possible without the help of my advisor, other collaborators, and fellow students. I am especially fortunate to have been advised by Chris Manning. Firstly, I am grateful to him for teaching me almost everything I know about doing research and being part of the academic community. Secondly, I deeply appreciate his constant support and advice on many levels. The work in this thesis was profoundly shaped by numerous insightful discussions with him. I am also very happy to have been able to collaborate with Andrew Ng on random walk models for word dependency distributions. It has been a source of inspiration to interact with someone having such far-reaching research goals and being an endless source of ideas. He also gave me valuable advice on multiple occasions. Many thanks to Dan Jurafsky for initially bringing semantic role labeling to my attention as an interesting domain my research would fit in, contributing useful ideas, and helping with my dissertation on short notice. Thanks also to the other members of my thesis defense committee, Trevor Hastie and Francine Chen. I am also grateful to Dan Flickinger and Stephan Oepen for numerous discussions on my work in HPSG parsing.

Being part of the NLP and the greater AI group at Stanford has been extremely stimulating and fun. I am very happy to have shared an office with Dan Klein and Roger Levy for several years, and with Bill McCartney for one year. Dan taught me, among other things, to always aim to win when entering a competition, and to understand things from first principles. Roger was an example of how to be an excellent researcher while maintaining a balanced life with many outside interests. I will miss the heated discussions about research and the fun at NLP lunch. And thanks to Jenny for the Quasi-Newton numerical optimization code. Thanks to Aria Haghighi for collaborating on semantic role labeling models and for staying up until very early in the morning to finish writing up papers.

I would also like to take the chance to express my gratitude to my undergraduate advisor Galia Angelova and my high-school math teacher Georgi Nikov. Although they did not contribute directly to the current effort, I wouldn't be writing this without them. I am indebted in many ways to Penka Markova, most importantly for being my friend for many years and for teaching me optimism and self-confidence, and additionally for collaborating with me on the HPSG parsing work. I will miss the foosball games and green tea breaks with Rajat Raina and Mayur Naik. Thanks also to Jason Townsend and Haiyan Liu for being my good friends. Thanks to Galen Andrew for making my last year at Stanford the happiest. Finally, many thanks to my parents Diana and Nikola, my sister Maria, and my nieces Iana and Diana, for their support, and for bringing me great peace and joy. I gratefully acknowledge the financial support through the ROSIE project funded by Scottish Enterprise under the Stanford-Edinburgh link programme and through ARDA's Advanced Question Answering for Intelligence (AQUAINT) program.

Learning to Disambiguate Syntactic Relations

Linguistik online, 2003

Natural Language is highly ambiguous, on every level. This article describes a fast broad-coverage state-of-the-art parser that uses a carefully handwritten grammar and probability-based machine learning approaches on the syntactic level. It is shown in detail which statistical learning models based on Maximum-Likelihood Estimation (MLE) can support a highly developed linguistic grammar in the disambiguation process.
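A minimal illustration of the MLE idea the article builds on, with invented treebank counts: relative frequencies over the readings that the handwritten grammar licenses, used to rank them.

```python
def mle(counts):
    """Maximum-Likelihood Estimation: relative frequencies from treebank counts."""
    total = sum(counts.values())
    return {outcome: c / total for outcome, c in counts.items()}

# Invented counts for a PP-attachment ambiguity the grammar licenses both ways;
# the MLE probabilities then rank the readings the handwritten grammar produces.
print(mle({"attach=verb": 30, "attach=noun": 10}))
# {'attach=verb': 0.75, 'attach=noun': 0.25}
```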