Adapting a WSJ-trained parser to grammatically noisy text
Related papers
A robust parser based on syntactic information
Proceedings of the seventh …, 1995
An extragrammatical sentence is one that a normal parser fails to analyze. It is important to recover such sentences using only syntactic information, although recovery results are better when semantic factors are also considered. A general algorithm for least-errors recognition, based only on syntactic information, was proposed by G. Lyon to deal with extragrammaticality. We extended this algorithm to recover extragrammatical sentences into grammatical ones in running text. Our robust parser with its recovery mechanism, an extended general algorithm for least-errors recognition, can be easily scaled up and modified because it utilizes only syntactic information. To improve this robust parser we propose heuristics derived from analysis of the Penn Treebank corpus. The experimental results show 68%~77% accuracy in error recovery.
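A minimal sketch of the least-errors idea behind this kind of recovery, under strong simplifying assumptions: instead of integrating edit costs into a chart parser as Lyon does, we treat the "grammar" as a small set of licensed POS-tag sequences and recover an input by finding the candidate reachable with the fewest insert/delete/substitute operations. The tag sequences and the example input are hypothetical.

```python
# Sketch: least-errors recovery as minimal edit distance to a
# licensed tag sequence. A real least-errors recognizer folds these
# edit costs into parsing itself; this only illustrates the scoring.

def edit_distance(a, b):
    """Classic Levenshtein distance over token sequences."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[m][n]

# Hypothetical licensed tag sequences standing in for a real grammar.
GRAMMATICAL = [
    ("DT", "NN", "VBZ", "DT", "NN"),
    ("DT", "NN", "VBZ", "JJ"),
    ("PRP", "VBD", "DT", "NN"),
]

def recover(tags):
    """Return the licensed sequence reachable with the fewest errors."""
    return min(GRAMMATICAL, key=lambda g: edit_distance(tags, g))

# An extragrammatical input (e.g. a dropped determiner) recovers to
# the closest licensed sequence.
print(recover(("DT", "NN", "VBZ", "NN")))
```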
Coping with problems in grammars automatically extracted from Treebanks
COLING-02 Workshop on Grammar Engineering and Evaluation, 2002
We report in this paper on an experiment on automatic extraction of a Tree Adjoining Grammar from the WSJ corpus of the Penn Treebank. We use an automatic tool developed by (Xia, 2001), properly adapted to our particular needs. Rather than addressing general aspects of automatic extraction, we focus on the problems we encountered in extracting a linguistically (and computationally) sound grammar, and on approaches to handling them.
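TAG extraction as in (Xia, 2001) is considerably more involved than this, but as a much simpler illustration of pulling a grammar out of treebank brackets, the sketch below reads a Penn Treebank-style tree and emits the CFG productions it contains. The tree string and label set are illustrative assumptions.

```python
# Sketch: extract CFG productions from a bracketed treebank tree.
# This stands in for (not reproduces) full TAG extraction.

def parse_tree(s):
    """Parse '(S (NP (DT the) ...))' into nested (label, children)
    tuples; a preterminal's children are bare word strings."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0
    def walk():
        nonlocal pos
        assert tokens[pos] == "("; pos += 1
        label = tokens[pos]; pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(walk())
            else:
                children.append(tokens[pos]); pos += 1
        pos += 1  # consume ")"
        return (label, children)
    return walk()

def productions(node, rules):
    label, children = node
    if all(isinstance(c, str) for c in children):
        return  # preterminal (tag over word): skip lexical rules here
    rhs = tuple(c[0] for c in children if isinstance(c, tuple))
    rules.add((label, rhs))
    for c in children:
        if isinstance(c, tuple):
            productions(c, rules)

rules = set()
tree = parse_tree("(S (NP (DT the) (NN cat)) (VP (VBZ sleeps)))")
productions(tree, rules)
for lhs, rhs in sorted(rules):
    print(lhs, "->", " ".join(rhs))
```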
Ensembling Dependency Parsers for Treebank Error Detection
This paper describes a statistical approach to detecting annotation errors in dependency treebanks. The approach is based on ensembling state-of-the-art dependency parsers. The motivation is that if a parse favoured by the parsers contradicts the human annotation, the contradiction either calls into question the consistency of the corpora on which the parsers were trained, or indicates that the human annotation is an error. We also prioritize the detected errors based on their confidence scores. The reported results (F-score) of our approach on the Urdu and Hindi treebanks are 41.20% and 69.37% respectively.
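A minimal sketch of this ensemble idea: where several parsers agree on a head attachment but the treebank disagrees, flag the token, scored by how many parsers concur. The head arrays below are hypothetical stand-ins; a real setup would run trained dependency parsers over the treebank.

```python
# Sketch: flag treebank tokens where a parser ensemble contradicts
# the gold annotation, prioritized by agreement-based confidence.

from collections import Counter

def flag_errors(gold_heads, parser_outputs, min_agree=2):
    """gold_heads: gold head index per token.
    parser_outputs: one head-index list per parser.
    Returns (token_index, ensemble_head, confidence) candidates."""
    candidates = []
    for i, gold in enumerate(gold_heads):
        votes = Counter(p[i] for p in parser_outputs)
        head, count = votes.most_common(1)[0]
        if head != gold and count >= min_agree:
            candidates.append((i, head, count / len(parser_outputs)))
    # prioritize by confidence, mirroring the paper's score ranking
    return sorted(candidates, key=lambda c: -c[2])

gold = [2, 0, 2, 2]            # hypothetical gold head indices
parsers = [[2, 0, 2, 3],       # three hypothetical parser outputs
           [2, 0, 2, 3],
           [2, 0, 2, 2]]
print(flag_errors(gold, parsers))   # token 3 is flagged
```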
Building a large annotated corpus of English: the Penn Treebank
Computational Linguistics, 1994
There is a growing consensus that significant, rapid progress can be made in both text understanding and spoken language understanding by investigating those phenomena that occur most centrally in naturally occurring unconstrained materials and by attempting to automatically extract information about language from very large corpora. Such corpora are beginning to serve as important research tools for investigators in natural language processing, speech recognition, and integrated spoken language systems, as well as in theoretical linguistics. Annotated corpora promise to be valuable for enterprises as diverse as the automatic construction of statistical models for the grammar of the written and the colloquial spoken language, the development of explicit formal theories of the differing grammars of writing and speech, the investigation of prosodic phenomena in speech, and the evaluation and comparison of the adequacy of parsing models.
A classifier-based parser with linear run-time complexity
Proceedings of the Ninth International Workshop …, 2005
We present a classifier-based parser that produces constituent trees in linear time. The parser uses a basic bottom-up shift-reduce algorithm, but employs a classifier to determine parser actions instead of a grammar. This can be seen as an extension of a deterministic dependency parser to full constituent parsing. We show that, with an appropriate feature set used in classification, a very simple one-path greedy parser can perform at the same level of accuracy as more complex parsers. We evaluate our parser on section 23 of the WSJ section of the Penn Treebank, and obtain precision and recall of 87.54% and 87.61%, respectively.
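A minimal sketch of the classifier-driven shift-reduce loop this abstract describes: at each step a classifier, given the stack and queue, chooses SHIFT or REDUCE-X instead of consulting a grammar. The "classifier" here is a hand-written stub over a toy tag set; the paper's approach trains a real classifier on treebank-derived action sequences.

```python
# Sketch: greedy one-path shift-reduce constituent parsing where a
# (stubbed) classifier picks the next action.

def classify(stack, queue):
    """Stub standing in for a trained classifier over stack/queue
    features. Returns ('SHIFT',) or ('REDUCE', label, arity)."""
    top_labels = tuple(n[0] for n in stack[-2:])
    if top_labels == ("DT", "NN"):
        return ("REDUCE", "NP", 2)
    if top_labels == ("NP", "VP"):
        return ("REDUCE", "S", 2)
    if top_labels[-1:] == ("VBZ",) and not queue:
        return ("REDUCE", "VP", 1)
    return ("SHIFT",)

def parse(tagged):
    """Greedily parse [(word, tag), ...] into a single tree."""
    stack, queue = [], list(tagged)
    while queue or len(stack) > 1:
        action = classify(stack, queue)
        if action[0] == "SHIFT":
            word, tag = queue.pop(0)
            stack.append((tag, [word]))
        else:
            _, label, arity = action
            children = stack[-arity:]
            del stack[-arity:]
            stack.append((label, children))
    return stack[0]

print(parse([("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")]))
```

Because the loop is greedy and one-path, each token is shifted once and each reduce consumes stack material, which is what gives this family of parsers its linear run time.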
A Dependency-based Analysis of Treebank Annotation Errors
2011
In this paper, we investigate errors in syntax annotation with the Turku Dependency Treebank, a recently published treebank of Finnish, as study material. This treebank uses the Stanford Dependency scheme as its syntax representation, and its published data contains all data created in the full double annotation as well as timing information, both of which are necessary for this study. First, we examine which syntactic structures are the most error-prone for human annotators, and compare these results to those of a baseline automatic parser. We find that annotation decisions involving highly semantic distinctions, as well as certain morphological ambiguities, are especially difficult for both human annotators and the parser. Second, we train an automatic system that offers for inspection sentences ordered by their likelihood of containing errors. We find that the system achieves a performance that is clearly superior to the random baseline: for instance, by inspecting 10% of all sen...
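A minimal sketch of the inspection-budget evaluation this abstract describes, on entirely synthetic data: sentences are ordered by a model's error score, and we compare the errors caught in the top 10% against a random ordering. The score distribution and error rate are made-up stand-ins for a trained ranking system.

```python
# Sketch: compare an error-likelihood ranking against a random
# baseline at a fixed inspection budget (10% of sentences).

import random

rng = random.Random(1)
N = 1000
# Synthetic data: ~10% of sentences carry an annotation error, and
# the ranker's score is informative but noisy.
data = []
for _ in range(N):
    has_error = rng.random() < 0.10
    score = rng.random() + (0.8 if has_error else 0.0)
    data.append((score, has_error))

def errors_found(ordering, budget):
    return sum(has_error for _, has_error in ordering[:budget])

budget = N // 10                      # inspect 10% of all sentences
ranked = sorted(data, key=lambda d: -d[0])
shuffled = data[:]
rng.shuffle(shuffled)
total = sum(h for _, h in data)
print(f"ranked: {errors_found(ranked, budget)}/{total} errors found")
print(f"random: {errors_found(shuffled, budget)}/{total} errors found")
```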
Shallow parsing using noisy and non-stationary training material
The Journal of Machine Learning Research, 2002
Shallow parsers are usually assumed to be trained on noise-free material drawn from the same distribution as the testing material. However, when the training set is noisy or else drawn from a different distribution, performance may be degraded. Using the parsed Wall Street Journal, we investigate the performance of four shallow parsers (maximum entropy, memory-based learning, N-grams and ensemble learning) trained using various types of artificially noisy material. Our first set of results shows that shallow parsers are surprisingly robust to synthetic noise, with performance gradually decreasing as the rate of noise increases. Further results show that no single shallow parser performs best in all noise situations. Final results show that simple, parser-specific extensions can improve noise-tolerance. Our second set of results addresses the question of whether naturally occurring disfluencies undermine performance more than a change in distribution does. Results using the parsed Switchboard corpus suggest that, although naturally occurring disfluencies might harm performance, differences in distribution between the training set and the testing set are more significant.
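A minimal sketch of the kind of synthetic corruption used to build "artificially noisy" training material: flip a given fraction of IOB chunk labels to a random different label. The label set, rates, and tiny corpus are illustrative; the paper trains four shallow parsers on material corrupted along these lines and tracks the gradual performance drop.

```python
# Sketch: inject synthetic label noise into IOB chunk tags at a
# controlled rate, then measure agreement with the clean labels.

import random

LABELS = ["B-NP", "I-NP", "B-VP", "I-VP", "O"]

def corrupt(tags, rate, rng):
    """Return a copy of `tags` with ~rate of labels randomized."""
    noisy = []
    for t in tags:
        if rng.random() < rate:
            noisy.append(rng.choice([l for l in LABELS if l != t]))
        else:
            noisy.append(t)
    return noisy

rng = random.Random(0)
clean = ["B-NP", "I-NP", "B-VP", "O", "B-NP"] * 200
for rate in (0.0, 0.1, 0.3, 0.5):
    noisy = corrupt(clean, rate, rng)
    acc = sum(a == b for a, b in zip(clean, noisy)) / len(clean)
    print(f"noise rate {rate:.1f}: label agreement {acc:.2f}")
```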
This paper compares two techniques for robust parsing of extra-grammatical natural language that might be of interest in large scale Textual Data Analysis applications. The first one returns a "correct" derivation for any extra-grammatical sentence by generating the finest corresponding most probable optimal maximum coverage. The second one extends the initial grammar by adding relaxed grammar rules in a controlled manner. Both techniques use a stochastic parser that selects a "best" solution among multiple analyses. The techniques were tested on the ATIS and Susanne corpora and experimental results, as well as conclusions on performance comparison, are provided.
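A minimal sketch of the second technique, controlled rule relaxation, under assumptions of my own: for each grammar rule, add penalized variants with one right-hand-side symbol made optional, so extragrammatical input can still be covered while a stochastic parser prefers the unpenalized originals. The toy grammar and log-space penalty scheme are illustrative, not the paper's.

```python
# Sketch: extend a grammar with penalized one-symbol-deletion
# variants of each rule, so relaxation stays controlled.

GRAMMAR = {
    ("S",  ("NP", "VP")):  0.0,   # log-space penalty 0 = original rule
    ("NP", ("DT", "NN")):  0.0,
    ("VP", ("VBZ", "NP")): 0.0,
}

def relax(grammar, penalty=-2.0):
    """Add a penalized variant of every rule for each droppable
    right-hand-side symbol; keep the best score per variant."""
    relaxed = dict(grammar)
    for (lhs, rhs), score in grammar.items():
        if len(rhs) < 2:
            continue
        for i in range(len(rhs)):
            variant = (lhs, rhs[:i] + rhs[i + 1:])
            relaxed[variant] = max(relaxed.get(variant, penalty + score),
                                   penalty + score)
    return relaxed

for (lhs, rhs), score in sorted(relax(GRAMMAR).items()):
    print(f"{lhs} -> {' '.join(rhs)}   score {score}")
```

A stochastic parser then treats these scores as rule weights, so a relaxed rule only wins when no derivation built purely from original rules covers the input.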
How To Detect Grammatical Errors In A Text Without Parsing It
1987
The Constituent Likelihood Automatic Word-tagging System (CLAWS) was originally designed for the low-level grammatical analysis of the million-word LOB Corpus of English text samples. CLAWS does not attempt a full parse, but uses a first-order Markov model of language to assign word-class labels to words. CLAWS can be modified to detect grammatical errors, essentially by flagging unlikely word-class transitions in the input text. This may seem to be an intuitively implausible and theoretically inadequate model of natural language syntax, but nevertheless it can successfully pinpoint most grammatical errors in a text. Several modifications to CLAWS have been explored. The resulting system cannot detect all errors in typed documents; but then neither do far more complex systems that attempt a full parse, requiring much greater computation.
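A minimal sketch of this CLAWS-style idea: estimate first-order tag-transition probabilities from correct text, then flag transitions in new text that fall below a threshold, with no parsing at all. The tiny training data, the smoothing, and the threshold are illustrative assumptions, not CLAWS's actual estimates.

```python
# Sketch: flag unlikely word-class (tag) transitions using a
# first-order Markov model trained on correct tag sequences.

from collections import Counter

def train(tag_sequences):
    """Estimate P(tag_i | tag_{i-1}) with add-one smoothing."""
    bigrams, unigrams, tags = Counter(), Counter(), set()
    for seq in tag_sequences:
        for prev, cur in zip(seq, seq[1:]):
            bigrams[(prev, cur)] += 1
            unigrams[prev] += 1
            tags.update((prev, cur))
    V = len(tags)
    return lambda p, c: (bigrams[(p, c)] + 1) / (unigrams[p] + V)

# Hypothetical "correct" training tag sequences.
train_seqs = [
    ["DT", "NN", "VBZ", "DT", "NN"],
    ["DT", "JJ", "NN", "VBZ", "RB"],
    ["PRP", "VBD", "DT", "NN"],
]
prob = train(train_seqs)

def flag(seq, threshold=0.1):
    """Report improbable adjacent tag pairs as likely errors."""
    return [(i, prev, cur)
            for i, (prev, cur) in enumerate(zip(seq, seq[1:]))
            if prob(prev, cur) < threshold]

# 'DT VBZ' never occurs in training, so that transition is flagged.
print(flag(["DT", "VBZ", "DT", "NN"]))
```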