Statistical filtering and subcategorization frame acquisition (original) (raw)

Automatic acquisition of a large subcategorization dictionary from corpora

Proceedings of the 31st annual meeting on Association for Computational Linguistics -, 1993

This paper presents a new method for producing a dictionary of subcategorization frames from unlabelled text corpora. It is shown that statistical filtering of the results of a finite state parser running on the output of a stochastic tagger produces high quality results, despite the error rates of the tagger and the parser. Further, it is argued that this method can be used to learn all subcategorization frames, whereas previous methods are not extensible to a general solution to the problem.

Generalizing Subcategorization Frames Acquired from Corpora Using Lexicalized Grammars

2004

This paper presents a method of improving the quality of subcategorization frames (SCFs) acquired from corpora in order to augment a lexicon of a lexicalized grammar. We first estimate a confidence value that a word can have each SCF, and create an SCF confidence-value vector for each word. Since the SCF confidence vectors obtained from the lexicon of the target grammar involve co-occurrence tendency among SCFs for words, we can improve the quality of the acquired SCFs by clustering vectors obtained from the acquired SCF lexicon and the lexicon of the target grammar. We apply our method to SCFs acquired from corpora by using a subset of the SCF lexicon of the XTAG English grammar. A comparison between the resulting SCF lexicon and the rest of the lexicon of the XTAG English grammar reveals that we can achieve higher precision and recall compared to naive frequency cutoff .

Automatic acquisition of subcategorization frames from tagged text

Proceedings of the workshop on Speech and Natural Language - HLT '91, 1991

This paper describes an implemented program that takes a raw, untagged text corpus as its only input (no open-class dictionary) and generates a partial list of verbs occurring in the text and the subcategorization frames (SFs) in which they occur. Verbs are detected by a novel technique based on the Case Filter of Rouvret and Vergnaud (1980). The completeness of the output list increases monotonically with the total number of occurrences of each verb in the corpus. False positive rates are one to three percent of observations. Five SFs are currently detected and more are planned. Ultimately, I expect to provide a large SF dictionary to the NLP community and to train dictionaries for specific corpora.

The automatic acquisition of frequencies of verb subcategorization frames from tagged corpora

1993

We describe a mechanism for automatically acquiring verb subcategorization frames and their frequencies in a large corpus. A tagged corpus is first partially parsed to identify noun phrases and then a finear grammar is used to estimate the appropriate subcategorization frame for each verb token in the corpus. In an experiment involving the identification of six fixed subcategorization frames, our current system showed more than 80% accuracy. In addition, a new statistical approach substantially improves the accuracy of the frequency estimation.

Automatic acquisition of adjectival subcategorization from corpora

Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics - ACL '05, 2005

This paper describes a novel system for acquiring adjectival subcategorization frames (SCFs) and associated frequency information from English corpus data. The system incorporates a decision-tree classifier for 30 SCF types which tests for the presence of grammatical relations (GRs) in the output of a robust statistical parser. It uses a powerful patternmatching language to classify GRs into frames hierarchically in a way that mirrors inheritance-based lexica. The experiments show that the system is able to detect SCF types with 70% precision and 66% recall rate. A new tool for linguistic annotation of SCFs in corpus data is also introduced which can considerably alleviate the process of obtaining training and test data for subcategorization acquisition.

A system for large-scale acquisition of verbal, nominal and adjectival subcategorization frames from corpora

2007

This paper describes the first system for large-scale acquisition of subcategorization frames (SCFs) from English corpus data which can be used to acquire comprehensive lexicons for verbs, nouns and adjectives. The system incorporates an extensive rulebased classifier which identifies 168 verbal, 37 adjectival and 31 nominal frames from grammatical relations (GRs) output by a robust parser. The system achieves state-ofthe-art performance on all three sets.

Automatic extraction of subcategorization from corpora

1997

We describe a novel technique and implemented system for constructing a subcategorization dictionary from textual corpora. Each dictionary entry encodes the relative frequency of occurrence of a comprehensive set of subcategorization classes for English. An initial experiment, on a sample of 14 verbs which exhibit multiple complementation patterns, demonstrates that the technique achieves accuracy comparable to previous approaches, which are all limited to a highly restricted set of subcategorization classes. We also demonstrate that a subcategorization dictionary built with the system improves the accuracy of a parser by an appreciable amount 1.

A procedure to automatically enrich verbal lexica with subcategorization frames

INTELIGENCIA ARTIFICIAL, 2008

In this paper we introduce a method for automatically assigning subcategorization frames to previously unseen verbs of Spanish, as an aid to syntactical analysis. Since there is not a consensus on the classes of subcategorization frames, we combine supervised and unsupervised learning. We apply clustering techniques to obtain coarse-grained subcategorization classes from an annotated corpus of Spanish, then evaluate these classes and we finally use them to learn a classifier to assign subcategorization frames to the verbs of previously unseen sentences.