A system for large-scale acquisition of verbal, nominal and adjectival subcategorization frames from corpora

Automatic acquisition of adjectival subcategorization from corpora

Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics - ACL '05, 2005

This paper describes a novel system for acquiring adjectival subcategorization frames (SCFs) and associated frequency information from English corpus data. The system incorporates a decision-tree classifier for 30 SCF types which tests for the presence of grammatical relations (GRs) in the output of a robust statistical parser. It uses a powerful pattern-matching language to classify GRs into frames hierarchically, in a way that mirrors inheritance-based lexica. The experiments show that the system is able to detect SCF types with 70% precision and 66% recall. A new tool for linguistic annotation of SCFs in corpus data is also introduced, which can considerably ease the process of obtaining training and test data for subcategorization acquisition.
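The hierarchical classification described in this abstract can be illustrated with a minimal sketch. It is not the paper's classifier: the frame names, the GR labels (`pcomp`, `ccomp`) and the tiny hierarchy are hypothetical, chosen only to show how testing for the presence of GRs can walk from a general frame down to the most specific one that still matches.

```python
from typing import Dict, FrozenSet, List

# Hypothetical frame hierarchy (illustrative only, not the paper's 30 SCF types).
# Each frame lists the GR labels that must be present for it to apply.
REQUIRED: Dict[str, FrozenSet[str]] = {
    "ADJ": frozenset(),
    "ADJ-PP": frozenset({"pcomp"}),
    "ADJ-SCOMP": frozenset({"ccomp"}),
    "ADJ-PP-SCOMP": frozenset({"pcomp", "ccomp"}),
}

# Children inherit from their parent, as in an inheritance-based lexicon.
CHILDREN: Dict[str, List[str]] = {
    "ADJ": ["ADJ-PP", "ADJ-SCOMP"],
    "ADJ-PP": ["ADJ-PP-SCOMP"],
    "ADJ-SCOMP": [],
    "ADJ-PP-SCOMP": [],
}


def classify(grs: FrozenSet[str], root: str = "ADJ") -> str:
    """Descend from the root frame to the most specific frame whose
    required GRs are all present in the parser output `grs`."""
    frame = root
    while True:
        matches = [c for c in CHILDREN[frame] if REQUIRED[c] <= grs]
        if not matches:
            return frame
        frame = matches[0]  # take the first matching child


print(classify(frozenset({"pcomp", "ccomp"})))
print(classify(frozenset({"ccomp"})))
print(classify(frozenset()))
```

When no child matches, the walk stops at the current frame, so an adjective with no recognised complements falls back to the root frame.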

Automatic acquisition of subcategorization frames from tagged text

Proceedings of the workshop on Speech and Natural Language - HLT '91, 1991

This paper describes an implemented program that takes a raw, untagged text corpus as its only input (no open-class dictionary) and generates a partial list of verbs occurring in the text and the subcategorization frames (SFs) in which they occur. Verbs are detected by a novel technique based on the Case Filter of Rouvret and Vergnaud (1980). The completeness of the output list increases monotonically with the total number of occurrences of each verb in the corpus. False positive rates are one to three percent of observations. Five SFs are currently detected and more are planned. Ultimately, I expect to provide a large SF dictionary to the NLP community and to train dictionaries for specific corpora.

The automatic acquisition of frequencies of verb subcategorization frames from tagged corpora

1993

We describe a mechanism for automatically acquiring verb subcategorization frames and their frequencies in a large corpus. A tagged corpus is first partially parsed to identify noun phrases and then a linear grammar is used to estimate the appropriate subcategorization frame for each verb token in the corpus. In an experiment involving the identification of six fixed subcategorization frames, our current system showed more than 80% accuracy. In addition, a new statistical approach substantially improves the accuracy of the frequency estimation.

Generalizing Subcategorization Frames Acquired from Corpora Using Lexicalized Grammars

2004

This paper presents a method of improving the quality of subcategorization frames (SCFs) acquired from corpora in order to augment the lexicon of a lexicalized grammar. We first estimate, for each word, a confidence value that the word can take each SCF, and create an SCF confidence-value vector for each word. Since the SCF confidence vectors obtained from the lexicon of the target grammar reflect co-occurrence tendencies among SCFs for words, we can improve the quality of the acquired SCFs by clustering vectors obtained from the acquired SCF lexicon and the lexicon of the target grammar. We apply our method to SCFs acquired from corpora by using a subset of the SCF lexicon of the XTAG English grammar. A comparison between the resulting SCF lexicon and the rest of the lexicon of the XTAG English grammar reveals that we can achieve higher precision and recall than with a naive frequency cutoff.
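For orientation, the naive frequency cutoff that this abstract uses as its baseline can be sketched as follows, together with type-level precision and recall against a gold-standard frame set. This is an illustrative sketch only: the counts, frame labels, gold set and the 0.05 threshold are made-up assumptions, and the paper's own clustering method is not shown.

```python
from typing import Dict, Set, Tuple


def relative_frequencies(counts: Dict[str, int]) -> Dict[str, float]:
    """Turn raw SCF counts for one word into relative frequencies."""
    total = sum(counts.values())
    return {scf: n / total for scf, n in counts.items()}


def cutoff_filter(counts: Dict[str, int], threshold: float = 0.05) -> Set[str]:
    """Keep only the SCFs whose relative frequency reaches the threshold."""
    return {scf for scf, rf in relative_frequencies(counts).items() if rf >= threshold}


def precision_recall(predicted: Set[str], gold: Set[str]) -> Tuple[float, float]:
    """Type-level precision and recall of predicted SCFs against a gold set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Made-up counts for a single verb (hypothetical data).
counts = {"NP": 60, "NP-PP": 30, "SCOMP": 8, "NP-NP": 2}
gold = {"NP", "NP-PP", "NP-NP"}

kept = cutoff_filter(counts, threshold=0.05)
precision, recall = precision_recall(kept, gold)
print(sorted(kept), round(precision, 2), round(recall, 2))
```

In this toy case the cutoff discards a rare but correct frame (`NP-NP`) and keeps a more frequent incorrect one (`SCOMP`), which is the kind of error a plain threshold cannot avoid.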

Automatic extraction of subcategorization from corpora

1997

We describe a novel technique and implemented system for constructing a subcategorization dictionary from textual corpora. Each dictionary entry encodes the relative frequency of occurrence of a comprehensive set of subcategorization classes for English. An initial experiment, on a sample of 14 verbs which exhibit multiple complementation patterns, demonstrates that the technique achieves accuracy comparable to previous approaches, which are all limited to a highly restricted set of subcategorization classes. We also demonstrate that a subcategorization dictionary built with the system improves the accuracy of a parser by an appreciable amount.

A procedure to automatically enrich verbal lexica with subcategorization frames

INTELIGENCIA ARTIFICIAL, 2008

In this paper we introduce a method for automatically assigning subcategorization frames to previously unseen verbs of Spanish, as an aid to syntactic analysis. Since there is no consensus on the classes of subcategorization frames, we combine supervised and unsupervised learning. We apply clustering techniques to obtain coarse-grained subcategorization classes from an annotated corpus of Spanish, evaluate these classes, and finally use them to train a classifier that assigns subcategorization frames to the verbs of previously unseen sentences.

A large subcategorization lexicon for natural language processing applications

2006

We introduce a large computational subcategorization lexicon which includes subcategorization frame (SCF) and frequency information for 6,397 English verbs. This extensive lexicon was acquired automatically from five corpora and the Web using the current version of the comprehensive subcategorization acquisition system of . The lexicon is provided freely for research use, along with a script which can be used to filter and build sub-lexicons suited for different natural language processing (NLP) purposes. Documentation is also provided which explains each sub-lexicon option and evaluates its accuracy.

11 Compare e.g. with the results reported in . Keeping in mind that the results are not fully comparable to the ones reported here (a smaller test corpus was used, containing only 91 verbs), F-measure was 5-10 better, even though an older and less accurate version of the same system was used. The main reason for this is the better fit between the smaller test data and the gold standard.

Subcategorisation Acquisition from Raw Text for a Free Word-Order Language

Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, 2014

We describe a state-of-the-art automatic system that can acquire subcategorisation frames from raw text for a free word-order language. We use it to construct a subcategorisation lexicon of German verbs from a large Web page corpus. With an automatic verb classification paradigm we evaluate our subcategorisation lexicon against a previous classification of German verbs; the lexicon produced by our system performs better than the best previous results.

Automatic extraction of subcategorization frames for Czech

Proceedings of the 18th conference on Computational linguistics, 2000

Automatic acquisition of a large subcategorization dictionary from corpora

Proceedings of the 31st annual meeting on Association for Computational Linguistics, 1993

This paper presents a new method for producing a dictionary of subcategorization frames from unlabelled text corpora. It is shown that statistical filtering of the results of a finite state parser running on the output of a stochastic tagger produces high quality results, despite the error rates of the tagger and the parser. Further, it is argued that this method can be used to learn all subcategorization frames, whereas previous methods are not extensible to a general solution to the problem.
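The abstract does not say which statistical filter is used. One common choice in this line of work is a binomial hypothesis test, sketched below under that assumption: a frame is accepted for a verb only if the number of times its cue was observed is unlikely to be explained by tagger and parser errors alone. The error rate, significance level and counts are hypothetical values chosen for illustration.

```python
from math import comb


def binomial_tail(m: int, n: int, p: float) -> float:
    """P(X >= m) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(m, n + 1))


def accept_frame(cue_count: int, verb_count: int,
                 error_rate: float = 0.05, alpha: float = 0.05) -> bool:
    """Accept the frame if seeing `cue_count` or more cues in `verb_count`
    occurrences would be improbable (below `alpha`) when every cue is just
    noise arising with probability `error_rate`."""
    return binomial_tail(cue_count, verb_count, error_rate) < alpha


# Hypothetical counts: 10 cues in 50 occurrences versus 1 cue in 50.
print(accept_frame(10, 50))
print(accept_frame(1, 50))
```

A single cue in 50 occurrences is easily explained by noise at a 5% error rate, so it is rejected, whereas 10 cues in 50 are not.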