Learning verb subcategorization from corpora: Counting frame subsets
Related papers
Automatic extraction of subcategorization frames for Czech
Proceedings of the 18th Conference on Computational Linguistics, 2000
The automatic acquisition of frequencies of verb subcategorization frames from tagged corpora
1993
We describe a mechanism for automatically acquiring verb subcategorization frames and their frequencies in a large corpus. A tagged corpus is first partially parsed to identify noun phrases, and then a regular grammar is used to estimate the appropriate subcategorization frame for each verb token in the corpus. In an experiment involving the identification of six fixed subcategorization frames, our current system showed more than 80% accuracy. In addition, a new statistical method enables the system to learn patterns of errors from a set of training samples, substantially improving the accuracy of the frequency estimation.
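To make the matching step concrete, here is a minimal sketch of the kind of regular-grammar frame assignment such a system performs over chunked text. The frame inventory and patterns below are hypothetical toys; the paper's actual six frames and grammar are not given in the abstract.

```python
import re

# Toy frame inventory, expressed as regular patterns over the chunk
# labels that follow a verb (hypothetical; not the paper's actual set).
# Order matters: longer, more specific patterns are tried first.
FRAME_PATTERNS = [
    ("NP NP", re.compile(r"^NP NP")),  # ditransitive: gave [the dog] [a bone]
    ("NP PP", re.compile(r"^NP PP")),  # gave [a bone] [to the dog]
    ("NP",    re.compile(r"^NP")),     # transitive: ate [the bone]
    ("PP",    re.compile(r"^PP")),     # looked [at the dog]
    ("S",     re.compile(r"^S")),      # said [that ...]
    ("NONE",  re.compile(r"^$")),      # intransitive: slept
]

def match_frame(chunks_after_verb):
    """Assign the first matching frame to a verb token, given the
    sequence of chunk labels that follows it in the clause."""
    suffix = " ".join(chunks_after_verb)
    for frame, pattern in FRAME_PATTERNS:
        if pattern.match(suffix):
            return frame
    return "UNKNOWN"

# e.g. "gave the dog a bone" chunked as verb + [NP, NP]
print(match_frame(["NP", "NP"]))  # -> "NP NP"
print(match_frame([]))            # -> "NONE"
```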
Unsupervised Acquisition of Verb Subcategorization Frames from Shallow-Parsed Corpora
2008
In this paper, we report experiments on the unsupervised automatic acquisition of Italian and English verb subcategorization frames (SCFs) from general and domain corpora. The proposed technique operates on syntactically shallow-parsed corpora using a limited number of search heuristics that do not rely on any prior lexico-syntactic knowledge about SCFs. Although preliminary, the reported results are in line with state-of-the-art lexical acquisition systems. We also explored whether verbs sharing similar SCF distributions share similar semantic properties, by clustering verbs whose frames have the same distribution using the Minimum Description Length principle (MDL). First experiments in this direction were carried out on Italian verbs with encouraging results.
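The abstract does not spell out its MDL formulation. As a rough illustration of the principle only, the following sketch scores a clustering of verbs by a two-part description length: a fixed, hypothetical model cost per cluster plus the bits needed to encode each verb's frame counts under its cluster's pooled distribution. A clustering is preferred when it lowers this total.

```python
from math import log2

def pooled_dist(members):
    """Relative-frequency distribution over frames, pooled across the
    frame-count dictionaries of all verbs in one cluster."""
    total = {}
    for counts in members:
        for frame, n in counts.items():
            total[frame] = total.get(frame, 0) + n
    s = sum(total.values())
    return {frame: n / s for frame, n in total.items()}

def data_bits(counts, dist, eps=1e-9):
    """Bits needed to encode one verb's frame observations under its
    cluster's pooled distribution."""
    return -sum(n * log2(dist.get(frame, eps)) for frame, n in counts.items())

def description_length(clusters, model_bits_per_cluster=16):
    """Two-part MDL cost of a clustering: a hypothetical fixed model
    cost per cluster plus the data cost of all member verbs."""
    cost = len(clusters) * model_bits_per_cluster
    for members in clusters:
        dist = pooled_dist(members)
        cost += sum(data_bits(c, dist) for c in members)
    return cost

# Toy frame-count profiles: v1 and v2 behave alike, v3 prefers
# sentential complements.
v1 = {"NP": 8, "NP PP": 2}
v2 = {"NP": 7, "NP PP": 3}
v3 = {"S": 9, "NP": 1}
print(description_length([[v1, v2, v3]]))    # one cluster
print(description_length([[v1, v2], [v3]]))  # v3 split off: shorter total
```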
Automatic extraction of subcategorization from corpora
1997
We describe a novel technique and implemented system for constructing a subcategorization dictionary from textual corpora. Each dictionary entry encodes the relative frequency of occurrence of a comprehensive set of subcategorization classes for English. An initial experiment, on a sample of 14 verbs which exhibit multiple complementation patterns, demonstrates that the technique achieves accuracy comparable to previous approaches, which are all limited to a highly restricted set of subcategorization classes. We also demonstrate that a subcategorization dictionary built with the system improves the accuracy of a parser by an appreciable amount.
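As a sketch of what such relative-frequency dictionary entries might look like, the following hypothetical snippet counts (verb, frame) observations produced by a frame classifier and converts them to relative frequencies; the actual system's class inventory and filtering are far richer.

```python
from collections import Counter, defaultdict

def build_scf_dictionary(observations):
    """Build a subcategorization dictionary mapping each verb to the
    relative frequency of each frame it was observed with.

    `observations` is an iterable of (verb, frame) pairs, e.g. the
    output of a frame classifier run over a parsed corpus."""
    counts = defaultdict(Counter)
    for verb, frame in observations:
        counts[verb][frame] += 1
    dictionary = {}
    for verb, frames in counts.items():
        total = sum(frames.values())
        dictionary[verb] = {f: n / total for f, n in frames.items()}
    return dictionary

# Hypothetical toy input: frames hypothesized for individual verb tokens.
obs = [("give", "NP NP"), ("give", "NP PP"), ("give", "NP PP"), ("eat", "NP")]
print(build_scf_dictionary(obs))
# {'give': {'NP NP': 0.33..., 'NP PP': 0.66...}, 'eat': {'NP': 1.0}}
```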
Automatic acquisition of subcategorization frames from tagged text
Proceedings of the Workshop on Speech and Natural Language (HLT '91), 1991
This paper describes an implemented program that takes a raw, untagged text corpus as its only input (no open-class dictionary) and generates a partial list of verbs occurring in the text and the subcategorization frames (SFs) in which they occur. Verbs are detected by a novel technique based on the Case Filter of Rouvret and Vergnaud (1980). The completeness of the output list increases monotonically with the total number of occurrences of each verb in the corpus. False positive rates are one to three percent of observations. Five SFs are currently detected and more are planned. Ultimately, I expect to provide a large SF dictionary to the NLP community and to train dictionaries for specific corpora.
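A minimal sketch in the spirit of this Case Filter approach, under the simplifying (hypothetical) assumption that pronouns and capitalized proper names are the only NPs we trust: since every overt NP must receive case from an adjacent verb or preposition, an open-class word directly preceding such an NP is a candidate verb.

```python
# Closed-class words that can precede an NP without being verbs
# (prepositions assign case; determiners open an NP). Toy lists only.
PRONOUNS = {"me", "him", "her", "them", "us", "it", "you"}
CLOSED_CLASS = {"to", "of", "in", "on", "at", "for", "with", "the", "a", "an"}

def is_reliable_np(token):
    """Treat pronouns and capitalized words as trustworthy NP starts."""
    return token.lower() in PRONOUNS or token[:1].isupper()

def candidate_verbs(tokens):
    """Yield words that precede a reliable NP and are not closed-class."""
    for prev, cur in zip(tokens, tokens[1:]):
        if is_reliable_np(cur) and prev.lower() not in CLOSED_CLASS:
            yield prev

print(list(candidate_verbs("we saw him near the station".split())))
# -> ['saw']
```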
Influence of Conditional Independence Assumption on Verb Subcategorization Detection
2001
Learning Bayesian belief networks from corpora has been applied to the automatic acquisition of verb subcategorization frames for Modern Greek (MG). We incorporate minimal linguistic resources, i.e. morphological tagging and phrase chunking, since a general-purpose syntactic parser for MG is currently unavailable. Comparative experimental results are evaluated against Naive Bayes classification, which rests on the conditional independence assumption, along with two widely used methods, the log-likelihood ratio (LLR) and a relative-frequency threshold (RFT). We experimented with a balanced corpus in order to ensure unbiased behaviour of the training model. The results show that capturing the inferential dependencies of the training data can improve precision by about 4% over Naive Bayes and by 7% over LLR and RFT. Moreover, we achieve precision exceeding 87% on the identification of subcategorization frames that are not known beforehand, and even limited training data prove sufficient for satisfactory results.
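The abstract does not give its exact LLR formulation; LLR-based filtering of verb-frame pairs is commonly done with Dunning-style binomial log-likelihood ratios, sketched below as an assumption. A high score indicates the frame occurs with this verb far more often than with verbs in general.

```python
from math import log

def _log_l(k, n, p):
    """Binomial log likelihood, guarding the p in {0, 1} edge cases."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    return k * log(p) + (n - k) * log(1 - p)

def llr(k1, n1, k2, n2):
    """Dunning-style log-likelihood ratio for a verb-frame pair.

    k1/n1: frame count and total frame tokens for this verb;
    k2/n2: frame count and total frame tokens for all other verbs."""
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)
    return 2 * (_log_l(k1, n1, p1) + _log_l(k2, n2, p2)
                - _log_l(k1, n1, p) - _log_l(k2, n2, p))

# A frame seen 30 times in 100 tokens of one verb, but only 50 times in
# 10000 tokens of all other verbs, gets a high score (frame is genuine).
print(round(llr(30, 100, 50, 10000), 1))
```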
Flexible Acquisition of Verb Subcategorization Frames in Italian
lrec-conf.org
This paper describes a web-service system for the automatic acquisition of verb subcategorization frames (SCFs) from parsed data in Italian. The system acquires SCFs in an unsupervised manner. We created two gold standards for evaluating the system: the first by combining information from two lexica (one manually created, the other automatically acquired) with manual exploration of corpus data, and the second by annotating data extracted from a specialized corpus (the environment domain). Data filtering is accomplished by means of the maximum likelihood estimate (MLE). In addition, we assign a confidence score to the extracted lexicon entries and evaluate the extractor on domain-specific data. The confidence score allows the final user to easily select entries of the lexicon in terms of their reliability.
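A minimal sketch of MLE-based filtering, assuming a hypothetical cutoff and using the relative frequency itself as a crude confidence score (the paper's actual threshold and scoring scheme are not given in the abstract):

```python
def filter_scf_lexicon(dictionary, threshold=0.05):
    """Keep only frames whose maximum-likelihood estimate of
    P(frame | verb), i.e. its relative frequency, meets a cutoff,
    and attach the estimate itself as a confidence score."""
    lexicon = {}
    for verb, frames in dictionary.items():
        kept = {f: p for f, p in frames.items() if p >= threshold}
        lexicon[verb] = sorted(kept.items(), key=lambda fp: -fp[1])
    return lexicon

# Toy input: relative frequencies of hypothesized frames for one verb.
scf = {"give": {"NP NP": 0.33, "NP PP": 0.62, "PP": 0.05, "S": 0.004}}
print(filter_scf_lexicon(scf))
# {'give': [('NP PP', 0.62), ('NP NP', 0.33), ('PP', 0.05)]}
```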
Computers and the Humanities, 2004
Current term recognition algorithms have centred mostly on the notion of the term, based on the assumption that terms are monoreferential and as such independent of context. The characteristics and behaviour of terms in real texts are, however, far removed from this ideal, because factors such as text type or communicative situation greatly influence the linguistic realisation of a concept. Context, therefore, is important for the correct identification of terms (Dubuc and Lauriston, 1997). Based on this assumption, we have shifted our emphasis from terms towards the surrounding linguistic context, namely verbs, as verbs are considered the central elements in the sentence. More specifically, we set out to examine whether verbs, and verbal syntax in particular, could help with the task of automatic term recognition. Our findings suggest that term occurrence varies significantly across different argument structures and syntactic positions. Additionally, deviant grammatical structures have proved to be rich environments for terms. The analysis was carried out on three different specialised subcorpora in order to explore how the effectiveness of verbal syntax as a potential indicator of term occurrence can be constrained by factors such as subject matter and text type.