A procedure to automatically enrich verbal lexica with subcategorization frames

Towards a semantic classification of Spanish verbs based on subcategorisation information

Proceedings of the ACL 2004 on Student research workshop, 2004

We present experiments aiming at an automatic classification of Spanish verbs into lexical semantic classes. We apply to Spanish well-known techniques that were developed for English, showing that empirical methods can be reused across languages without substantial changes in the methodology. Our results on subcategorisation acquisition compare favourably to the state of the art for English. For the verb classification task, we use a hierarchical clustering algorithm, and we compare the output clusters to a manually constructed classification.
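As a rough illustration of the clustering step this abstract mentions, the sketch below groups verbs by agglomerative (bottom-up hierarchical) clustering over subcategorisation-frame frequency vectors. The abstract does not state the distance measure, linkage criterion, or feature set, so cosine distance, average linkage, the three frame dimensions, and all numbers here are assumptions chosen for illustration only.

```python
from itertools import combinations
from math import sqrt


def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def agglomerative(vectors, n_clusters):
    """Average-linkage agglomerative clustering over a dict verb -> vector."""
    clusters = [[verb] for verb in vectors]

    def linkage(pair):
        # Mean pairwise distance between the members of two clusters.
        a, b = pair
        dists = [cosine_distance(vectors[x], vectors[y]) for x in a for y in b]
        return sum(dists) / len(dists)

    while len(clusters) > n_clusters:
        # Merge the two closest clusters and repeat.
        a, b = min(combinations(clusters, 2), key=linkage)
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
    return clusters


# Hypothetical relative frequencies over three frames:
# [intransitive, NP object, NP object + PP]
scf_vectors = {
    "dormir": [0.9, 0.1, 0.0],
    "llegar": [0.8, 0.1, 0.1],
    "comer": [0.3, 0.7, 0.0],
    "poner": [0.0, 0.2, 0.8],
    "meter": [0.0, 0.3, 0.7],
}
clusters = agglomerative(scf_vectors, 3)
print(sorted(sorted(c) for c in clusters))
```

With these made-up vectors, the mostly intransitive verbs end up together, as do the two verbs that favour an NP plus PP frame. Evaluating against a manual classification would then amount to comparing these groups with gold classes.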

Obtaining coarse-grained classes of subcategorization patterns for Spanish

RANLP 2007, 2007

In this paper we introduce a method for automatically assigning a subcategorization frame to each verb in a grammar for deep parsing of Spanish. Our final objective is to learn a classifier to assign subcategorization frames to previously unseen verbs for which this information is not available in a hand-made lexicon. To do that, we first need to establish classes of equivalence of verbs according to their subcategorization frames. In this paper we describe how we apply clustering techniques to obtain coarse-grained subcategorization classes from an annotated corpus of Spanish and propose a methodology to evaluate them for the application of assigning subcategorization to previously unseen verbs.

Semi-automatic Generation of Subcategorization Frames for Spanish Verbs Using Ontologies and Verbs Functional Class

Journal of …, 2009

This work deals with the semi-automatic generation of subcategorization frames (SCFs) of Spanish verbs; specifically, given a set of Spanish verbs and their respective senses, their SCFs are obtained. The acquisition of SCFs in Spanish has been approached in different works: in some, the frames are generated manually, while in others they are obtained semi-automatically from a tagged corpus; unfortunately, in the latter case the results depend on the characteristics of the texts used. The method proposed in this document combines an ontology-based approach (through lexical relations of verbs) and linguistic knowledge (functional class of verbs). The relations among base verbs and other verbs were obtained from the Spanish WordNet ontology, which contains lexical relations among words. Also, the existing relation between the SCF and the functional class of verbs was used to generate the SCFs. In order to evaluate the method, the SCFs for 44 base verbs were generated manually, from which 239 SCFs were automatically generated and validated, yielding an accuracy of 89.38%.

Automatic acquisition of subcategorization frames from tagged text

Proceedings of the workshop on Speech and Natural Language - HLT '91, 1991

This paper describes an implemented program that takes a raw, untagged text corpus as its only input (no open-class dictionary) and generates a partial list of verbs occurring in the text and the subcategorization frames (SFs) in which they occur. Verbs are detected by a novel technique based on the Case Filter of Rouvret and Vergnaud (1980). The completeness of the output list increases monotonically with the total number of occurrences of each verb in the corpus. False positive rates are one to three percent of observations. Five SFs are currently detected and more are planned. Ultimately, I expect to provide a large SF dictionary to the NLP community and to train dictionaries for specific corpora.

Generalizing Subcategorization Frames Acquired from Corpora Using Lexicalized Grammars

2004

This paper presents a method of improving the quality of subcategorization frames (SCFs) acquired from corpora in order to augment a lexicon of a lexicalized grammar. We first estimate, for each word, a confidence value for each SCF that the word can take, and create an SCF confidence-value vector for each word. Since the SCF confidence vectors obtained from the lexicon of the target grammar capture co-occurrence tendencies among SCFs for words, we can improve the quality of the acquired SCFs by clustering vectors obtained from the acquired SCF lexicon and the lexicon of the target grammar. We apply our method to SCFs acquired from corpora by using a subset of the SCF lexicon of the XTAG English grammar. A comparison between the resulting SCF lexicon and the rest of the lexicon of the XTAG English grammar reveals that we can achieve higher precision and recall compared to a naive frequency cutoff.
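For orientation, the naive frequency cutoff that this abstract uses as its point of comparison can be sketched as follows. This is only the baseline, not the clustering method the paper proposes. The threshold value, the frame labels, and the toy counts are assumptions made up for the example.

```python
from collections import Counter


def frequency_cutoff(observations, threshold):
    """Keep, per verb, the SCFs whose relative frequency meets the threshold."""
    lexicon = {}
    for verb, frames in observations.items():
        counts = Counter(frames)
        total = sum(counts.values())
        lexicon[verb] = {f for f, c in counts.items() if c / total >= threshold}
    return lexicon


def precision_recall(acquired, gold):
    """Precision and recall of (verb, SCF) pairs against a gold lexicon."""
    true_pos = sum(len(frames & gold.get(verb, set())) for verb, frames in acquired.items())
    n_acquired = sum(len(frames) for frames in acquired.values())
    n_gold = sum(len(frames) for frames in gold.values())
    precision = true_pos / n_acquired if n_acquired else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    return precision, recall


# Hypothetical frame observations for one verb (20 tokens in total).
observations = {"give": ["NP_NP"] * 12 + ["NP_PP"] * 7 + ["NP"] * 1}
# Hypothetical gold lexicon entry for the same verb.
gold = {"give": {"NP_NP", "NP_PP", "NP"}}

acquired = frequency_cutoff(observations, threshold=0.1)
precision, recall = precision_recall(acquired, gold)
print(acquired, precision, recall)
```

The toy run shows the typical weakness of a hard cutoff: the rare but valid frame is discarded, so precision stays high while recall drops.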

The automatic acquisition of frequencies of verb subcategorization frames from tagged corpora

1993

We describe a mechanism for automatically acquiring verb subcategorization frames and their frequencies in a large corpus. A tagged corpus is first partially parsed to identify noun phrases and then a linear grammar is used to estimate the appropriate subcategorization frame for each verb token in the corpus. In an experiment involving the identification of six fixed subcategorization frames, our current system showed more than 80% accuracy. In addition, a new statistical approach substantially improves the accuracy of the frequency estimation.

Unsupervised Acquisition of Verb Subcategorization Frames from Shallow-Parsed Corpora

2008

In this paper, we report experiments on the unsupervised automatic acquisition of Italian and English verb subcategorization frames (SCFs) from general and domain corpora. The proposed technique operates on syntactically shallow-parsed corpora on the basis of a limited number of search heuristics that do not rely on any previous lexico-syntactic knowledge about SCFs. Although preliminary, the reported results are in line with state-of-the-art lexical acquisition systems. The issue of whether verbs sharing similar SCF distributions also share similar semantic properties was explored by clustering verbs that share frames with the same distribution using the Minimum Description Length (MDL) principle. First experiments in this direction were carried out on Italian verbs with encouraging results.

A system for large-scale acquisition of verbal, nominal and adjectival subcategorization frames from corpora

2007

This paper describes the first system for large-scale acquisition of subcategorization frames (SCFs) from English corpus data which can be used to acquire comprehensive lexicons for verbs, nouns and adjectives. The system incorporates an extensive rule-based classifier which identifies 168 verbal, 37 adjectival and 31 nominal frames from grammatical relations (GRs) output by a robust parser. The system achieves state-of-the-art performance on all three sets.

Frequency estimation of verb subcategorization frames based on syntactic and multidimensional statistical analysis

1993

We describe a mechanism for automatically estimating frequencies of verb subcategorization frames in a large corpus. A tagged corpus is first partially parsed to identify noun phrases and then a regular grammar is used to estimate the appropriate subcategorization frame for each verb token in the corpus. In an experiment involving the identification of six fixed subcategorization frames, our current system showed more than 80% accuracy. In addition, a new statistical method enables the system to learn patterns of errors based on a set of training samples and substantially improves the accuracy of the frequency estimation.
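The abstract above does not spell out its statistical method, so the following is not a reconstruction of it. It is a deliberately simple stand-in that illustrates the general idea of learning error patterns from training samples and using them to correct frame frequency estimates: count how often each predicted frame corresponds to each true frame, then redistribute raw predicted counts accordingly. The frame labels and all counts are hypothetical.

```python
from collections import Counter, defaultdict


def learn_error_patterns(training_pairs):
    """Estimate P(true frame | predicted frame) from (predicted, true) pairs."""
    table = defaultdict(Counter)
    for predicted, true in training_pairs:
        table[predicted][true] += 1
    return {
        predicted: {true: n / sum(trues.values()) for true, n in trues.items()}
        for predicted, trues in table.items()
    }


def corrected_frequencies(predicted_counts, patterns):
    """Redistribute raw predicted counts according to the learned patterns."""
    corrected = Counter()
    for predicted, count in predicted_counts.items():
        # Frames never seen in training are assumed to be predicted correctly.
        for true, p in patterns.get(predicted, {predicted: 1.0}).items():
            corrected[true] += count * p
    return dict(corrected)


# Hypothetical hand-checked training sample of (predicted, true) frame pairs.
training = (
    [("NP", "NP")] * 8
    + [("NP", "NP_PP")] * 2
    + [("NP_PP", "NP_PP")] * 9
    + [("NP_PP", "NP")] * 1
)
patterns = learn_error_patterns(training)

# Hypothetical raw frame counts produced by the parser for one verb.
estimates = corrected_frequencies({"NP": 100, "NP_PP": 50}, patterns)
print(estimates)
```

The correction keeps the total number of verb tokens unchanged and only shifts mass between frames, which is why it affects frequency estimates rather than the per-token accuracy of the frame assignment.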

Learning verb subcategorization from corpora: Counting frame subsets

2000

We present some novel machine learning techniques for the identification of subcategorization information for verbs in Czech. We compare three different statistical techniques applied to this problem. We show how the learning algorithm can be used to discover previously unknown subcategorization frames from the Czech Prague Dependency Treebank. The algorithm can then be used to label dependents of a verb in the Czech treebank as either arguments or adjuncts. Using our techniques, we are able to achieve 88% accuracy on unseen parsed text.