The Open Cantonese Sense-Tagged Corpus (original) (raw)
Related papers
Building the Cantonese Wordnet
Proceedings of the Tenth Global Wordnet Conference, 2019
This paper reports on the development of the Cantonese Wordnet, a new wordnet project based on Hong Kong Cantonese. It is built using the expansion approach, lever-aging on the existing Chinese Open Wordnet, and the Princeton Wordnet's semantic hierarchy. The main goal of our project was to produce a high quality, human-curated resource-and this paper reports on the initial efforts and steady progress of our building method. It is our belief that the lexical data made available by this wordnet, including Jyutping romaniza-tion, will be useful for a variety of future uses, including many language processing tasks and linguistic research on Cantonese and its interactions with other Chinese dialects.
Building the chinese open wordnet (cow): Starting from core synsets
Princeton WordNet (PWN) is one of the most influential resources for semantic descriptions, and is extensively used in natural language processing. Based on PWN, three Chinese wordnets have been developed: Sinica Bilingual Ontological Wordnet (BOW), Southeast University WordNet (SEW), and Taiwan University WordNet (CWN). We used SEW to sense-tag a corpus, but found some issues with coverage and precision. We decided to make a new Chinese wordnet based on SEW to increase the coverage and accuracy. In addition, a small scale Chinese wordnet was constructed from open multilingual wordnet (OMW) using data from Wiktionary (WIKT). We then merged SEW and WIKT. Starting from core synsets, we formulated guidelines for the new Chinese Open Wordnet (COW). We compared the five Chinese wordnets, which shows that COW is currently the best, but it still has room for further improvement, especially with polysemous words. It is clear that building an accurate semantic resource for a language is not an easy task, but through consistent efforts, we will be able to achieve it. COW is released under the same license as the PWN, an open license that freely allows use, adaptation and redistribution.
Machine Translation, 2000
This paper addresses the problem of automatic acquisition of lexical knowledge for rapid construction of engines for machine translation and embedded multilingual applications. We describe new techniques for large-scale construction of a Chinese-English verb lexicon and we evaluate the coverage and effectiveness of the resulting lexicon. Leveraging off an existing Chinese conceptual database called HowNet and a large, semantically rich English verb database, we use thematic-role information to create links between Chinese concepts and English classes. We apply the metrics of recall and precision to evaluate the coverage and effectiveness of the linguistic resources. The results of this work indicate that: (a) we are able to obtain reliable Chinese-English entries both with and without pre-existing semantic links between the two languages; (b) if we have pre-existing semantic links, we are able to produce a more robust lexical resource by merging these with our semantically rich English database; (c) in our comparisons with manual lexicon creation, our automatic techniques were shown to achieve 62% precision, compared to a much lower precision of 10% for arbitrary assignment of semantic links.
Comparing Sense Categorization Between English PropBank and English WordNet
Given the fact that verbs play a crucial role in language comprehension, this paper presents a study which compares the verb senses in English PropBank with the ones in English WordNet through manual tagging. After analyzing 1554 senses in 1453 distinct verbs, we have found out that while the majority of the senses in Prop-Bank have their one-to-one correspondents in WordNet, a substantial amount of them are differentiated. Furthermore, by analysing the differences between our manually-tagged and an automatically-tagged resource, we claim that manual tagging can help provide better results in sense annotation.
Verb Semantics for English Chinese Translation
A common practice in operational Machine Translation (MT) and Natural Language Processing (NLP) systems is to assume that a verb has a fixed number of senses and rely on a precompiled lexicon to achieve large coverage. This paper demonstrates that this assumption is too weak to cope with the similar problems of lexical divergences between languages and unexpected uses of words that give rise to cases outside of the precompiled lexicon coverage. We first examine the lexical divergences between English verbs and Chinese verbs. We then focus on a specific lexical selection problem -translating English change-of-state verbs into Chinese verb compounds. We show that an accurate translation depends not only on information about the participants, but also on contextual information. Therefore, selectional restrictions on verb arguments lack the necessary power for accurate lexical selection. Second, we examine verb representation theories and practices in MT systems and show that under the fixed sense assumption, the existing representation schemes are not adequate for handling these lexical divergences and extending existing verb senses to unexpected usages. We then propose a method of verb representation based on conceptual lattices which allows the similarities among different verbs in different languages to be quantitatively measured. A prototype system UNICON implements this theory and performs more accurate MT lexical selection for our chosen set of verbs. An additional lexical module for UNICON is also provided that handles sense extension.
This paper outlines the creation of an open combined semantic lexicon as a resource for the study of lexical semantics in the Malay languages (Malaysian and Indonesian). It is created by combining three earlier wordnets, each built using different resources and approaches: the Malay Wordnet (Lim & Hussein 2006), the Indonesian Wordnet (Riza, Budiono & Hakim 2010) and the Wordnet Bahasa (Nurril Hirfana, Sapuan & Bond 2011). The final wordnet has been validated and extended as part of sense annotation of the Indonesian portion of the NTU Multilingual Corpus (Tan & Bond 2012). The wordnet has over 48,000 concepts and 58,000 words for Indonesian and 38,000 concepts and 45,000 words for Malaysian.
Creating the Open Wordnet Bahasa
2011
This paper outlines the creation of the Wordnet Bahasa as a resource for the study of lexical semantics in the Malay language. It is created by combining information from several lexical resources: the French-English-Malay dictionary FEM, the KAmus Melayu-Inggeris KAMI, and wordnets for English, French and Chinese. Construction went through three steps: (i) automatic building of word candidates; (ii) evaluation and selection of acceptable candidates from merging of lexicons; (iii) final hand check of the 5,000 core synsets. Our Wordnet Bahasa is only in the first phase of building a full fledged wordNet and needs to be further expanded, however it is already large enough to be useful for sense tagging both Malay and Indonesian.
Construction of a VerbNet style lexicon for Vietnamese
2020
Lexical resources like VerbNet (Kipper et al., 2006) or similar lexicons play an important role in the applications involving semantic understanding. For Vietnamese, the currently available computational lexicon (VCL) includes morpho-syntactic information for each lexical entry, subcategorization frames and also some semantic constraints for each verb. However, the information related to verb meaning and behaviors is still very far from complete. In this paper, we present our work on the construction of a VerbNet style lexicon for Vietnamese, called viVerbNet, in order to make available a fundamental lexical resource for semantic analysis of Vietnamese language. Each verb entry in viVerbNet is extracted from VCL, and enriched with various information acquired automatically from Vietnamese corpora as well as manually from a comparative investigation of the English VerbNet. At the current stage, we have built semantic components for 50 verb groups resulted from the application of a cl...
Proceedings of the Eighth Workshop on Asian Language Resouces}
Proceedings of the Eighth Workshop on Asian Language Resouces}
In this paper we propose a framework of verb semantic description in order to organize different granularity of similarity between verbs. Since verb meanings highly depend on their arguments we propose a verb thesaurus on the basis of possible shared meanings with predicate-argument structure. Motivations of this work are to (1) construct a practical lexicon for dealing with alternations, paraphrases and entailment relations between predicates, and (2) provide a basic database for statistical learning system as well as a theoretical lexicon study such as Generative Lexicon and Lexical Conceptual Structure. One of the characteristics of our description is that we assume several granularities of semantic classes to characterize verb meanings. The thesaurus form allows us to provide several granularities of shared meanings; thus, this gives us a further revision for applying more detailed analyses of verb meanings.