Luis Da Costa - Academia.edu

Papers by Luis Da Costa

We aim to support digital humanities work related to the study of sacred texts. To do this, we propose to build a cross-lingual wordnet within the domain of theology. We target the Collaborative Interlingual Index (CILI) directly instead of each individual wordnet. The paper presents background for this proposal: (1) an overview of concepts relevant to theology and (2) a summary of the domain-associated issues observed in the Princeton WordNet (PWN). We have found that definitions for concepts in this domain can be too restrictive, inconsistent, and unclear. Necessary synsets are missing, with the PWN being skewed towards Christianity. We argue that tackling problems in a single domain is a better method for improving CILI. Focusing on a single topic rather than a single language will result in the proper construction of definitions, the romanization and translation of lemmas, and improvements in the use and creation of a cross-lingual domain hierarchy.

Despite the increasing availability of wordnets for ancient languages, such as Ancient Greek and Latin, gaps remain in the coverage of less studied languages of antiquity. This paper reports on the construction and evaluation of a new wordnet for Coptic, the language of Late Roman, Byzantine and Early Islamic Egypt in the first millennium CE. We present our approach to constructing the wordnet, which uses multilingual Coptic dictionaries and wordnets for five different languages. We further discuss the results of this effort and outline our ongoing and future work.

Proceedings of the International Conference on Head-Driven Phrase Structure Grammar

Computational grammars can be adapted to detect ungrammatical sentences, effectively transforming them into error detection (or correction) systems. In this paper we provide a theoretical account of how to adapt implemented HPSG grammars for grammatical error detection. We discuss how a single ungrammatical input can be reconstructed in multiple ways and, in turn, be used to provide specific, high-quality feedback to language learners. We then exemplify this with a few of the most common error classes made by learners of Mandarin Chinese. We conclude with some notes concerning the adaptation and implementation of the methods described here in ZHONG, an open-source HPSG grammar for Mandarin Chinese.

The Global Wordnet Formats have been introduced to enable wordnets to have a common representation that can be integrated through the Global WordNet Grid. As a result of their adoption, a number of shortcomings of the format were identified, and in this paper we describe the extensions to the formats that address these issues. These include: ordering of senses, dependencies between wordnets, pronunciation, syntactic modelling, relations, sense keys, metadata and RDF support. Furthermore, we provide some perspectives on how these changes help in the integration of wordnets.

In this paper we discuss the experience of bringing together over 40 different wordnets. We introduce some extensions to the GWA wordnet LMF format proposed in Vossen et al. (2016) and look at how this new information can be displayed. Notable extensions include: confidence, corpus frequency, orthographic variants, lexicalized and non-lexicalized synsets and lemmas, new parts of speech, and more. Many of these extensions already exist in multiple wordnets; the challenge was to find a compatible representation. To this end, we introduce a new version of the Open Multilingual Wordnet (Bond and Foster, 2013) that integrates a new set of tools that tests the extensions introduced by this new format, while also ensuring the integrity of the Collaborative Interlingual Index (CILI: Bond et al., 2016) by preventing the same new concept from being introduced through multiple projects.

In this paper we present the ongoing efforts to expand the depth and breadth of the Open Multilingual Wordnet coverage by introducing two new classes of non-referential concepts to wordnet hierarchies: interjections and numeral classifiers. The lexical semantic hierarchy pioneered by Princeton Wordnet has traditionally restricted its coverage to referential and contentful classes of words, such as nouns, verbs, adjectives and adverbs. Previous efforts have enriched wordnet resources by including, for example, pronouns, determiners and quantifiers within their hierarchies. Following similar efforts, and motivated by the ongoing semantic annotation of the NTU-Multilingual Corpus, we decided that the four traditional classes of words present in wordnets were too restrictive. Though non-referential, interjections and classifiers possess interesting semantic features that can be well captured by lexical resources like wordnets. In this paper, we will further ...

This paper reports on the development of the Cantonese Wordnet, a new wordnet project based on Hong Kong Cantonese. It is built using the expansion approach, leveraging the existing Chinese Open Wordnet and the Princeton Wordnet’s semantic hierarchy. The main goal of our project was to produce a high-quality, human-curated resource, and this paper reports on the initial efforts and steady progress of our building method. It is our belief that the lexical data made available by this wordnet, including Jyutping romanization, will be useful for a variety of future uses, including many language processing tasks and linguistic research on Cantonese and its interactions with other Chinese dialects.

We present a novel approach to Computer Assisted Language Learning (CALL), using deep syntactic parsers and semantic-based machine translation (MT) to diagnose and provide explicit feedback on language learners’ errors. We are currently developing a proof-of-concept system showing how semantic-based machine translation can, in conjunction with robust computational grammars, be used to interact with students, better understand their language errors, and help students correct their grammar through a series of useful feedback messages and guided language drills. Ultimately, we aim to prove the viability of a new integrated rule-based MT approach to disambiguate students’ intended meaning in a CALL system. This is a necessary step to provide accurate coaching on how to correct ungrammatical input, and it will allow us to overcome a current bottleneck in the field: an exponential burst of ambiguity caused by ambiguous lexical items (Flickinger, 2010). From the users’ interaction wit...

In this paper, we present the ongoing development of CALLIG, a web system that uses improvisation games in Computer Assisted Language Learning (CALL). Improvisation games are structured activities with built-in constraints in which improvisers are asked to generate many different ideas and spontaneously weave a diverse range of elements into a sensible narrative. This paper discusses how computer-based language games can be created by combining improvisation elements and language technology. In contrast with traditional language exercises, improvisational language games are open and unpredictable. CALLIG encourages spontaneity and witty language use. It also provides opportunities for collecting useful data for many NLP applications.

This paper describes the creation of a new annotated learner corpus. The aim is to use this corpus to develop an automated system for corrective feedback on students’ writing. With this system, students will be able to receive timely feedback on language errors before they submit their assignments for grading. A corpus of assignments submitted by first-year engineering students was compiled, and a new error tag set for the NTU Corpus of Learner English (NTUCLE) was developed based on that of the NUS Corpus of Learner English (NUCLE), as well as marking rubrics used at NTU. After a description of the corpus, error tag set and annotation process, the paper presents the results of the annotation exercise as well as follow-up actions. The final error tag set, which is significantly larger than that for the NUCLE error categories, is then presented before a brief conclusion summarising our experience and future plans.

This paper introduces a new web system that integrates English Grammatical Error Detection (GED) and course-specific stylistic guidelines to automatically review and provide feedback on student assignments. The system is being developed as a pedagogical tool for English Scientific Writing. It uses both general NLP methods and high-precision parsers to check student assignments before they are submitted for grading. Instead of generalized error detection, our system aims to identify, with high precision, specific classes of problems that are known to be common among engineering students. Rather than correct the errors, our system generates constructive feedback to help students identify and correct them on their own. A preliminary evaluation of the system’s in-class performance has shown measurable improvements in the quality of student assignments.

We describe the linking of the TUFS Basic Vocabulary Modules, created for online language learning, with the Open Multilingual Wordnet. The TUFS modules have roughly 500 lexical entries in 30 languages, each with the lemma, a link across the languages, an example sentence, usage notes and sound files. The Open Multilingual Wordnet has 34 languages (11 shared with TUFS) organized into synsets linked by semantic relations, with examples and definitions for some languages. The links can be used to (i) evaluate existing wordnets, (ii) add data to these wordnets and (iii) create new open wordnets for Khmer, Korean, Lao, Mongolian, Russian, Tagalog, Urdu and Vietnamese.

Proceedings of the International Conference on Head-Driven Phrase Structure Grammar, 2016

We show how basic copula clauses in Indonesian can be dealt with within the framework of Head-Driven Phrase Structure Grammar (HPSG) (Pollard & Sag, 1994). We analyzed three types of basic copula clauses in Indonesian: copula clauses with noun phrase complements (NP) expressing the notions of 'proper inclusion' and 'equation', adjective phrases (AP) expressing 'attribution', and prepositional phrases (PP) expressing relationships such as 'location'. Our analysis is implemented in the Indonesian Resource Grammar (INDRA), a computational grammar for Indonesian (Moeljadi et al., 2015).

In this paper we discuss an ongoing effort to enrich students’ learning by involving them in sense tagging. The main goal is to lead students to discover how we can represent meaning and where the limits of our current theories lie. A subsidiary goal is to create sense-tagged corpora and an accompanying linked lexicon (in our case wordnets). We present the results of tagging several texts and suggest some ways in which the tagging process could be improved. Two authors of this paper present their own experience as students. Overall, students reported that they found the tagging an enriching experience. The annotated corpora and changes to the wordnet are made available through the NTU multilingual corpus and associated wordnets (NTU-MC).

In languages such as Chinese, classifiers (CLs) play a central role in the quantification of noun phrases. This can be a problem when generating text from input that does not specify the classifier, as in machine translation (MT) from English to Chinese. Many solutions to this problem rely on dictionaries of noun-CL pairs. However, there is no open large-scale machine-tractable dictionary of noun-CL associations. Many published resources exist, but they tend to focus on how a CL is used (e.g. what kinds of nouns can be used with it, or what features seem to be selected by each CL). In fact, since nouns are open-class words, producing an exhaustive, definitive list of noun-CL associations is not possible, as it would quickly become out of date. Our work tries to address this problem by providing an algorithm for automatically building a frequency-based dictionary of noun-CL pairs, mapped to concepts in the Chinese Open Wordnet (Wang and Bond, 2013), an open machine-tractable dictionary f...

Proceedings of ACL-IJCNLP 2015 System Demonstrations, 2015

Wordnets play a central role in many natural language processing tasks. This paper introduces a multilingual editing system for the Open Multilingual Wordnet (OMW: Bond and Foster, 2013). Wordnet development, like most lexicographic tasks, is slow and expensive. Moving away from the original Princeton Wordnet (Fellbaum, 1998) development workflow, wordnet creation and expansion has increasingly been shifting towards an automated and/or interactive system-facilitated task. In the particular case of human editing and expansion of wordnets, a few systems have been developed to aid the lexicographers' work. Unfortunately, most of these tools either have restricted licenses or have been designed with a particular language in mind. We present a web-based system that is capable of multilingual browsing and editing for any of the hundreds of languages made available by the OMW. All tools and guidelines are freely available under an open license.

Proceedings of ACL-IJCNLP 2015 System Demonstrations, 2015

Semantically annotated parallel corpora, though rare, play an increasingly important role in natural language processing. These corpora provide valuable data for computational tasks like sense-based machine translation and word sense disambiguation, as well as for contrastive linguistics and translation studies. In this paper we present the ongoing development of a web-based corpus semantic annotation environment that uses the Open Multilingual Wordnet (Bond and Foster, 2013) as a sense inventory. The system includes interfaces to help coordinate the annotation project and a corpus browsing interface designed specifically to meet the needs of a semantically annotated corpus. The tool was designed to build the NTU-Multilingual Corpus (Tan and Bond, 2012). For the past six years, our tools have been tested and developed in parallel with the semantic annotation of a portion of this corpus in Chinese, English, Japanese and Indonesian. The annotation system is released under an open source license (MIT).

Proceedings of the 12th Global Wordnet Conference, 2023

This paper introduces the Open Cantonese Sense-Tagged Corpus, a new and ongoing project to serve as the companion to the development of the Cantonese Wordnet. This corpus is built on top of the Cantonese Wordnet Corpus, which currently provides example sentences for most verbs in this wordnet. This paper motivates the choice of starting a sense-tagged corpus from both linguistic and educational perspectives, and discusses the current solutions to issues arising from the sense-tagging exercise. In total, we have tagged over 5,000 concepts, with more than 3,700 direct links to the Cantonese Wordnet.
