Luis Da Costa - Academia.edu

Papers by Luis Da Costa

Using rich models of language in grammatical error detection

Enhancing the Collaborative Interlingual Index for Digital Humanities: Cross-linguistic Analysis in the Domain of Theology

We aim to support digital humanities work related to the study of sacred texts. To do this, we propose to build a cross-lingual wordnet within the domain of theology. We target the Collaborative Interlingual Index (CILI) directly instead of each individual wordnet. The paper presents background for this proposal: (1) an overview of concepts relevant to theology and (2) a summary of the domain-associated issues observed in the Princeton WordNet (PWN). We have found that definitions for concepts in this domain can be too restrictive, inconsistent, and unclear. Necessary synsets are missing, with the PWN being skewed towards Christianity. We argue that tackling problems in a single domain is a better method for improving CILI. Focusing on a single topic rather than a single language will result in the proper construction of definitions, romanization/translation of lemmas, and improvements in the use and creation of a cross-lingual domain hierarchy.

The Making of Coptic Wordnet

Despite the increasing availability of wordnets for ancient languages, such as Ancient Greek and Latin, gaps remain in the coverage of less studied languages of antiquity. This paper reports on the construction and evaluation of a new wordnet for Coptic, the language of Late Roman, Byzantine and Early Islamic Egypt in the first millennium CE. We present our approach to constructing the wordnet, which uses multilingual Coptic dictionaries and wordnets for five different languages. We further discuss the results of this effort and outline our ongoing and future work.

Grammatical error detection using HPSG grammars: Diagnosing common Mandarin Chinese grammatical errors

Proceedings of the International Conference on Head-Driven Phrase Structure Grammar

Computational Grammars can be adapted to detect ungrammatical sentences, effectively transforming them into error detection (or correction) systems. In this paper we provide a theoretical account of how to adapt implemented HPSG grammars for grammatical error detection. We discuss how a single ungrammatical input can be reconstructed in multiple ways and, in turn, be used to provide specific, high-quality feedback to language learners. We then move on to exemplify this with a few of the most common error classes made by learners of Mandarin Chinese. We conclude with some notes concerning the adaptation and implementation of the methods described here in ZHONG, an open-source HPSG grammar for Mandarin Chinese.

The GlobalWordNet Formats: Updates for 2020

The Global Wordnet Formats have been introduced to enable wordnets to have a common representation that can be integrated through the Global WordNet Grid. As a result of their adoption, a number of shortcomings of the format were identified, and in this paper we describe the extensions to the formats that address these issues. These include: ordering of senses, dependencies between wordnets, pronunciation, syntactic modelling, relations, sense keys, metadata and RDF support. Furthermore, we provide some perspectives on how these changes help in the integration of wordnets.

Some Issues with Building a Multilingual Wordnet

In this paper we discuss the experience of bringing together over 40 different wordnets. We introduce some extensions to the GWA wordnet LMF format proposed in Vossen et al. (2016) and look at how this new information can be displayed. Notable extensions include: confidence, corpus frequency, orthographic variants, lexicalized and non-lexicalized synsets and lemmas, new parts of speech, and more. Many of these extensions already exist in multiple wordnets – the challenge was to find a compatible representation. To this end, we introduce a new version of the Open Multilingual Wordnet (Bond and Foster, 2013) that integrates a new set of tools that tests the extensions introduced by this new format, while also ensuring the integrity of the Collaborative Interlingual Index (CILI: Bond et al., 2016), so that the same new concept is not introduced through multiple projects.

Wow! What a Useful Extension! Introducing Non-Referential Concepts to Wordnet

In this paper we present the ongoing efforts to expand the depth and breadth of the Open Multilingual Wordnet coverage by introducing two new classes of non-referential concepts to wordnet hierarchies: interjections and numeral classifiers. The lexical semantic hierarchy pioneered by Princeton Wordnet has traditionally restricted its coverage to referential and contentful classes of words, such as nouns, verbs, adjectives and adverbs. Previous efforts have been made to enrich wordnet resources including, for example, the inclusion of pronouns, determiners and quantifiers within their hierarchies. Following similar efforts, and motivated by the ongoing semantic annotation of the NTU-Multilingual Corpus, we decided that the four traditional classes of words present in wordnets were too restrictive. Though non-referential, interjections and classifiers possess interesting semantic features that can be well captured by lexical resources like wordnets. In this paper, we will further ...

Building the Cantonese Wordnet

This paper reports on the development of the Cantonese Wordnet, a new wordnet project based on Hong Kong Cantonese. It is built using the expansion approach, leveraging the existing Chinese Open Wordnet and the Princeton Wordnet’s semantic hierarchy. The main goal of our project was to produce a high-quality, human-curated resource, and this paper reports on the initial efforts and steady progress of our building method. It is our belief that the lexical data made available by this wordnet, including Jyutping romanization, will be useful for a variety of future uses, including many language processing tasks and linguistic research on Cantonese and its interactions with other Chinese dialects.

Syntactic Well-Formedness Diagnosis and Error-Based Coaching in Computer Assisted Language Learning using Machine Translation

We present a novel approach to Computer Assisted Language Learning (CALL), using deep syntactic parsers and semantic-based machine translation (MT) in diagnosing and providing explicit feedback on language learners’ errors. We are currently developing a proof-of-concept system showing how semantic-based machine translation can, in conjunction with robust computational grammars, be used to interact with students, better understand their language errors, and help students correct their grammar through a series of useful feedback messages and guided language drills. Ultimately, we aim to prove the viability of a new integrated rule-based MT approach to disambiguate students’ intended meaning in a CALL system. This is a necessary step to provide accurate coaching on how to correct ungrammatical input, and it will allow us to overcome a current bottleneck in the field: an exponential burst of ambiguity caused by ambiguous lexical items (Flickinger, 2010). From the users’ interaction wit...

CALLIG: Computer Assisted Language Learning using Improvisation Games

In this paper, we present the ongoing development of CALLIG – a web system that uses improvisation games in Computer Assisted Language Learning (CALL). Improvisation games are structured activities with built-in constraints where improvisers are asked to generate a lot of different ideas and weave a diverse range of elements into a sensible narrative spontaneously. This paper discusses how computer-based language games can be created combining improvisation elements and language technology. In contrast with traditional language exercises, improvisational language games are open and unpredictable. CALLIG encourages spontaneity and witty language use. It also provides opportunities for collecting useful data for many NLP applications.

NTUCLE: Developing a Corpus of Learner English to Provide Writing Support for Engineering Students

This paper describes the creation of a new annotated learner corpus. The aim is to use this corpus to develop an automated system for corrective feedback on students’ writing. With this system, students will be able to receive timely feedback on language errors before they submit their assignments for grading. A corpus of assignments submitted by first-year engineering students was compiled, and a new error tag set for the NTU Corpus of Learner English (NTUCLE) was developed based on that of the NUS Corpus of Learner English (NUCLE), as well as marking rubrics used at NTU. After a description of the corpus, error tag set and annotation process, the paper presents the results of the annotation exercise as well as follow-up actions. The final error tag set, which is significantly larger than that for the NUCLE error categories, is then presented before a brief conclusion summarising our experience and future plans.

Automated Writing Support Using Deep Linguistic Parsers

This paper introduces a new web system that integrates English Grammatical Error Detection (GED) and course-specific stylistic guidelines to automatically review and provide feedback on student assignments. The system is being developed as a pedagogical tool for English Scientific Writing. It uses both general NLP methods and high-precision parsers to check student assignments before they are submitted for grading. Instead of generalized error detection, our system aims to identify, with high precision, specific classes of problems that are known to be common among engineering students. Rather than correct the errors, our system generates constructive feedback to help students identify and correct them on their own. A preliminary evaluation of the system’s in-class performance has shown measurable improvements in the quality of student assignments.

Linking the TUFS Basic Vocabulary to the Open Multilingual Wordnet

We describe the linking of the TUFS Basic Vocabulary Modules, created for online language learning, with the Open Multilingual Wordnet. The TUFS modules have roughly 500 lexical entries in 30 languages, each with the lemma, a link across the languages, an example sentence, usage notes and sound files. The Open Multilingual Wordnet has 34 languages (11 shared with TUFS) organized into synsets linked by semantic relations, with examples and definitions for some languages. The links can be used to (i) evaluate existing wordnets, (ii) add data to these wordnets and (iii) create new open wordnets for Khmer, Korean, Lao, Mongolian, Russian, Tagalog, Urdu and Vietnamese.

Basic copula clauses in Indonesian

Proceedings of the International Conference on Head-Driven Phrase Structure Grammar, 2016

We want to show how basic copula clauses in Indonesian can be dealt with within the framework of Head-Driven Phrase Structure Grammar (HPSG) (Pollard & Sag, 1994). We analyzed three types of basic copula clauses in Indonesian: copula clauses with noun phrase complements (NP) expressing the notions of 'proper inclusion' and 'equation', adjective phrases (AP) expressing 'attribution', and prepositional phrases (PP) expressing relationships such as 'location'. Our analysis is implemented in the Indonesian Resource Grammar (INDRA), a computational grammar for Indonesian (Moeljadi et al., 2015).

Teaching Through Tagging – Interactive Lexical Semantics

In this paper we discuss an ongoing effort to enrich students’ learning by involving them in sense tagging. The main goal is to lead students to discover how we can represent meaning and where the limits of our current theories lie. A subsidiary goal is to create sense-tagged corpora and an accompanying linked lexicon (in our case wordnets). We present the results of tagging several texts and suggest some ways in which the tagging process could be improved. Two authors of this paper present their own experience as students. Overall, students reported that they found the tagging an enriching experience. The annotated corpora and changes to the wordnet are made available through the NTU multilingual corpus and associated wordnets (NTU-MC).

Mapping and Generating Classifiers using an Open Chinese Ontology

In languages such as Chinese, classifiers (CLs) play a central role in the quantification of noun phrases. This can be a problem when generating text from input that does not specify the classifier, as in machine translation (MT) from English to Chinese. Many solutions to this problem rely on dictionaries of noun-CL pairs. However, there is no open large-scale machine-tractable dictionary of noun-CL associations. Many published resources exist, but they tend to focus on how a CL is used (e.g. what kinds of nouns can be used with it, or what features seem to be selected by each CL). In fact, since nouns are open-class words, producing an exhaustive, definitive list of noun-CL associations is not possible, since it would quickly get out of date. Our work tries to address this problem by providing an algorithm for automatic building of a frequency-based dictionary of noun-CL pairs, mapped to concepts in the Chinese Open Wordnet (Wang and Bond, 2013), an open machine-tractable dictionary f...

OMWEdit - The Integrated Open Multilingual Wordnet Editing System

Proceedings of ACL-IJCNLP 2015 System Demonstrations, 2015

Wordnets play a central role in many natural language processing tasks. This paper introduces a multilingual editing system for the Open Multilingual Wordnet (OMW: Bond and Foster, 2013). Wordnet development, like most lexicographic tasks, is slow and expensive. Moving away from the original Princeton Wordnet (Fellbaum, 1998) development workflow, wordnet creation and expansion has increasingly been shifting towards an automated and/or interactive system-facilitated task. In the particular case of human editing/expansion of wordnets, a few systems have been developed to aid the lexicographers' work. Unfortunately, most of these tools either have restricted licenses or have been designed with a particular language in mind. We present a web-based system that is capable of multilingual browsing and editing for any of the hundreds of languages made available by the OMW. All tools and guidelines are freely available under an open license.

IMI – A Multilingual Semantic Annotation Environment

Proceedings of ACL-IJCNLP 2015 System Demonstrations, 2015

Semantically annotated parallel corpora, though rare, play an increasingly important role in natural language processing. These corpora provide valuable data for computational tasks like sense-based machine translation and word sense disambiguation, and also for contrastive linguistics and translation studies. In this paper we present the ongoing development of a web-based corpus semantic annotation environment that uses the Open Multilingual Wordnet (Bond and Foster, 2013) as a sense inventory. The system includes interfaces to help coordinate the annotation project and a corpus browsing interface designed specifically to meet the needs of a semantically annotated corpus. The tool was designed to build the NTU-Multilingual Corpus (Tan and Bond, 2012). For the past six years, our tools have been tested and developed in parallel with the semantic annotation of a portion of this corpus in Chinese, English, Japanese and Indonesian. The annotation system is released under an open source license (MIT).

The Open Cantonese Sense-Tagged Corpus

Proceedings of the 12th Global Wordnet Conference, 2023

This paper introduces the Open Cantonese Sense-Tagged Corpus, a new and ongoing project to serve as the companion to the development of the Cantonese Wordnet. This corpus is built on top of the Cantonese Wordnet Corpus, which currently provides example sentences for most verbs in this wordnet. This paper motivates the choice of starting a sense-tagged corpus from both linguistic and educational perspectives, and discusses the current solutions to issues arising from the sense-tagging exercise. In total, we have tagged over 5,000 concepts, with more than 3,700 direct links to the Cantonese Wordnet.

Research paper thumbnail of Using rich models of language in grammatical error detection

Research paper thumbnail of Enchancing the Collaborative Interlingual Index for Digital Humanities: Cross-linguistic Analysis in the Domain of Theology

We aim to support digital humanities work related to the study of sacred texts. To do this, we pr... more We aim to support digital humanities work related to the study of sacred texts. To do this, we propose to build a cross-lingual wordnet within the do-main of theology. We target the Collaborative Interlingual Index (CILI) directly instead of each individual wordnet. The paper presents background for this proposal: (1) an overview of concepts relevant to theology and (2) a summary of the domain-associated issues observed in the Princeton WordNet (PWN). We have found that definitions for concepts in this domain can be too restrictive, inconsistent, and unclear. Necessary synsets are missing, with the PWN being skewed towards Christianity. We argue that tackling problems in a single domain is a better method for improving CILI. By focusing on a single topic rather than a single language, this will result in the proper construction of definitions, romanization/translation of lemmas, and also improvements in use of/creation of a cross-lingual domain hierarchy.

Research paper thumbnail of The Making of Coptic Wordnet

With the increasing availability of wordnets for ancient languages, such as Ancient Greek and Lat... more With the increasing availability of wordnets for ancient languages, such as Ancient Greek and Latin, gaps remain in the coverage of less studied languages of antiquity. This paper reports on the construction and evaluation of a new wordnet for Coptic, the language of Late Roman, Byzantine and Early Islamic Egypt in the first millenium CE. We present our approach to constructing the wordnet which uses multilingual Coptic dictionaries and wordnets for five different languages. We further discuss the results of this effort and outline our on-going/future work.

Research paper thumbnail of Grammatical error detection using HPSG grammars: Diagnosing common Mandarin Chinese grammatical errors

Proceedings of the International Conference on Head-Driven Phrase Structure Grammar

Computational Grammars can be adapted to detect ungrammatical sentences, effectively transforming... more Computational Grammars can be adapted to detect ungrammatical sentences, effectively transforming them into error detection (or correction) systems. In this paper we provide a theoretical account of how to adapt implemented HPSG grammars for grammatical error detection. We discuss how a single ungrammatical input can be reconstructed in multiple ways and, in turn, be used to provide specific, high-quality feedback to language learners. We then move on to exemplify this with a few of the most common error classes made by learners of Mandarin Chinese. We conclude with some notes concerning the adaptation and implementation of the methods described here in ZHONG, an open-source HPSG grammar for Mandarin Chinese.

Research paper thumbnail of The GlobalWordNet Formats: Updates for 2020

The Global Wordnet Formats have been introduced to enable wordnets to have a common representatio... more The Global Wordnet Formats have been introduced to enable wordnets to have a common representation that can be integrated through the Global WordNet Grid. As a result of their adoption, a number of shortcomings of the format were identified, and in this paper we describe the extensions to the formats that address these issues. These include: ordering of senses, dependencies between wordnets, pronunciation, syntactic modelling, relations, sense keys, metadata and RDF support. Furthermore, we provide some perspectives on how these changes help in the integration of wordnets.

Research paper thumbnail of Some Issues with Building a Multilingual Wordnet

In this paper we discuss the experience of bringing together over 40 different wordnets. We intro... more In this paper we discuss the experience of bringing together over 40 different wordnets. We introduce some extensions to the GWA wordnet LMF format proposed in Vossen et al. (2016) and look at how this new information can be displayed. Notable extensions include: confidence, corpus frequency, orthographic variants, lexicalized and non-lexicalized synsets and lemmas, new parts of speech, and more. Many of these extensions already exist in multiple wordnets – the challenge was to find a compatible representation. To this end, we introduce a new version of the Open Multilingual Wordnet (Bond and Foster, 2013), that integrates a new set of tools that tests the extensions introduced by this new format, while also ensuring the integrity of the Collaborative Interlingual Index (CILI: Bond et al., 2016), avoiding the same new concept to be introduced through multiple projects.

Research paper thumbnail of Wow! What a Useful Extension! Introducing Non-Referential Concepts to Wordnet

In this paper we present the ongoing efforts to expand the depth and breath of the Open Multiling... more In this paper we present the ongoing efforts to expand the depth and breath of the Open Multilingual Wordnet coverage by introducing two new classes of non-referential concepts to wordnet hierarchies: interjections and numeral classifiers. The lexical semantic hierarchy pioneered by Princeton Wordnet has traditionally restricted its coverage to referential and contentful classes of words: such as nouns, verbs, adjectives and adverbs. Previous efforts have been employed to enrich wordnet resources including, for example, the inclusion of pronouns, determiners and quantifiers within their hierarchies. Following similar efforts, and motivated by the ongoing semantic annotation of the NTU-Multilingual Corpus, we decided that the four traditional classes of words present in wordnets were too restrictive. Though non-referential, interjections and classifiers possess interesting semantics features that can be well captured by lexical resources like wordnets. In this paper, we will further ...

Research paper thumbnail of Building the Cantonese Wordnet

This paper reports on the development of the Cantonese Wordnet, a new wordnet project based on Ho... more This paper reports on the development of the Cantonese Wordnet, a new wordnet project based on Hong Kong Cantonese. It is built using the expansion approach, leveraging on the existing Chinese Open Wordnet, and the Princeton Wordnet’s semantic hierarchy. The main goal of our project was to produce a high quality, human-curated resource – and this paper reports on the initial efforts and steady progress of our building method. It is our belief that the lexical data made available by this wordnet, including Jyutping romanization, will be useful for a variety of future uses, including many language processing tasks and linguistic research on Cantonese and its interactions with other Chinese dialects.

Research paper thumbnail of Syntactic Well-Formedness Diagnosis and Error-Based Coaching in Computer Assisted Language Learning using Machine Translation

We present a novel approach to Computer Assisted Language Learning (CALL), using deep syntactic p... more We present a novel approach to Computer Assisted Language Learning (CALL), using deep syntactic parsers and semantic based machine translation (MT) in diagnosing and providing explicit feedback on language learners’ errors. We are currently developing a proof of concept system showing how semantic-based machine translation can, in conjunction with robust computational grammars, be used to interact with students, better understand their language errors, and help students correct their grammar through a series of useful feedback messages and guided language drills. Ultimately, we aim to prove the viability of a new integrated rule-based MT approach to disambiguate students’ intended meaning in a CALL system. This is a necessary step to provide accurate coaching on how to correct ungrammatical input, and it will allow us to overcome a current bottleneck in the field — an exponential burst of ambiguity caused by ambiguous lexical items (Flickinger, 2010). From the users’ interaction wit...

Research paper thumbnail of CALLIG: Computer Assisted Language Learning using Improvisation Games

In this paper, we present the ongoing development of CALLIG – a web system that uses improvisatio... more In this paper, we present the ongoing development of CALLIG – a web system that uses improvisation games in Computer Assisted Language Learning (CALL). Improvisation games are structured activities with built-in constraints where improvisers are asked to generate a lot of different ideas and weave a diverse range of elements into a sensible narrative spontaneously. This paper discusses how computer-based language games can be created combining improvisation elements and language technology. In contrast with traditional language exercises, improvisational language games are open and unpredictable. CALLIG encourages spontaneity and witty language use. It also provides opportunities for collecting useful data for many NLP applications.

Research paper thumbnail of NTUCLE: Developing a Corpus of Learner English to Provide Writing Support for Engineering Students

This paper describes the creation of a new annotated learner corpus. The aim is to use this corpu... more This paper describes the creation of a new annotated learner corpus. The aim is to use this corpus to develop an automated system for corrective feedback on students’ writing. With this system, students will be able to receive timely feedback on language errors before they submit their assignments for grading. A corpus of assignments submitted by first year engineering students was compiled, and a new error tag set for the NTU Corpus of Learner English (NTUCLE) was developed based on that of the NUS Corpus of Learner English (NUCLE), as well as marking rubrics used at NTU. After a description of the corpus, error tag set and annotation process, the paper presents the results of the annotation exercise as well as follow up actions. The final error tag set, which is significantly larger than that for the NUCLE error categories, is then presented before a brief conclusion summarising our experience and future plans.

Research paper thumbnail of Automated Writing Support Using Deep Linguistic Parsers

This paper introduces a new web system that integrates English Grammatical Error Detection (GED) ... more This paper introduces a new web system that integrates English Grammatical Error Detection (GED) and course-specific stylistic guidelines to automatically review and provide feedback on student assignments. The system is being developed as a pedagogical tool for English Scientific Writing. It uses both general NLP methods and high precision parsers to check student assignments before they are submitted for grading. Instead of generalized error detection, our system aims to identify, with high precision, specific classes of problems that are known to be common among engineering students. Rather than correct the errors, our system generates constructive feedback to help students identify and correct them on their own. A preliminary evaluation of the system’s in-class performance has shown measurable improvements in the quality of student assignments.

Research paper thumbnail of Linking the TUFS Basic Vocabulary to the Open Multilingual Wordnet

We describe the linking of the TUFS Basic Vocabulary Modules, created for online language learnin... more We describe the linking of the TUFS Basic Vocabulary Modules, created for online language learning, with the Open Multilingual Wordnet. The TUFS modules have roughly 500 lexical entries in 30 languages, each with the lemma, a link across the languages, an example sentence, usage notes and sound files. The Open Multilingual Wordnet has 34 languages (11 shared with TUFS) organized into synsets linked by semantic relations, with examples and definitions for some languages. The links can be used to (i) evaluate existing wordnets, (ii) add data to these wordnets and (iii) create new open wordnets for Khmer, Korean, Lao, Mongolian, Russian, Tagalog, Urdua nd Vietnamese

Research paper thumbnail of Basic copula clauses in Indonesian

Proceedings of the International Conference on Head-Driven Phrase Structure Grammar, 2016

We want to show how basic copula clauses in Indonesian can be dealt with within the framework of ... more We want to show how basic copula clauses in Indonesian can be dealt with within the framework of Head Driven Phrase Structure Grammar (HPSG) (Pollard & Sag, 1994). We analyzed three types of basic copula clauses in Indonesian: copula clauses with noun phrase complements (NP) expressing the notions of 'proper inclusion' and 'equation', adjective phrases (AP) expressing 'attribution', and prepositional phrases (PP) expressing relationships such as 'location'. Our analysis is implemented in the Indonesian Resource Grammar (INDRA), a computational grammar for Indonesian (Moeljadi et al., 2015).

Research paper thumbnail of Teaching Through Tagging — Interactive Lexical Semantics

In this paper we discuss an ongoing effort to enrich students’ learning by involving them in sens... more In this paper we discuss an ongoing effort to enrich students’ learning by involving them in sense tagging. The main goal is to lead students to discover how we can represent meaning and where the limits of our current theories lie. A subsidiary goal is to create sense tagged corpora and an accompanying linked lexicon (in our case wordnets). We present the results of tagging several texts and suggest some ways in which the tagging process could be improved. Two authors of this paper present their own experience as students. Overall, students reported that they found the tagging an enriching experience. The annotated corpora and changes to the wordnet are made available through the NTU multilingual corpus and associated wordnets (NTU-MC).

Mapping and Generating Classifiers using an Open Chinese Ontology

In languages such as Chinese, classifiers (CLs) play a central role in the quantification of noun phrases. This can be a problem when generating text from input that does not specify the classifier, as in machine translation (MT) from English to Chinese. Many solutions to this problem rely on dictionaries of noun-CL pairs. However, there is no open large-scale machine-tractable dictionary of noun-CL associations. Many published resources exist, but they tend to focus on how a CL is used (e.g. what kinds of nouns can be used with it, or what features seem to be selected by each CL). In fact, since nouns are open-class words, producing an exhaustive, definitive list of noun-CL associations is not possible: it would quickly go out of date. Our work tries to address this problem by providing an algorithm for automatically building a frequency-based dictionary of noun-CL pairs, mapped to concepts in the Chinese Open Wordnet (Wang and Bond, 2013), an open machine-tractable dictionary f...

OMWEdit - The Integrated Open Multilingual Wordnet Editing System

Proceedings of ACL-IJCNLP 2015 System Demonstrations, 2015

Wordnets play a central role in many natural language processing tasks. This paper introduces a multilingual editing system for the Open Multilingual Wordnet (OMW: Bond and Foster, 2013). Wordnet development, like most lexicographic tasks, is slow and expensive. Moving away from the original Princeton Wordnet (Fellbaum, 1998) development workflow, wordnet creation and expansion have increasingly shifted towards automated and/or interactive, system-facilitated tasks. In the particular case of human editing/expansion of wordnets, a few systems have been developed to aid the lexicographers' work. Unfortunately, most of these tools either have restricted licenses or have been designed with a particular language in mind. We present a web-based system that is capable of multilingual browsing and editing for any of the hundreds of languages made available by the OMW. All tools and guidelines are freely available under an open license.

IMI - A Multilingual Semantic Annotation Environment

Proceedings of ACL-IJCNLP 2015 System Demonstrations, 2015

Semantically annotated parallel corpora, though rare, play an increasingly important role in natural language processing. These corpora provide valuable data not only for computational tasks like sense-based machine translation and word sense disambiguation, but also for contrastive linguistics and translation studies. In this paper we present the ongoing development of a web-based corpus semantic annotation environment that uses the Open Multilingual Wordnet (Bond and Foster, 2013) as a sense inventory. The system includes interfaces to help coordinate the annotation project and a corpus browsing interface designed specifically to meet the needs of a semantically annotated corpus. The tool was designed to build the NTU-Multilingual Corpus (Tan and Bond, 2012). For the past six years, our tools have been tested and developed in parallel with the semantic annotation of a portion of this corpus in Chinese, English, Japanese and Indonesian. The annotation system is released under an open source license (MIT).

The Open Cantonese Sense-Tagged Corpus

Proceedings of the 12th Global Wordnet Conference, 2023

This paper introduces the Open Cantonese Sense-Tagged Corpus, a new and ongoing project to serve as the companion to the development of the Cantonese Wordnet. This corpus is built on top of the Cantonese Wordnet Corpus, which currently provides example sentences for most verbs in this wordnet. This paper motivates the choice of starting a sense-tagged corpus from both linguistic and educational perspectives, and discusses the current solutions to issues arising from the sense-tagging exercise. In total, we have tagged over 5,000 concepts, with more than 3,700 direct links to the Cantonese Wordnet.