DMoG: A Data-Based Morphological Guesser

Sign language lexicography: a case study of an online dictionary

Slovenščina 2.0: empirical, applied and interdisciplinary research, 2021

As a growing field of study within sign language linguistics, sign language lexicography faces many challenges that have already been answered for audio-oral language material. In this paper, we present some of these challenges and methods developed to help navigate the complex lexical classification field. The described methods and strategies are implemented in the first Czech sign language (ČZJ) online dictionary, a part of the platform Dictio, developed at Masaryk University in Brno. We cover the topic of lemmatisation and how to decide what constitutes a lexeme in sign language. We introduce four types of expressions that qualify for a dictionary entry: a simple lexeme, a compound, a derivative, and a set phrase. We address the question of the place of classifier constructions and shape and size specifiers in a dictionary, given their peculiar semantic status. We maintain the standard classification of classifiers (whole entity and holding classifiers) and size and shape specifi...

Prototype machine translation system from text-to-Indian sign language

Proceedings of the 13th international conference on Intelligent user interfaces - IUI '08, 2008

We know that the distributions of most linguistic entities (e.g. phones, words, grammar rules) follow a power law, or Zipf's law. This makes NLP hard. Interestingly, the distributions of speakers over the world, content over the web and linguistic resources available across languages also follow a power law. However, the correlation between the distribution of the number of speakers and that of web content and linguistic resources is rather poor, and the latter distributions are much more skewed than the former. In other words, there is a large volume of resources for only a very few languages, while a large number of widely spoken languages, including all the Indian languages, have few or no linguistic resources at all. This is a serious challenge for NLP in these languages, primarily because state-of-the-art techniques and tools in NLP are all data-driven. I refer to this situation as the "Zipfian Barrier of NLP" and offer a mathematical analysis of the growth dynamics of linguistic resources and NLP research worldwide, which, after all, is very much a socioeconomic process. Based on the analysis and otherwise, I propose certain technical (e.g. unsupervised learning, wiki-based approaches to gathering data) and community-wide (e.g. acceptance of language-specific works and resource-building projects in top NLP conferences/journals, Special Interest Groups) initiatives that could possibly break this Zipfian Barrier.

Multimodal Corpus Lexicography: Compiling a Corpus-based Bilingual Modern Greek—Greek Sign Language Dictionary

2018

This paper describes the process of compiling NOEMA+, a bilingual dictionary of approximately 12,000 entries for the pair Greek Sign Language (GSL)-Modern Greek (MG) and of making it available openly online (http://sign.ilsp.gr/signilsp-site/index.php/el/noima/). The dictionary was based on several corpora that have been collected over the years, including information on compounding, GSL synonyms, classifiers, various lemma-related senses, semantic relationships, etc. These different corpora have been joined, normalized and translated into MG to form a parallel corpus of the language pair in question. In turn, this parallel corpus acted as the basis for the compilation of the bilingual dictionary described in this paper. More specifically, among the issues to be discussed here are lemma identification, which proved far from intuitive for this particular language pair, lemma categorization, dictionary contents and structure, relations between entries as well as the corpus which was used for dictionary compilation. Finally, there will be a description of the different search choices offered, which cater for different user profiles and needs.

Search-By-Example in Multilingual Sign Language Databases

2011

We describe a prototype Search-by-Example or look-up tool for signs, based on a newly developed 1000-concept sign lexicon for four national sign languages (GSL, DGS, LSF, BSL), which includes a spoken language gloss, a HamNoSys description, and a video for each sign. The look-up tool combines an interactive sign recognition system, supported by Kinect™ technology, with a real-time sign synthesis system, using a virtual human signer, to present results to the user.

Support of Arabic Sign Language Machine Translation based on Morphological processing

International Journal of Computer Applications

This paper presents a morphological processing system as part of an Arabic text to Arabic sign language machine translation system. The morphological processing depends on the Farasa analyzer tool, the Stanford model and the Arramooz lexicon. The characteristics of sign language are applied to produce intermediate Arabic sign language sentences. These sentences are then looked up word by word in a sign language dictionary to display the related sign images if available, or otherwise to display the letters of the word using fingerspelling alphabet images. The proposed system is tested on many non-vowelized Arabic sentences, and good results and high accuracy are obtained.
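
The word-by-word lookup with a fingerspelling fallback described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the lexicon contents, the image paths, the use of transliterated ASCII words, and the names `SIGN_LEXICON`, `fingerspell` and `lookup_sentence` are all hypothetical, and the morphological processing step (Farasa, Stanford model, Arramooz) is assumed to have already produced the intermediate sentence.

```python
from typing import Dict, List

# Hypothetical toy lexicon mapping a (transliterated) word to a sign image path.
# Real entries and file layout are not given in the abstract.
SIGN_LEXICON: Dict[str, str] = {
    "bayt": "signs/bayt.png",
    "kitab": "signs/kitab.png",
}


def fingerspell(word: str) -> List[str]:
    """Return one (hypothetical) alphabet image path per character of the word."""
    return [f"letters/{ch}.png" for ch in word]


def lookup_sentence(sentence: str, lexicon: Dict[str, str]) -> List[List[str]]:
    """Map each whitespace-separated word to its sign image if the lexicon
    has one, otherwise to the images of its fingerspelled letters."""
    result: List[List[str]] = []
    for word in sentence.split():
        if word in lexicon:
            result.append([lexicon[word]])      # a sign exists for this word
        else:
            result.append(fingerspell(word))    # fall back to fingerspelling
    return result


if __name__ == "__main__":
    # "qalam" is deliberately absent from the toy lexicon, so it is fingerspelled.
    for images in lookup_sentence("bayt qalam", SIGN_LEXICON):
        print(images)
```

Each word yields its own list of images, so a display layer can tell a single sign apart from a fingerspelled sequence.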

Creating Corpora of Finland's Sign Languages

This paper discusses the process of creating corpora of the sign languages used in Finland, Finnish Sign Language (FinSL) and Finland-Swedish Sign Language (FinSSL). It describes the process of getting informants and data, editing and storing the data, the general principles of annotation, and the creation of a web-based lexical database, the FinSL Signbank, developed on the basis of the NGT Signbank, which is a branch of the Auslan Signbank. The corpus project of Finland's Sign Languages (CFINSL) started in 2014 at the Sign Language Centre of the University of Jyväskylä. Its aim is to collect conversations and narrations from 80 FinSL users and 20 FinSSL users who are living in different parts of Finland. The participants are filmed in signing sessions led by a native signer in the Audiovisual Research Centre at the University of Jyväskylä. The edited material is stored in the storage service provided by the CSC – IT Center for Science, and the metadata will be saved into CMDI metadata. Every informant is asked to sign a consent form where they state for what kinds of purposes their signing can be used. The corpus data are annotated using the ELAN tool. At the moment, annotations are created on the levels of glosses and translation.

A Virtual Character based Italian Sign Language Dictionary

Collection of …

This paper presents a novel Italian text to Italian Sign Language Dictionary that displays word translation by means of a virtual character. The Dictionary is linked with MultiWordNet, a lexical and semantic database which includes several languages. The objective is to use it as a learning tool for Deaf people to enhance the learning of written languages.

Making an Online Dictionary of New Zealand Sign Language

Lexikos, 2013

The Online Dictionary of New Zealand Sign Language (ODNZSL), launched in 2011, is an example of a contemporary sign language dictionary that leverages the 21st-century advantages of a digital medium and an existing body of descriptive research on the language, including a small electronic corpus of New Zealand Sign Language. Innovations in recent online dictionaries of other signed languages informed development of this bilingual, bi-directional, multimedia dictionary. Video content and search capacities in an online medium are a huge advance in more directly representing a signed lexicon and enabling users to access content in versatile ways, yet do not resolve all of the theoretical challenges that face sign language dictionary makers. Considerations in the editing and production of the ODNZSL are discussed in this article, including issues of determining lexemes and word class in a polysynthetic language, deriving usage examples from a small corpus, and dealing with sociolinguistic variation in the selection and performance of content.