Synchronized Mediawiki based analyzer dictionary development

Proceedings of the Third Workshop on Computational Linguistics for Uralic Languages

2017

Open-source analyzer dictionary development is being implemented for Skolt Sami, Ingrian, Moksha-Mordvin, etc. in the Helsinki CSC infrastructure, home of the Finnish Kielipankki 'Language Bank' and Termipankki 'Term Bank'. The proximity of minority-language corpora in need of annotation and the multiple uses of controlled wikimedia-type dictionaries make CSC an attractive site for synchronized transducer dictionary development. The open-source FST development of Uralic and other minority languages at Giellatekno-Divvun in Tromsø demonstrates a vast potential for the reuse of FSTs, augmented further by open-source work in OmorFi, Apertium and Universal Dependencies (http://universaldependencies.org/#language-urj). The initial idea is to allow synchronized editing of Giellatekno XML and CSC Wiki structures via GitHub. In addition to allowing for simple lexc line exports of the form LEMMA:STEM CONTINUATION_LEXICON "TRANSLATION" ; the parallel dictionaries will provide for documentation of derivation, morpho-syntactic information on valency and government, semantics and etymology.

1 Introduction

Open-source finite-state transducer development and application as we know it today in the Giellatekno infrastructure at Tromsø, Norway, dates back to the early 1990s. It begins with the morphological description of different Sami languages, grammatical analysis and syntax. Morphological and morphosyntactic description lays the foundation for tool building, such as Divvun, and working solutions attract soft coding for the application of research and tools to other languages. Open-source compilers from HFST in Helsinki are gradually worked into the infrastructure after 2008. With the growth of the research community and tool building diversity comes the practicality of reusable resources, descriptions and testing formalisms. As things improve, the number of uses and users also increases.
One resource in particular is the four-fold combination of lemma, stem, POS/continuation lexicon and gloss, afforded by many of the language projects at Giellatekno in lexc code. Lexc code containing multiple nodes of information can be stored in XML files for XSL transformation and project-specific transducer construction. The need for multiple transducers presents itself when tagging strategies are not shared by the multiple projects of a given language. Although the two-level model may be sufficient for all projects, the needs for normative labeling, morphosyntactic information and semantic tags will vary for TTS, MT, ICALL, spellcheckers and other morphological analyzer projects. The Giellatekno-Divvun strategy has been to utilize one lexc and twolc in all North Sami projects, with filters for selecting the necessary code. For multiple and iterated use of resources, by contrast, a solution presents itself in XML-format analyzer dictionaries with multiple XSL transformations and the possibility to keep up with semantic wiki strategies being adopted in open wikimedia projects.

At CSC in Helsinki, Finland, a sanat-server has been put into operation and provides access, initially, to the Kielipankki (Finnish Language Bank) wordnet and Ludic dictionary development. The Ludic dictionary provides for Finnish-Ludic and Russian-Ludic documentation of the Ludic language in a Wikimedia environment. The Wikimedia environment is a sibling of what is used for facilitating Termipankki (the Finnish Term Bank); domains and subdomains can be established for administrating access and editing rights. This type of environment is desirable for synchronized editing strategies involving XML and PHP input. Unlike wiktionary and wikipedia
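
The four-fold combination of lemma, stem, continuation lexicon and gloss, and its export from XML to lexc lines, can be illustrated with a minimal Python sketch. The XML layout below (an `entry` element with `lemma`, `stem`, `contlex`, `gloss` and `projects` attributes) and the continuation lexicon names are assumptions made for illustration; they are not the Giellatekno schema, and the source describes XSL transformations rather than Python. Escaping of lexc special characters is omitted.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry layout: element and attribute names are assumptions,
# not the actual Giellatekno XML schema.
SAMPLE = """
<dictionary>
  <entry lemma="kala" stem="kal" contlex="N_KALA" gloss="fish" projects="mt spell"/>
  <entry lemma="talo" stem="talo" contlex="N_TALO" gloss="house" projects="mt"/>
</dictionary>
"""


def lexc_line(lemma, stem, contlex, gloss):
    """Format one entry as: LEMMA:STEM CONTINUATION_LEXICON "TRANSLATION" ;"""
    return f'{lemma}:{stem} {contlex} "{gloss}" ;'


def export_lexc(xml_text, project=None):
    """Return lexc lines for all entries, or only those tagged for `project`."""
    root = ET.fromstring(xml_text)
    lines = []
    for entry in root.iter("entry"):
        tagged = entry.get("projects", "").split()
        if project is not None and project not in tagged:
            continue  # entry is not used by the requested project
        lines.append(
            lexc_line(
                entry.get("lemma"),
                entry.get("stem"),
                entry.get("contlex"),
                entry.get("gloss"),
            )
        )
    return lines


if __name__ == "__main__":
    for line in export_lexc(SAMPLE, project="spell"):
        print(line)
```

Keeping the project filter in the export step, rather than in the stored data, is one way a single XML source could feed several project-specific transducers.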

Advances in synchronized XML-MediaWiki dictionary development in the context of endangered Uralic languages

Proceedings of the XVIII EURALEX International Congress: Lexicography in Global Contexts, 2018

We present our ongoing development of a synchronized XML-MediaWiki dictionary to solve the problem of XML dictionaries in the context of small Uralic languages. XML is good at representing structured data, but it does not fare well when multiple users are editing the dictionary simultaneously. Furthermore, XML is overly complicated for non-technical users because its strict syntax must remain valid at all times. Our system solves these problems by allowing the same dictionary data to be edited in synchrony both in a MediaWiki environment and in XML files. In addition, we describe how the dictionary knowledge in the MediaWiki-based dictionary can be enhanced by an additional Semantic MediaWiki layer for more effective searches in the data. We also present API access to the lexical information in the dictionary and to morphological tools in the form of an open-source Python library.

THE LIVONIAN-ESTONIAN-LATVIAN DICTIONARY AS A THRESHOLD TO THE ERA OF LANGUAGE TECHNOLOGICAL APPLICATIONS

This article outlines the multiple use of electronic source materials from the Livonian-Estonian-Latvian Dictionary of 2012 in a “Kone Foundation” funded project for developing finite-state morphological parsers. It provides an introduction to the project, the language-independent Giellatekno infrastructure at Tromsø, Norway, and the materials utilized in the electronic manuscript of the dictionary. The introduction is followed by an extensive description of what has been developed on the Giellatekno infrastructure with explicit indications of where parallel projects might be initiated.

ISLEX – a Multilingual Web Dictionary

2014

ISLEX is a multilingual Scandinavian dictionary, with Icelandic as a source language and Danish, Norwegian, Swedish, Faroese and Finnish as target languages. ISLEX in fact contains several independent, bilingual dictionaries. While Faroese and Finnish are still under construction, the other languages were opened to the public on the web in November 2011. The use of the dictionary is free of charge and it has been extremely well received by its users. The result of the project is threefold. Firstly, some long-awaited Icelandic-Scandinavian dictionaries have been published on the digital medium. Secondly, the project has been an important experience in Nordic language collaboration by jointly building such a work in six countries simultaneously, by academic institutions in Iceland, Denmark, Norway, Sweden, the Faroe Islands and Finland. Thirdly, the work has resulted in a compilation of structured linguistic data of the Nordic languages. This data is suitable for use in further lexicographic work and in various language technology projects.

On XML-MediaWiki Resources, Endangered Languages and TEI Compatibility, Multilingual Dictionaries For Endangered Languages

AsiaLex 2019 : Proceedings of the 13th Conference of the Asian Association for Lexicography, 2019

In this paper, we identify the need for a standardized formalism for the structured XML dictionaries of endangered Uralic languages in the Giella infrastructure. For this purpose, we have decided to use the TEI formalism, as it is a standardized way of representing data and is commonly used in the field of lexicography. This paper focuses on describing the issues and challenges faced in the conversion of the Giella XML into TEI. A full conversion scheme is introduced, contrasting the peculiarities of the two XML formalisms. We incorporate the new TEI-based XML structure into our existing online dictionary system as an output format.
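
To give a rough idea of what such a conversion involves, the following Python sketch maps one simplified entry onto TEI dictionary elements. It is not the conversion scheme of the paper: the input layout (`e`, `lg`, `l`, `mg`, `tg`, `t`) is assumed here as a simplified Giella-style entry, and the choice of TEI elements (`entry`, `form`, `orth`, `gramGrp`, `pos`, `sense`, `cit`, `quote`) is only one plausible mapping.

```python
import xml.etree.ElementTree as ET

# Assumed, simplified Giella-style entry; the real schema carries more detail.
GIELLA_ENTRY = """
<e>
  <lg><l pos="N">kala</l></lg>
  <mg><tg xml:lang="eng"><t>fish</t></tg></mg>
</e>
"""

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def giella_to_tei(e):
    """Map one entry element onto a TEI-style <entry> (illustrative mapping)."""
    lemma = e.find("lg/l")
    entry = ET.Element("entry")

    form = ET.SubElement(entry, "form", {"type": "lemma"})
    ET.SubElement(form, "orth").text = lemma.text

    gram = ET.SubElement(entry, "gramGrp")
    ET.SubElement(gram, "pos").text = lemma.get("pos")

    # Each meaning group becomes a sense; each translation becomes a citation.
    for mg in e.findall("mg"):
        sense = ET.SubElement(entry, "sense")
        for tg in mg.findall("tg"):
            lang = tg.get(XML_LANG)
            for t in tg.findall("t"):
                cit = ET.SubElement(sense, "cit", {"type": "translation"})
                if lang is not None:
                    cit.set(XML_LANG, lang)
                ET.SubElement(cit, "quote").text = t.text
    return entry


if __name__ == "__main__":
    tei = giella_to_tei(ET.fromstring(GIELLA_ENTRY))
    print(ET.tostring(tei, encoding="unicode"))
```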

On Editing Dictionaries for Uralic Languages in an Online Environment

Proceedings of the Sixth International Workshop on Computational Linguistics of Uralic Languages, 2020

We present an open online infrastructure for editing and visualization of dictionaries of different Uralic languages (e.g. Erzya, Moksha, Skolt Sami and Komi-Zyrian). Our infrastructure integrates fully into the existing Giellatekno one in terms of XML dictionaries and FST morphology. Our code is open source, and the system is being actively used in editing a Skolt Sami dictionary set to be published in 2020.

Automatic creation of bilingual dictionaries for Finno-Ugric languages

Septentrio Conference Series, 2015

We introduce an ongoing project whose objective is to provide linguistically based support for several small Finno-Ugric digital communities in generating online content. To achieve our goals, we collect parallel, comparable and monolingual text material for the following Finno-Ugric (FU) languages: Komi-Zyrian and Permyak, Udmurt, Meadow and Hill Mari and Northern Sami, as well as for major languages that are of interest to the FU community: English, Russian, Finnish and Hungarian. Our goal is to generate proto-dictionaries for the mentioned language pairs and deploy the enriched lexical material on the web in the framework of the collaborative dictionary project Wiktionary. In addition, we will make all of the project’s products (corpora, models, dictionaries) freely available supporting further research.

ISLEX—An Icelandic-Scandinavian Multilingual Online Dictionary

2008

This paper presents ISLEX, an inter-Nordic project based in Reykjavik, Iceland, with partners in Gothenburg, Bergen and Copenhagen. The aim of the project is to develop an online dictionary site with Icelandic as the source language and the three Scandinavian languages (Swedish, Norwegian with its two official standards, and Danish) as the target languages. The dictionary is planned to contain 50,000 lemmas, with a development period of six years. In 2011, or possibly sooner, the site will be publicly available on the Internet, free of charge. In this article, the main features of the project are presented with particular emphasis on database design, editorial principles and priorities.

Building an open-source development infrastructure for language technology projects

This article presents a novel way of combining finite-state transducers (FSTs) with electronic dictionaries, thereby creating efficient reading comprehension dictionaries. We compare a North Saami - Norwegian and a South Saami - Norwegian dictionary, both enriched with an FST, with existing, available dictionaries containing pre-generated paradigms, and show the advantages of our approach. Being more flexible, the FSTs may also adjust the dictionary to different contexts. The finite-state transducer analyses the word to be looked up, and the dictionary itself conducts the actual lookup. The FST part is crucial for morphology-rich languages, where as few as 10% of the wordforms in running text are actually lemma forms. If a compound or derived word, or a word with an enclitic particle, is not found in the dictionary, the FST will give the stems and derivation affixes of the wordform, and each of the stems will be given a separate translation. In this way, the coverage of th...