One format to rule them all -- The emtsv pipeline for Hungarian
Related papers
E-magyar -- A Digital Language Processing System
e-magyar is a new toolset for the analysis of Hungarian texts. It was produced as a collaborative effort of the Hungarian language technology community, integrating the best state-of-the-art tools, enhancing them where necessary, making them interoperable, and releasing them with a clear license. It is a free, open, modular text processing pipeline, integrated into the GATE system, which offers further prospects of interoperability. From tokenization to parsing and named entity recognition, existing tools were examined, and those selected for integration underwent varying amounts of overhaul so that they operate in the pipeline with a uniform encoding and run on the same Java platform. The tokenizer was rebuilt from the ground up, and the flagship module, the morphological analyzer based on the Humor system, was given a new annotation scheme and was reimplemented in the HFST framework. The system is aimed at a broad range of users, from language technology application developers to digital humanities researchers. It comes with a drag-and-drop demo on its website: http://e-magyar.hu/en/
Report on the Hungarian Language
Project Deliverable, 2022
In the framework of the European Language Equality (ELE) project, the present paper gives a qualitative overview of the current situation of Hungarian Natural Language Processing (NLP). The project’s main objectives are to provide a comprehensive picture of the Hungarian NLP scene by compiling a roadmap of existing language technology tools and datasets for Hungarian, to identify the major gaps in present-day national language technologies in the EU as of September 2021, and to determine the essential directions for research and technology. This is part of a joint pan-European effort that will shape the field of language technology (LT) in Europe for the next 10-15 years, including prospective funding. The large-scale language technology data collection process has aimed at cataloguing, to the largest possible extent, all corpora, lexical and conceptual resources, tools, grammars, and language models available for the Hungarian language as of September 2021. The data collection was carried out by the Hungarian Research Centre for Linguistics as part of the ELE project. Altogether we collected 344 datasets and 180 tools and language models. We hope that our results may be of use to the Hungarian NLP community. The detailed database of the 500+ language technology resources we identified is available to stakeholders online on the ELG website. This work, together with that of other ELE partner institutions covering over 30 languages in European countries, serves as the basis for a comprehensive proposal and a roadmap for achieving digital language equality in Europe by 2030. So far, there has been only one study of a similar scope for Hungarian LT: in 2012, the META-NET network and its partner institutions compiled a comprehensive survey of their languages in terms of LT support and published their findings in a series of White Papers. The present paper is a summary of the new survey, which can be considered an update of the book The Hungarian Language in the Digital Age (Simon et al., 2012), published in the META-NET White Papers series. In the nearly ten years since the publication of Simon and colleagues’ work, LT as a field has undergone revolutionary innovations as statistical methods have been abandoned in favour of neural networks. As a result, LT has found its way into our everyday life; we wish to capture these changes as well.
NORMO: An Automatic Normalization Tool for Middle Hungarian
The paper presents NORMO, an automatic normalization tool for Middle Hungarian texts with a memory-based and a rule-based module, the latter consisting of character- and token-level rewrite rules. The automatically normalized text eases and shortens the manual normalization work and yields an output that can be fed to further NLP tools. After presenting the modules of NORMO, we provide a thorough evaluation of the modules and of the entire system, and we compare its performance to that of similar tools.
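The abstract does not spell out the rule formalism, but the two-module design (a memory-based lookup backed by rewrite rules) can be illustrated with a minimal sketch. The lexicon entries, the character rules, and the sample sentence below are invented for illustration and are not NORMO's actual resources.

```python
# Minimal sketch of a two-stage normalizer in the spirit of the design
# described above: a memory-based lookup is tried first, then character-level
# rewrite rules are applied. All entries and rules below are invented examples.

MEMORY = {
    "vólt": "volt",          # hypothetical previously normalized form
}

CHAR_RULES = [
    ("cz", "c"),             # hypothetical orthographic rewrites
    ("eö", "ö"),
]

def normalize_token(token: str) -> str:
    """Exact memory hit wins; otherwise apply the character-level rules."""
    if token in MEMORY:
        return MEMORY[token]
    for old, new in CHAR_RULES:
        token = token.replace(old, new)
    return token

def normalize(text: str) -> str:
    # Token-level pass over whitespace-split input; a real system would use
    # proper tokenization and token-level rules as well.
    return " ".join(normalize_token(tok) for tok in text.split())

print(normalize("az vólt az eö czélja"))  # -> "az volt az ö célja"
```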
Introducing NYTK-NerKor, a Gold Standard Hungarian Named Entity Annotated Corpus
TSD2021, 2021
Here we present NYTK-NerKor, a gold standard Hungarian named entity annotated corpus containing 1 million tokens. It is the largest corpus of its kind. It contains a balanced text selection from five genres: fiction, legal, news, web, and Wikipedia. A ca. 200,000-token subcorpus contains gold standard morphological annotation besides the NE labels. We provide official train, development and test sets in a proportion of 80%-10%-10%. All sets provide a balanced selection from all genres and sources, while the morphologically annotated subcorpus is also represented in all sets in a balanced way. The format of the data files is CoNLL-U Plus, in which the NE annotation follows the CoNLL-2002 labelling standard, while morphological information is encoded using the well-known Universal Dependencies POS tags and morphosyntactic features. The novelty of NYTK-NerKor as opposed to similar existing corpora is that it is larger by an order of magnitude, freely available for any purpose, contains text material from different genres and sources, and follows international standards in its format and tagset. The corpus is available under the CC-BY-SA 4.0 license.
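Since the abstract names a concrete file format (CoNLL-U Plus with CoNLL-2002 NE labels and UD morphology), a small reading sketch may help. The column name CONLL:NER and the sample rows are assumptions for illustration; the authoritative column order is whatever the corpus itself declares in its # global.columns header.

```python
# Minimal sketch of reading a CoNLL-U Plus file that carries an extra NE
# column alongside UD annotation. Column names and sample rows are assumed
# for illustration; the real layout is declared by the file's own
# "# global.columns" line.

SAMPLE = (
    "# global.columns = FORM LEMMA UPOS FEATS CONLL:NER\n"
    "Budapest\tBudapest\tPROPN\tCase=Nom\tB-LOC\n"
    "szép\tszép\tADJ\tDegree=Pos\tO\n"
    ".\t.\tPUNCT\t_\tO\n"
)

def read_conllup(lines):
    """Yield (form, upos, ne_label) triples using the declared column order."""
    columns = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("# global.columns"):
            columns = line.split("=", 1)[1].split()
        elif line and not line.startswith("#"):
            row = dict(zip(columns, line.split("\t")))
            yield row["FORM"], row["UPOS"], row["CONLL:NER"]

for form, upos, ne in read_conllup(SAMPLE.splitlines()):
    print(form, upos, ne)   # e.g. "Budapest PROPN B-LOC"
```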
Easily accessible language technologies for Slovene, Croatian and Serbian
2016
In this paper we present a pipeline of recently developed language technology tools for Slovene, Croatian and Serbian. They currently cover text segmentation, text normalisation, part-of-speech tagging, lemmatisation and inflectional lexicon lookup. Most rely on machine learning approaches, such as statistical machine translation and conditional random fields, capable of producing high-quality models for the phenomenon covered. Special emphasis is put on easy accessibility of these tools by offering them and the trained models for all three languages (1) as open source via public git repositories and (2) online in the form of web applications and web services.
PolEval 2019 — the next chapter in evaluating Natural Language Processing tools for Polish
2019
PolEval is a SemEval-inspired evaluation campaign for natural language processing tools for Polish. Submitted tools compete against one another within tasks selected by the organizers, using available data, and are evaluated according to pre-established procedures. The campaign has been organized since 2017, and each year the winning systems become the state of the art in Polish language processing for the respective tasks. In 2019 we organized six different tasks, creating an even greater opportunity for NLP researchers to evaluate their systems in an objective manner.