Maria Mitrofan - Academia.edu
Papers by Maria Mitrofan
Universal Dependencies Consortium, Nov 15, 2020
Proceedings of the Biomedical NLP Workshop, Nov 10, 2017
RANLP 2017 - Recent Advances in Natural Language Processing Meet Deep Learning, Nov 10, 2017
Semantic web, Jun 5, 2023
arXiv (Cornell University), Jun 16, 2022
Proceedings of the 21st Workshop on Biomedical Language Processing
The paper presents the long-standing tradition of Romanian corpus acquisition and processing, which reaches its peak with the reference corpus of contemporary Romanian language (CoRoLa). The paper describes the decisions behind the kinds of texts collected, as well as the processing and annotation steps, highlighting the structure and importance of metadata to the corpus. The reader is also introduced to the three ways in which (s)he can explore the rich linguistic data of the corpus. Besides querying the corpus, word embeddings extracted from it are useful to various natural language processing applications and to linguists, when user-friendly interfaces offer them the possibility to exploit the data.
LegalNERo is a manually annotated corpus for named entity recognition in the Romanian legal domain. It provides gold annotations for organizations, locations, persons, time and legal resources mentioned in legal documents. Additionally, it offers GEONAMES codes for the named entities annotated as location (where a link could be established). The LegalNERo corpus is available in different formats: span-based, token-based and RDF. The Linguistic Linked Open Data (LLOD) version is provided in RDF-Turtle format. CONLLUP files conform to the CoNLL-U Plus format (https://universaldependencies.org/ext-format.html). Part-of-speech tagging was performed using UDPipe. Named entity annotations are placed in the column "RELATE:NE" (the 11th column), as defined in the "global.columns" metadata field. Similarly, GEONAMES references are in the column "RELATE:GEONAMES" (the 12th and last column). Automati...
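The column layout described above can be illustrated with a minimal Python sketch that reads a CoNLL-U Plus file by taking the column names from the "global.columns" metadata line. The function name `read_conllup`, the two sample tokens, the BIO-style entity tags, and the GEONAMES value `123456` are all assumptions made for illustration; they are not taken from the LegalNERo distribution.

```python
from typing import Dict, List

Token = Dict[str, str]


def read_conllup(text: str) -> List[List[Token]]:
    """Parse CoNLL-U Plus text into sentences of column-name -> value dicts."""
    columns: List[str] = []
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        if line.startswith("# global.columns"):
            # Column names are whitespace-separated after the "=" sign.
            columns = line.split("=", 1)[1].split()
        elif line.startswith("#"):
            continue  # other comment or metadata lines are ignored here
        elif not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
        else:
            if not columns:
                raise ValueError("token line seen before global.columns")
            current.append(dict(zip(columns, line.split("\t"))))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample: the tag scheme and the GEONAMES code are invented.
HEADER = ("# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL "
          "DEPS MISC RELATE:NE RELATE:GEONAMES")
ROWS = [
    ["1", "în", "în", "ADP", "_", "_", "2", "case", "_", "_", "O", "_"],
    ["2", "București", "București", "PROPN", "_", "_", "0", "root", "_", "_",
     "B-LOC", "123456"],
]
SAMPLE = "\n".join([HEADER] + ["\t".join(row) for row in ROWS]) + "\n"

sentences = read_conllup(SAMPLE)
for token in sentences[0]:
    print(token["FORM"], token["RELATE:NE"], token["RELATE:GEONAMES"])
```

Looking columns up by name rather than by position keeps the reader independent of how many extra columns a given CoNLL-U Plus file declares.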
The ROBIN Technical Acquisition Speech Corpus (ROBINTASC) was developed within the ROBIN project. Its main purpose was to improve the behaviour of a conversational agent, allowing human-machine interaction in the context of purchasing technical equipment. It contains over 6 hours of read speech in Romanian. We provide text files, associated speech files (WAV, 44.1 kHz, 16-bit, single channel), and annotated text files in CoNLL-U format.
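As a rough sketch of how the stated audio format (44.1 kHz, 16-bit, single channel) could be checked, the following Python snippet inspects a WAV header with the standard-library `wave` module. The helper names `wav_matches_expected` and `make_silence` are hypothetical, and the clips are synthetic silence generated in memory, not files from the corpus.

```python
import io
import wave


def wav_matches_expected(source) -> bool:
    """Return True if the WAV is 44.1 kHz, 16-bit (2 bytes/sample), mono."""
    with wave.open(source, "rb") as reader:
        return (
            reader.getframerate() == 44100
            and reader.getsampwidth() == 2
            and reader.getnchannels() == 1
        )


def make_silence(rate: int, seconds: float = 0.1) -> io.BytesIO:
    """Build an in-memory 16-bit mono WAV of silence (illustration only)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as writer:
        writer.setnchannels(1)
        writer.setsampwidth(2)
        writer.setframerate(rate)
        writer.writeframes(b"\x00\x00" * int(rate * seconds))
    buffer.seek(0)  # rewind so the buffer can be read back
    return buffer


ok_clip = wav_matches_expected(make_silence(44100))
bad_clip = wav_matches_expected(make_silence(16000))
print(ok_clip, bad_clip)
```

`wave.open` accepts either a file path or a file-like object, so the same check could be applied to files on disk.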
Proceedings of the Natural Legal Language Processing Workshop 2021, 2021
Studies in Informatics and Control, 2020
2019 22nd International Conference on Control Systems and Computer Science (CSCS), 2019
Within the larger project ROBIN, focused on the development of software and services for human-robot interaction, we present here a set of activities focused on the creation and enhancement of language resources necessary for making dialogue possible between humans and the robot Pepper. More precisely, we describe the preparatory activities for turning the robot Pepper into a dialogue partner for a human using the Romanian language. The language resources that have been created are a lexicon, a language model, and an acoustic language model. They have been enhanced by using other language resources, such as the Romanian wordnet and word embeddings extracted from the large Romanian reference corpus CoRoLa. They will ensure oral communication within some envisaged microworlds and are therefore specific to these microworlds. These resources have been carefully curated and evaluated.
This article presents the current outcomes of the MARCELL CEF Telecom project, which aims to collect and deeply annotate a large comparable corpus of legal documents. The MARCELL corpus includes 7 monolingual sub-corpora (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian) containing the total body of respective national legislative documents. These sub-corpora are automatically sentence-split, tokenized, lemmatized, and morphologically and syntactically annotated. The monolingual sub-corpora are complemented by a thematically related parallel corpus (Croatian-English). The metadata and the annotations are uniformly provided for each language-specific sub-corpus. Besides the standard morphosyntactic analysis plus named entity and dependency annotation, the corpus is enriched with the IATE and EUROVOC labels. The file format is CoNLL-U Plus, containing the ten columns specific to the CoNLL-U format and four extra columns specific to our corpora. The MARCELL corpo...