Maria Mitrofan - Academia.edu

Papers by Maria Mitrofan

Universal Dependencies 2.7

Universal Dependencies Consortium, Nov 15, 2020

Adapting the TTL Romanian POS Tagger to the Biomedical Domain

Proceedings of the Biomedical NLP Workshop, Nov 10, 2017

Bootstrapping a Romanian Corpus for Medical Named Entity Recognition

RANLP 2017 - Recent Advances in Natural Language Processing Meet Deep Learning, Nov 10, 2017

LegalNERo: A linked corpus for named entity recognition in the Romanian legal domain

Semantic web, Jun 5, 2023

Towards the Construction of a Gold Standard Biomedical Corpus for the Romanian Language

RACAI’s System at PharmaCoNER 2019

Assessing multiple word embeddings for named entity recognition of professions and occupations in health-related social media

Human-Machine Interaction Speech Corpus from the ROBIN project

An Open-Domain QA System for e-Governance

arXiv (Cornell University), Jun 16, 2022

Improving Romanian BioNER Using a Biologically Inspired System

Proceedings of the 21st Workshop on Biomedical Language Processing

Little strokes fell great oaks. Creating CoRoLa, the reference corpus of contemporary Romanian

The paper presents the long-standing tradition of Romanian corpus acquisition and processing, which reaches its peak with the reference corpus of contemporary Romanian language (CoRoLa). It describes the decisions behind the kinds of texts collected, as well as the processing and annotation steps, highlighting the structure and importance of the corpus metadata. The reader is also introduced to the three ways of exploring the rich linguistic data of the corpus. Besides querying the corpus, word embeddings extracted from it are useful for various natural language processing applications and, when user-friendly interfaces make the data accessible, for linguists.

Workshop on Deep Learning and Neural Approaches for Linguistic Data - Book of abstracts

Romanian Named Entity Recognition in the Legal domain (LegalNERo)

LegalNERo is a manually annotated corpus for named entity recognition in the Romanian legal domain. It provides gold annotations for organizations, locations, persons, time and legal resources mentioned in legal documents. Additionally, it offers GEONAMES codes for the named entities annotated as locations (where a link could be established). The LegalNERo corpus is available in different formats: span-based, token-based and RDF. The Linguistic Linked Open Data (LLOD) version is provided in RDF-Turtle format. CONLLUP files conform to the CoNLL-U Plus format (https://universaldependencies.org/ext-format.html). Part-of-speech tagging was performed using UDPIPE. Named entity annotations are placed in the column "RELATE:NE" (the 11th column), as defined in the "global.columns" metadata field. Similarly, GEONAMES references are in the column "RELATE:GEONAMES" (the 12th and last column). Automati...
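
The column layout described above can be read with a few lines of Python. What follows is only a minimal sketch, not the corpus authors' tooling: the function name `read_conllup`, the sample sentence, the `B-LOC`/`O` tag values and the `EXAMPLE_ID` GeoNames value are all invented for illustration. Only the column names "RELATE:NE" and "RELATE:GEONAMES" and the "global.columns" header come from the description.

```python
from typing import Dict, Iterable, Iterator, List


def read_conllup(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield sentences as lists of {column name: value} dicts.

    Column names are taken from the "# global.columns = ..." header,
    so the extra CoNLL-U Plus columns are addressed by name.
    """
    columns: List[str] = []
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("# global.columns ="):
            columns = line.split("=", 1)[1].split()
        elif line.startswith("#"):
            continue  # other comment lines (sent_id, text, ...) are skipped
        elif not line.strip():
            if sentence:  # a blank line closes the current sentence
                yield sentence
                sentence = []
        else:
            sentence.append(dict(zip(columns, line.split("\t"))))
    if sentence:  # file did not end with a blank line
        yield sentence


# Hypothetical two-token sample; every value here is made up.
SAMPLE_COLUMNS = (
    "ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC "
    "RELATE:NE RELATE:GEONAMES"
)
SAMPLE = "\n".join([
    "# global.columns = " + SAMPLE_COLUMNS,
    "# sent_id = example-1",
    "\t".join(["1", "în", "în", "ADP", "_", "_", "2", "case", "_", "_", "O", "_"]),
    "\t".join(["2", "București", "București", "PROPN", "_", "_", "0", "root",
               "_", "_", "B-LOC", "EXAMPLE_ID"]),
    "",
])

sentences = list(read_conllup(SAMPLE.splitlines()))
# Keep only tokens that carry a named entity label (assumed "O" = outside).
entities = [
    (tok["FORM"], tok["RELATE:NE"], tok["RELATE:GEONAMES"])
    for sent in sentences
    for tok in sent
    if tok["RELATE:NE"] != "O"
]
print(entities)
```

Reading the column names from the header rather than hard-coding positions 11 and 12 keeps the sketch usable for other CoNLL-U Plus files that declare a different set of extra columns.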

ROBIN Technical Acquisition Speech Corpus

The ROBIN Technical Acquisition Speech Corpus (ROBINTASC) was developed within the ROBIN project. Its main purpose was to improve the behaviour of a conversational agent, allowing human-machine interaction in the context of purchasing technical equipment. It contains over 6 hours of read speech in Romanian. We provide text files, the associated speech files (WAV, 44.1 kHz, 16-bit, single channel), and annotated text files in CoNLL-U format.
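
The audio parameters listed above (44.1 kHz, 16-bit, single channel) can be checked with Python's standard `wave` module. This is a sketch under stated assumptions: the helper names `describe_wav` and `matches_expected` are hypothetical, and the one-second silent clip is generated in memory as a stand-in, since no actual corpus file or file naming is assumed here.

```python
import io
import wave


def describe_wav(source) -> dict:
    """Return the basic audio parameters of a WAV file or file-like object."""
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        return {
            "sample_rate_hz": rate,
            "bits_per_sample": wav.getsampwidth() * 8,
            "channels": wav.getnchannels(),
            "duration_s": wav.getnframes() / rate,
        }


def matches_expected(info: dict) -> bool:
    """Check the parameters stated in the corpus description."""
    return (
        info["sample_rate_hz"] == 44100
        and info["bits_per_sample"] == 16
        and info["channels"] == 1
    )


# Synthetic stand-in for a corpus file: one second of 16-bit mono silence.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(1)       # single channel
    out.setsampwidth(2)       # 2 bytes = 16 bits per sample
    out.setframerate(44100)   # 44.1 kHz
    out.writeframes(b"\x00\x00" * 44100)
buffer.seek(0)

info = describe_wav(buffer)
print(info, matches_expected(info))
```

For a real recording, a path string can be passed to `describe_wav` instead of the in-memory buffer.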

Named Entity Recognition in the Romanian Legal Domain

Proceedings of the Natural Legal Language Processing Workshop 2021, 2021

A Dialog Manager for Micro-Worlds

Studies in Informatics and Control, 2020

Making Pepper Understand and Respond in Romanian

2019 22nd International Conference on Control Systems and Computer Science (CSCS), 2019

Within the larger project ROBIN, focused on the development of software and services for human-robot interaction, we present here a set of activities focused on the creation and enhancement of the language resources necessary for making dialogue possible between humans and the robot Pepper. More precisely, we describe the preparatory activities for turning the robot Pepper into a dialog partner for a human using the Romanian language. The language resources that have been created are a lexicon, a language model, and an acoustic language model. They have been enhanced by using other language resources, such as the Romanian wordnet and word embeddings extracted from the large Romanian reference corpus CoRoLa. They will ensure oral communication within some envisaged microworlds and are therefore specific to these microworlds. These resources have been carefully curated and evaluated.

The MARCELL Legislative Corpus

This article presents the current outcomes of the MARCELL CEF Telecom project aiming to collect and deeply annotate a large comparable corpus of legal documents. The MARCELL corpus includes 7 monolingual sub-corpora (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian) containing the total body of the respective national legislative documents. These sub-corpora are automatically sentence-split, tokenized, lemmatized and morphologically and syntactically annotated. The monolingual sub-corpora are complemented by a thematically related parallel corpus (Croatian-English). The metadata and the annotations are uniformly provided for each language-specific sub-corpus. Besides the standard morphosyntactic analysis plus named entity and dependency annotation, the corpus is enriched with the IATE and EUROVOC labels. The file format is CoNLL-U Plus, containing the ten columns specific to the CoNLL-U format and four extra columns specific to our corpora. The MARCELL corpo...
