Multilingual semantic resources and parallel corpora in the biomedical domain: the CLEF-ER challenge
Related papers
2018
Clinical and biomedical text mining research efforts have so far focused mainly on documents written in English. These efforts benefited significantly from the availability not only of domain-specific components such as tokenizers or Part-of-Speech taggers, but particularly from access to very large training corpora and terminological resources like UMLS. In order to exploit terminological resources currently restricted to English, it is necessary to promote more systematic translation efforts into other languages, be it manually or by means of machine translation techniques. An initial barrier, and not only for generating medical machine translation models, is the actual identification of relevant datasets that could be exploited to derive glossaries and parallel corpora. Usually relevant datasets weren't constructed as language technology resources and thus are often overlooked by the natural language processing community. This article describes an exhaustive effort to identify and c...
A multilingual gold-standard corpus for biomedical concept recognition: the Mantra GSC
Journal of the American Medical Informatics Association : JAMIA, 2015
To create a multilingual gold-standard corpus for biomedical concept recognition. We selected text units from different parallel corpora (Medline abstract titles, drug labels, biomedical patent claims) in English, French, German, Spanish, and Dutch. Three annotators per language independently annotated the biomedical concepts, based on a subset of the Unified Medical Language System and covering a wide range of semantic groups. To reduce the annotation workload, automatically generated preannotations were provided. Individual annotations were automatically harmonized and then adjudicated, and cross-language consistency checks were carried out to arrive at the final annotations. The number of final annotations was 5530. Inter-annotator agreement scores indicate good agreement (median F-score 0.79), and are similar to those between individual annotators and the gold standard. The automatically generated harmonized annotation set for each language performed equally well as the best ann...
BVS Corpus: A Multilingual Parallel Corpus of Biomedical Scientific Texts
ArXiv, 2019
The BVS database (Health Virtual Library) is a centralized source of biomedical information for Latin America and the Caribbean, created in 1998 and coordinated by BIREME (Biblioteca Regional de Medicina) in agreement with the Pan American Health Organization (OPAS). Abstracts are available in English, Spanish, and Portuguese, with a subset in more than one language, thus being a possible source of parallel corpora. In this article, we present the development of parallel corpora from BVS in three languages: English, Portuguese, and Spanish. Sentences were automatically aligned using the Hunalign algorithm for the EN/ES and EN/PT language pairs, and also for a subset of trilingual articles. We demonstrate the capabilities of our corpus by training a Neural Machine Translation (OpenNMT) system for each language pair, which outperformed related works on scientific biomedical articles. Sentence alignment was also manually evaluated, presenting an average of 96% of correctly aligned sentences across al...
Deriving an English Biomedical Silver Standard Corpus for CLEF-ER
We describe the automatic harmonization method used for building the English Silver Standard annotation supplied as a data source for the multilingual CLEF-ER named entity recognition challenge. The use of an automatic Silver Standard is designed to remove the need for a costly and time-consuming expert annotation. The final voting threshold of 3 for the harmonization of 6 different annotations from the project partners kept 45% of all available concept centroids. On average, 19% (SD 14%) of the original annotations are removed. 97.8% of the partner annotations that go into the Silver Standard Corpus have exactly the same boundaries as their harmonized representations.
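The voting step described in this abstract can be illustrated with a minimal sketch. It assumes each annotation is an exact `(start, end, concept_id)` tuple and counts identical tuples as votes; the abstract refers to concept centroids, so exact-span matching is a simplification here, and the function name `harmonize` and the sample data are hypothetical.

```python
from collections import Counter


def harmonize(annotation_sets, threshold=3):
    """Keep annotations proposed by at least `threshold` annotators.

    Assumption: an annotation is an exact (start, end, concept_id) tuple.
    This is simple exact-match voting, not the centroid-based method
    mentioned in the abstract.
    """
    votes = Counter()
    for annotations in annotation_sets:
        # set() ensures each annotator votes at most once per annotation
        votes.update(set(annotations))
    return sorted(a for a, n in votes.items() if n >= threshold)


# Hypothetical output of six annotation systems for one sentence
systems = [
    [(0, 13, "C0018801"), (20, 27, "C0004057")],
    [(0, 13, "C0018801")],
    [(0, 13, "C0018801"), (20, 27, "C0004057")],
    [(6, 13, "C0018801")],
    [(20, 27, "C0004057"), (30, 34, "C0030193")],
    [],
]

silver = harmonize(systems, threshold=3)
print(silver)
```

With a threshold of 3 out of 6, only annotations that half of the systems agree on survive, which trades recall for precision in the resulting silver standard.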
Towards the Construction of a Gold Standard Biomedical Corpus for the Romanian Language
Data
Gold standard corpora (GSCs) are essential for the supervised training and evaluation of systems that perform natural language processing (NLP) tasks. Currently, most of the resources used in biomedical NLP tasks are in English. Little effort has been reported for other languages, including Romanian, and thus access to such language resources is poor. In this paper, we present the construction of the first morphologically and terminologically annotated biomedical corpus of the Romanian language (MoNERo), meant to serve as a gold standard for biomedical part-of-speech (POS) tagging and biomedical named entity recognition (bioNER). It contains 14,012 tokens distributed in three medical subdomains: cardiology, diabetes and endocrinology, extracted from books, journals and blogposts. In order to automatically annotate the corpus with POS tags, we used a Romanian tag set which has 715 labels, while diseases, anatomy, procedures and chemicals and drugs labels were manually annotate...
Multilingual named-entity recognition from parallel corpora
We present a named-entity recognition (NER) system for parallel multilingual text. Our system handles three languages (i.e., English, French, and Spanish) and is tailored to the biomedical domain. For each language, we design a supervised knowledge-based CRF model with rich biomedical and general domain information. We use the sentence alignment of the parallel corpora, the word alignment generated by the GIZA++[8] tool, and Wikipedia-based word alignment in order to transfer system predictions made by individual language models to the remaining parallel languages. We retrain each individual language system using the transferred predictions and generate a final enriched NER model for each language. The enriched system performs better than the initial system based on the predictions transferred from the other language systems. Each language model benefits from the external knowledge extracted from biomedical and general domain resources.
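The transfer of predictions through word alignment can be sketched as follows. This is an illustrative simplification, not the system described in the abstract: the alignment is assumed to be already available as `(source_index, target_index)` pairs (for example, parsed from the output of a word aligner), and the function name `project_labels`, the label `DISO`, and the example sentence pair are hypothetical.

```python
def project_labels(source_labels, alignment, target_length, outside="O"):
    """Copy entity labels from source tokens to aligned target tokens.

    source_labels: one label per source token ("O" means no entity)
    alignment: iterable of (source_index, target_index) pairs
    target_length: number of tokens in the target sentence
    """
    target_labels = [outside] * target_length
    for src_idx, tgt_idx in alignment:
        label = source_labels[src_idx]
        if label != outside:
            target_labels[tgt_idx] = label
    return target_labels


# Hypothetical English/French sentence pair with a word alignment
english = ["chronic", "heart", "failure", "treatment"]
french = ["traitement", "de", "l'", "insuffisance", "cardiaque", "chronique"]
english_labels = ["DISO", "DISO", "DISO", "O"]
alignment = [(0, 5), (1, 4), (2, 3), (3, 0)]

french_labels = project_labels(english_labels, alignment, len(french))
print(list(zip(french, french_labels)))
```

Unaligned target tokens keep the outside label, so projected annotations are only as complete as the alignment itself.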
Using Multilingual Terms for Biomedical Term Extraction
The goal of automatic term extraction often is not so much the creation of a new list of domain-specific terms, but rather the (semi-)automatic extension of a list of known terms. In this paper, we focus on the use of existing terms from glossaries, thesauri, or ontologies to extract new terms from a domain-specific text. Our new method is used to extract language-specific terms with the help of multilingual terminological resources. Our baseline system combines a linguistic pattern for extracting candidate noun phrases with a statistical method (χ²) for ranking candidate phrases according to their association strength in a domain-specific corpus. Our scoring method also takes into account the termhood of candidate phrases computed on the basis of a list of known terms. We show that uninterpolated average precision of the resulting term list is improved when tested using human evaluators.
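A χ² ranking of candidate phrases can be sketched with the standard 2×2 contingency formula, χ² = N(ad − bc)² / ((a+b)(c+d)(a+c)(b+d)). The setup below is an assumption made for illustration: it compares phrase counts in a domain corpus against a general reference corpus, which may differ from the contingency table the authors use. The function names, counts, and corpus sizes are hypothetical.

```python
def chi_square(a, b, c, d):
    """Pearson chi-square statistic for a 2x2 contingency table.

    a: phrase count in the domain corpus
    b: phrase count in the reference corpus
    c: count of all other phrases in the domain corpus
    d: count of all other phrases in the reference corpus
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # degenerate table: no evidence of association
    return n * (a * d - b * c) ** 2 / denominator


def rank_candidates(counts, domain_total, reference_total):
    """Sort candidate phrases by descending chi-square score."""
    scored = {
        phrase: chi_square(a, b, domain_total - a, reference_total - b)
        for phrase, (a, b) in counts.items()
    }
    return sorted(scored, key=scored.get, reverse=True)


# Hypothetical (domain count, reference count) per candidate phrase
counts = {"the patient": (20, 18), "heart failure": (30, 2)}
ranking = rank_candidates(counts, domain_total=1000, reference_total=1000)
print(ranking)
```

Because χ² is symmetric, a phrase that is strongly under-represented in the domain corpus also scores high, so a practical ranker would typically also check the direction of the association.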
High-quality, clean parallel corpora are a must for creating statistical machine translation or neural machine translation systems. Although high-quality parallel corpora are largely available for official languages of the European Union, the United Nations and other organizations, it is hard to find a sufficient amount of open parallel corpora for languages such as Turkish, which, in turn, leads to lower-quality machine translation for these languages. In this study, we use automatic and semi-automatic procedures to collect and prepare parallel corpora in the cardiology domain. We crawl a journal website and obtain 6500 Turkish abstracts and their English translations by using HTTrack. By aligning these abstracts and converting them into a translation memory in a computer-aided translation tool environment, we make it possible to use the corpora for machine translation training as well as term extraction. We argue that new tools integrating and streamlining the web crawling, alignment and cleaning steps are needed in order to support the preparation of parallel corpora for low-resource languages.
Cross-lingual semantic annotation of biomedical literature: experiments in Spanish and English
Bioinformatics, 2019
Motivation: Biomedical literature is one of the most relevant sources of information for knowledge mining in the field of Bioinformatics. Although English is the most widely addressed language in the field, in recent years there has been growing interest from the natural language processing community in dealing with languages other than English. However, the availability of language resources and tools for appropriate treatment of non-English texts is lagging behind. Our research is concerned with the semantic annotation of biomedical texts in the Spanish language, which can be considered an under-resourced language where biomedical text processing is concerned. Results: We have carried out experiments to assess the effectiveness of several methods for the automatic annotation of biomedical texts in Spanish. One approach is based on the linguistic analysis of Spanish texts and their annotation using an information retrieval and concept disambiguation approach. A second method...