Marta Villegas - Academia.edu
Papers by Marta Villegas
Cornell University - arXiv, Sep 16, 2021
We introduce CoWeSe (the Corpus Web Salud Español), the largest Spanish biomedical corpus to date, consisting of 4.5GB (about 750M tokens) of clean plain text. CoWeSe is the result of a massive crawl of 3,000 Spanish domains carried out in 2020. The corpus is openly available and already preprocessed. CoWeSe is an important resource for biomedical and health NLP in Spanish and has already been employed to train domain-specific language models and to produce word embeddings. We released the CoWeSe corpus under a Creative Commons Attribution 4.0 International license in Zenodo (
These Catalan sub-word embeddings, trained with FastText using BPE, have been generated from the largest Catalan corpus built to date. The corpus contains more than 10GB of curated, high-quality text. If this material is useful, please cite it. Copyright (c) 2021 Text Mining Unit - Barcelona Supercomputing Center
Spanish Biomedical Sub-word Embeddings in FastText. These embeddings have been generated from the largest corpus of Spanish biomedical resources built to date. License: Creative Commons Attribution 4.0 International License. Copyright (c) 2021 Text Mining Unit - Barcelona Supercomputing Center
Spanish Clinical Word Embeddings in FastText. These embeddings have been generated from the largest corpus of Spanish clinical resources built to date. License: Creative Commons Attribution 4.0 International License. Copyright (c) 2021 Text Mining Unit - Barcelona Supercomputing Center
ATC7 codes Spanish-English translations. This repository contains Spanish-English translations of the medical ATC7 codes. MIT License. Copyright (c) 2021 Secretaría de Estado de Digitalización e Inteligencia Artificial (SEDIA)
[Plan TL/medicine/word embeddings] Word embeddings generated from Spanish corpora that include: (a) the full text in Spanish available in SciELO.org (until December 2018), (b) all articles from the following Wikipedia categories: Pharmacology, Pharmacy, Medicine and Biology (as of December 2018) and (c) the concatenation of the previous two corpora. Two different approaches were used to generate the word embeddings: Word2Vec and fastText.
The Catalan Government Crawling Corpus is a 39-million-token web corpus of Catalan. It was obtained by crawling the .gencat domain and subdomains, which belong to the Catalan Government, during September and October 2020. It consists of 39,117,909 tokens, 1,565,433 sentences and 71,043 documents. Documents are separated by single new lines. It is a subcorpus of the Catalan Textual Corpus. We license the actual packaging of this data under a CC0 1.0 Universal License.
A common characteristic of content generated by healthcare professionals, regardless of the clinical discipline or language, is the widespread and frequent use of abbreviations, acronyms, telegraphic phrases and shorthand notes. Despite the well-known issues of ambiguity and misinterpretation, abbreviations are needed in practice to simplify communication and avoid repeating long, complex, specialized medical terms. Moreover, clinical texts typically do not provide explicit abbreviation definitions. The performance of clinical natural language processing and text mining systems therefore depends significantly on first recognizing medical abbreviations and resolving their definitions. To promote the development of such key components, we have organized the second Biomedical Abbreviation Recognition and Resolution (BARR2) track. The overall aim of this effort was to evaluate strategies for detecting automatically mentions of abbrevi...
XVI Jornadas Nacionales de Información y Documentación en Ciencias de la Salud. Oviedo, April 4-5, 2019
Lecture Notes in Computer Science, 2020
This paper describes the eighth edition of the BioASQ Challenge, which will run as an evaluation Lab in the context of CLEF2020. The aim of BioASQ is to promote systems and methods for highly precise biomedical information access. This is done by organizing a series of challenges (shared tasks) on large-scale biomedical semantic indexing and question answering, in which different teams develop systems that compete on the same demanding benchmark datasets, which represent the real information needs of biomedical experts. To facilitate this information-finding process, the BioASQ challenge introduced two complementary tasks: (a) the automated indexing of large volumes of unlabelled data, primarily scientific articles, with biomedical concepts, and (b) the processing of biomedical questions and the generation of comprehensible answers. By rewarding the most competitive systems that outperform the state of the art, BioASQ pushes the research frontier towards ensuring that biomedical experts have direct access to valuable knowledge.
The research reported in this paper is part of the activities carried out within the CLARIN (Common Language Resources and Technology Infrastructure) project. CLARIN is a large-scale pan-European project to create, coordinate and make language resources and technology available and readily usable. CLARIN is devoted to the creation of a persistent and stable infrastructure serving the needs of the European Humanities and Social Sciences (HSS) research community. We present a real case, in the field of discourse analysis of newspapers, which demonstrates the impact of digital methods in the humanities. More precisely, we describe our collaboration with the Feminario research group from the UAB. This group has been investigating androcentric practices in the Spanish general press since the 1980s, and their research suggests that the Spanish general press has undergone a dehumanization process that excludes women and men as if their contributions were insignificant for social functioning. ...
This paper describes ongoing work on the construction of a new treebank for Spanish, the IULA Treebank. This new resource will contain about 60,000 richly annotated sentences as an extension of the existing IULA Technical Corpus, which is only PoS tagged. In this paper we focus on the work done to define the annotation process and the treebank design principles. We report on how the framework used, the DELPH-IN processing framework, has been crucial to the design principles and to the bootstrapping strategy followed, especially with regard to the use of stochastic modules for reducing parsing overgeneration. We also report on the different evaluation experiments carried out to guarantee the quality of the results already available.
[Plan TL/medicine/word embeddings] Word embeddings generated from Spanish corpora that include: (a) the full text in Spanish available in SciELO.org (until December 2018), (b) all articles from the following Wikipedia categories: Pharmacology, Pharmacy, Medicine and Biology (as of December 2018) and (c) the concatenation of the previous two corpora. We used fastText to train the word embeddings. For more information, we refer to the corresponding article: https://www.aclweb.org/anthology/W19-1916/
International Conference on Language Resources and Evaluation, 2002
This paper describes the Lexicographic Station Development Platform (LSDP) and how it has been used to implement the lexicon guidelines and standards produced by the ISLE Computational Lexicon Group in a prototype tool for lexical encoding. The aims of the work described here were to (i) exemplify and disseminate the Multilingual ISLE Lexical Entry (MILE) using an actual model and available monolingual data, (ii) make extensive use of the existing PAROLE and SIMPLE lexicons and (iii) eventually test the quality of the guidelines in a real scenario. To meet these aims, the LSDP was designed as a tool generator that can automatically generate a prototype lexicographic station from the ISLE guidelines when they are formally expressed in a DTD. Thus, we have tested and exemplified the recommendations expressed in MILE, and in addition we have shown that MILE can be implemented on existing monolingual resources.
Revista de Procesamiento de Lenguaje Natural (SEPLN), 2017
Although considerable effort has gone into applying text mining technologies to biomedical literature and clinical records written in English, attempts to process documents in other languages have attracted far less attention despite their practical interest. Given the considerable number of biomedical documents written in Spanish, there is a pressing need for access to biomedical and clinical text mining resources developed for this high-impact language. To address this, the Secretaría de Estado commissioned specialized technical support for the development of the Plan de Impulso de las Tecnologías del Lenguaje in the biomedical domain. The article briefly describes the project's main lines of action in its first phase: facilitating access to NLP resources and tools, analyzing and ensuring the interoperability of the system, defining evaluation methods and tools, disseminating the project and its results, and aligning and collaborating with other national and international projects. In addition, we have identified some of the critical tasks in biomedical text processing that require further research and the availability of tools.
This paper reports our experience integrating different resources and services into a grid environment. The use case we address involves the deployment of several NLP applications as web services. The ultimate objective of this task was to create a scenario where researchers have access to a variety of services they can operate. These services should be easy to invoke and able to interoperate with one another. We describe the interoperability problems we faced, which involve metadata interoperability, data interoperability and service interoperability. We devote special attention to service interoperability and explore the possibility of defining common interfaces and semantic descriptions of services. While the web services paradigm suits the integration of different services very well, this requires mutual understanding and the accommodation to common interfaces that not only provide a technical solution but also ease the user's work. Defining common interfaces benefits...
This article deals with the problem of Cross-Lingual Text Categorization (CLTC), which arises when documents in different languages must be classified according to the same classification tree. We describe practical and cost-effective solutions for automatic Cross-Lingual Text Categorization, both when a sufficient number of training examples is available for each new language and when no training examples are available for some language. Experimental results on the bilingual classification of the ILO corpus (with documents in English and Spanish) are obtained using bilingual training, terminology translation and profile-based translation.