Doaa Samy | Cairo University

Papers by Doaa Samy

Recursos bilingües de ingeniería lingüística para el procesamiento de español y árabe

Detecting generic drugs in biomedical texts

This paper presents a system for the recognition and classification of generic drug names in biomedical texts. The system combines information from the UMLS Metathesaurus with the nomenclature rules for generic drugs recommended by the United States Adopted Names (USAN) Council, which allow drugs to be classified into pharmacological families. The starting hypothesis is that the USAN rules can detect candidate drug names that are not included in UMLS (version 2007AC), thereby increasing the coverage of the system. Using UMLS alone, the system achieves 100% precision and 97% recall on a collection of 1481 abstracts of scientific articles from PubMed. Combining the USAN rules with UMLS slightly improves the coverage of the system.
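The idea of using nomenclature stems to catch drug names missing from a dictionary can be illustrated with a minimal sketch. It is not the authors' system: the stem table below is a tiny illustrative subset, `known_drugs` is a hypothetical stand-in for a UMLS lookup, and the function names are assumptions made for this example.

```python
import re

# Illustrative subset of nomenclature stems mapped to pharmacological families.
# Assumption: a real system would load the full stem list from an external source.
USAN_STEMS = {
    "olol": "beta-blockers",
    "pril": "ACE inhibitors",
    "cillin": "penicillins",
    "vastatin": "HMG-CoA reductase inhibitors",
    "prazole": "antiulcer benzimidazoles",
}


def classify_by_stem(token):
    """Return the family suggested by the token's suffix, or None."""
    word = token.lower()
    # Try longer stems first so the most specific suffix wins.
    for stem in sorted(USAN_STEMS, key=len, reverse=True):
        if word.endswith(stem) and len(word) > len(stem):
            return USAN_STEMS[stem]
    return None


def find_candidates(text, known_drugs):
    """List (token, family) pairs for tokens that are not in the
    dictionary (hypothetical stand-in for UMLS) but match a stem."""
    candidates = []
    for token in re.findall(r"[A-Za-z]+", text):
        if token.lower() in known_drugs:
            continue  # already covered by the dictionary lookup
        family = classify_by_stem(token)
        if family is not None:
            candidates.append((token, family))
    return candidates
```

Calling `find_candidates(text, known_drugs)` on an abstract returns only the tokens the dictionary missed, which mirrors the hypothesis that rules can extend coverage beyond the dictionary.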

Computational Linguistics Laboratory-Autónoma University Madrid

This paper tests two different strategies for medical term extraction in an Arabic medical corpus. The experiments and the corpus were developed within the framework of the Multimedica project, funded by the Spanish Ministry of Science and Innovation, which aims to develop multilingual resources and tools for processing newswire texts in the health domain. The first experiment uses a fixed list of medical terms; the second uses a list of Arabic equivalents of a very limited set of common Latin prefixes and suffixes used in medical terms. Results show that using the equivalents of Latin prefixes and suffixes outperforms the fixed list. The paper starts with an introduction, followed by a description of the state of the art in the field of Arabic medical Language Resources (LRs). The third section describes the corpus and its characteristics. The fourth and fifth sections explain the lists used and the results of the experiments carried out on a sub-corpus for evaluation. The last sectio...

UC3M: Classification of Semantic Relations between Nominals using Sequential Minimal Optimization

This paper presents a method for automatic classification of semantic relations between nominals using Sequential...

Applicability of ICT-supported language teaching in contexts of social integration and international cooperation

Círculo de Lingüística Aplicada a la Comunicación, 2018

This paper offers two examples of the applicability of ICT-supported language teaching in two different contexts: social integration and international cooperation. A search for methodologically innovative approaches in this field inspires a series of collaborative international projects involving Egypt and several European countries. The introduction of best practices in this field brings an improvement in the language teaching and learning process in Egyptian universities, as well as meaningful insights into the advantages and shortcomings of the different options available. ICT facilitates the creation of open-access materials available to disadvantaged groups (refugees or immigrants) who are outside conventional educational contexts but need tools and resources for fast acquisition of foreign languages. An example of these tools is the development of a new multilingual smartphone app based on communication needs. The app is currently being developed within an international conso...

A proposal for an Arabic named entity tagger leveraging a parallel corpus

International Conference RANLP, …, 2005

... Parallel Corpus * Doaa Samy, Laboratorio de Lingüística Informática, Universidad Autónoma de Madrid, doaa@maria.lllf.uam.es, Antonio ... Retrieval. Since 1995, a lot of studies have addressed NE recognition, tagging and classification. ...

Of Temporal Expressions in

empirical approach to a preliminary successful identification and resolution

Subtitling for Intercultural Communication in Foreign Language Learning/Teaching: The case of Dhat, an Egyptian Series Subtitled in Spanish

Caracteres: Estudios Culturales y Críticos de la Esfera Digital, 2017

The objective of this paper is to present a teaching resource for Arabic as a foreign language, developed within the context of the European project E-LENGUA. E-LENGUA is funded by the Erasmus+ programme and aims at developing new resources and sharing good practices in foreign language teaching using modern information and communication technologies. The resource consists of subtitles for a set of ten episodes of the Egyptian TV series Dhat. The subtitles are in Spanish, and the series depicts the evolution of Egyptian society from the fifties until the Egyptian Revolution in 2011 through the biography of Dhat, an Egyptian middle-class girl born in the fifties in Cairo. The paper is divided into four sections. The first is an introduction covering the theoretical framework. The second presents the methodology and the scope of work. The learning activities and their evaluation are addressed in the third section. Finally, the conclusions and future work are presented.

The LLI-UAM Multilingual Parallel Corpus: A New Resource

This paper presents the first-phase results of the ongoing research in the Computational Linguistics Laboratory at Autónoma University of Madrid (LLI-UAM) aimed at developing a multilingual parallel corpus (Arabic-Spanish-English) aligned at the sentence level and tagged at the POS level. A multilingual parallel corpus that brings together Arabic, Spanish and English is a new resource for the NLP community that completes the present panorama of parallel corpora. In the first part of this study, we introduce the novelty of our approach and the challenges encountered in creating such a corpus. This introductory part highlights the main features of the corpus and the criteria applied during the selection process. The second part focuses on two main stages: basic processing (tokenization and segmentation) and alignment. The alignment methodology is explained in detail, and the results obtained for the three language pairs are compared. POS tagging and tools used in this s...

Corpus Viewer: NLP and ML-based Platform for Public Policy Making and Implementation

Proces. del Leng. Natural, 2019

Corpus Viewer is a production service developed by the State Secretary for Digital Advancement (SEAD) within the framework of the National Language Technologies Plan (Plan TL), promoted by the same State Secretary. Corpus Viewer relies on Natural Language Processing (NLP), Machine Learning (ML) and Machine Translation (MT) to analyze structured metadata and unstructured textual data in large document corpora. The platform allows decision makers and policy implementers to analyze the R&D&i information space (mainly patents, scientific publications and public aid) for evidence- and knowledge-based policy making and implementation. In this paper, we describe the main functionalities of the platform and enumerate the techniques it is based on, which include a variety of methods such as document topic modeling and graph analysis.

Towards Resolving Morphological Ambiguity in Arabic Intelligent Language Tutoring Framework

The current paper deals with the relation between language resources and Computer Assisted Language Learning (CALL) systems: language resources are essential in the development of CALL applications; during the development of the system, resources are created; and finally the CALL system itself can be used to generate additional resources that are useful for research and development of new (CALL) systems. We focus on the system developed in the project DISCO (Development and Integration of Speech technology into COurseware for language learning): we describe the language resources employed for developing the DISCO system and present the DISCO system, paying attention to the design, the automatic speech recognition modules, and the resources produced within the project. Finally, we discuss how additional language resources can be generated through the DISCO system.

An Online Tool for Enhancing NLP of a Biomedical Corpus

This work presents an online interface that allows the user to search words and medical terms in the MultiMedica corpus, which gathers 51,476 texts in Spanish, Japanese and Arabic. In order to develop the tool, several natural language processing (NLP) techniques were applied: firstly, a number of corpora were processed and Part-of-Speech-tagged using morphological analysers for each language; then, the tagged texts were indexed to enhance online queries; thirdly, lists of medical terms were collected for each language. The online tool features a word query system, a term query system, and a medical term extractor. The word query system makes it possible to look up items according to word form, lemma, category or string. The medical term query system features an autocomplete function to enhance the input of the query, which is based on the 5,000 most frequent terms in the corpus. Finally, the term extractor detects candidate medical terms in an input text, and highlights them according...
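An autocomplete function backed by the most frequent terms of a corpus can be sketched as follows. This is only an illustration under stated assumptions, not the tool's implementation: `build_index` and `autocomplete` are hypothetical names, and the sketch assumes the terms are available as a plain list of strings.

```python
from bisect import bisect_left
from collections import Counter


def build_index(terms, top_n=5000):
    """Keep the top_n most frequent terms and return them sorted,
    so that all terms sharing a prefix sit next to each other."""
    counts = Counter(term.lower() for term in terms)
    return sorted(term for term, _ in counts.most_common(top_n))


def autocomplete(index, prefix, limit=10):
    """Return up to `limit` indexed terms that start with `prefix`."""
    prefix = prefix.lower()
    start = bisect_left(index, prefix)  # first term >= prefix
    suggestions = []
    for term in index[start:]:
        if not term.startswith(prefix):
            break  # sorted order: no later term can match
        suggestions.append(term)
        if len(suggestions) == limit:
            break
    return suggestions
```

Keeping the index sorted lets a binary search find the first matching term, so each keystroke scans only the matching run rather than the whole term list.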

An Empirical Approach to a Preliminary Successful Identification and Resolution of Temporal Expressions in Spanish News Corpora

Dating of contents is relevant to multiple advanced Natural Language Processing (NLP) applications, such as Information Retrieval or Question Answering. These could be improved by using techniques that consider a temporal dimension in their processes. To achieve this, temporal expressions in the data sources must first be detected accurately and represented in an appropriate standard format that captures the time value of the expressions once resolved and allows reasoning without ambiguity, in order to increase the range of search and the quality of the results returned. These tasks are essential for NLP applications if efficient temporal reasoning is expected afterwards. This work presents a typology of time expressions based on an empirical inductive approach, both from a structural perspective and from the point of view of their resolution. Furthermore, a method for the automatic recognition and resolution of temporal expressions in Spanish contents...

Reconocimiento y clasificación de entidades nombradas en textos legales en español

Procesamiento Del Lenguaje Natural, 2021

Named entity recognition and classification (NER/NERC) is a core task in Natural Language Processing (NLP) and Information Extraction. NERC plays an essential role in the legal domain in the development of intelligent legal systems. This work aims to take a first step towards establishing a baseline for the NERC task in legal Spanish. The main goal is to provide a linguistic resource by annotating five basic types of named entities in legislative texts in Peninsular Spanish. The five types of named entities are: Persons, Organizations, Locations, Absolute dates, and References to laws, decrees, orders, regulations and articles. A hybrid methodology is adopted that brings together three main techniques: regular expression patterns, lists from external sources, and the training of three NERC models using the open-source library spaCy v3. Of the three models ...
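The regular-expression component of such a hybrid approach might look like the sketch below. The patterns, labels and function name are assumptions made for illustration; they are deliberately simplified and are not the patterns used in the paper. Only absolute dates and references to numbered laws and decrees are covered.

```python
import re

# Spanish month names used in absolute dates such as "1 de octubre de 2015".
MONTHS = (
    "enero|febrero|marzo|abril|mayo|junio|julio|"
    "agosto|septiembre|octubre|noviembre|diciembre"
)

# Simplified pattern for references like "Ley 39/2015" or "Real Decreto 1720/2007".
LAW_PATTERN = re.compile(
    r"\b(?:Ley(?: Orgánica)?|Real Decreto(?:-ley| Legislativo)?|Decreto|Orden)"
    r"\s+\d+/\d{4}\b"
)

# Simplified pattern for absolute dates: day, month name, four-digit year.
DATE_PATTERN = re.compile(rf"\b\d{{1,2}} de (?:{MONTHS}) de \d{{4}}\b")


def find_entities(text):
    """Return (start, end, label, surface form) tuples sorted by position."""
    entities = []
    for label, pattern in (("LAW_REF", LAW_PATTERN), ("DATE", DATE_PATTERN)):
        for match in pattern.finditer(text):
            entities.append((match.start(), match.end(), label, match.group(0)))
    return sorted(entities)
```

Spans found this way could be merged with gazetteer matches and with the output of a trained model, which is the general shape of the hybrid methodology the abstract describes.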

Corpus Viewer: una plataforma basada en PLN y Aprendizaje Automático para diseño e implementación de política pública

Procesamiento Del Lenguaje Natural, 2019

Corpus Viewer is a production service developed by the State Secretary for Digital Advancement (SEAD) within the framework of the Plan de Impulso de Tecnologías del Lenguaje (Plan TL). It relies on Natural Language Processing (NLP) and Machine Learning techniques to analyze structured and unstructured data in large document collections such as patents, open-access scientific publications, European projects, etc. The aim is to offer policy makers and managers the ability to navigate the information space with an overall view that helps them take knowledge- and evidence-based decisions. This article describes the basic functionalities of the platform and lists the techniques employed, which include, among others, topic modeling and graph analysis.

Caracterización del sector de Tecnologías del Lenguaje mediante modelado de tópicos y análisis de grafos: Visión general de la participación española

Procesamiento Del Lenguaje Natural, 2019

This paper applies topic modeling and graph analysis tools to characterize the Human Language Technologies (HLT) sector in Spain. To that end, it studies the ACL Anthology repository, with special emphasis on the Spanish participation. The analysis takes into account the structured and unstructured data in that source in order to portray the current landscape in terms of underlying themes and their evolution over recent years in comparison with the international community. The results are presented through an interactive visualization that allows the user to navigate the HLT space over the period 1983-2018.

Marcadores discursivos en árabe y en español: un estudio computacional basado en corpus paralelos con anotación pragmática

The aim of this article is to analyze, from a computational perspective, how Arabic and Spanish discourse markers have been translated in the UN parallel corpus. The study is divided into three parts. The first presents the resources: it sets out the most important characteristics of the UN corpus and explains the pragmatic annotation model (PRAGMATEXT) used to classify the discourse markers. The semantic-pragmatic phenomena covered by the annotation model are emotional language, discourse relations, speech acts, modalization, evidentiality and deixis. The second part is devoted to the discourse markers in the Spanish part of the corpus. It explains the discourse phenomena encoded through the discourse markers, as well as the frequency of occurrence of these...

Construction of a Bilingual Arabic-Spanish Lexicon of Verbs Based on a Parallel Corpus

Parallel corpora are considered an important resource for the development of linguistic tools. In this paper, our main goal is the development of a bilingual lexicon of verbs. The construction of this lexicon is possible using two main resources: I) a parallel corpus (through alignment); II) the linguistic tools developed for Spanish (which serve as a starting point for developing tools for the Arabic language). In the end, aligned equivalent verbs are detected automatically from a Spanish-Arabic parallel corpus. To achieve this goal, we had to go through several preparatory stages: the assessment of the parallel corpus, the monolingual tokenization of each corpus, a preliminary sentence alignment and, finally, the application of the model for automatic extraction of equivalent verbs. Our method is hybrid, since it combines both statistical and linguistic approaches.

Detección de fármacos genéricos en textos biomédicos

Proces. del Leng. Natural, 2008

This paper presents a system for drug name recognition and classification in biomedical texts. The system combines information from the UMLS Metathesaurus and nomenclature rules for generic drugs, recommended by the United States Adopted Names (USAN) Council, that allow the classification of drugs into pharmacological families. The initial hypothesis is that the rules are able to detect possible candidate drug names which are not included in the UMLS database (version 2007AC), thereby increasing the coverage of the system. The system achieves 100% precision and 97% recall using UMLS only. The combination of the USAN rules and UMLS slightly improves the coverage of the system.

Landscaping Language Technologies using Topic Modeling and Graph Analysis: Overview of the Spanish Contribution

Proces. del Leng. Natural, 2019

This work has been carried out in the framework of the Spanish State Plan for Natural Language Technologies. The work of J. Arenas-Garcia has also been partly funded by MINECO projects TEC2014-52289-R and TEC2017-83838-R.

Research paper thumbnail of Recursos bilingues de ingenieria linguistica para el procesmiento de Español y Árabe

Research paper thumbnail of Detecting generic drugs in biomedical texts

Este trabajo presenta un sistema para el reconocimiento y clasificación de nombres genéricos de f... more Este trabajo presenta un sistema para el reconocimiento y clasificación de nombres genéricos de fármacos en textos biomédicos. El sistema combina información del Metatesauro UMLS y reglas de nomenclatura para fármacos genéricos, recomendadas por el consejo “United States Adoptated Names” (USAN), que permiten la clasificación de los fármacos en familias farmacológicas. La hipótesis de partida es que las reglas USAN son capaces de detectar posibles candidatos de fármacos que no están incluidos en UMLS (versión 2007AC), aumentando la cobertura del sistema. El sistema consigue un 100% de precisión y un 97% de cobertura usando sólo UMLS sobre una colección de 1481 resúmenes de artículos científicos de PubMed. La combinación de las reglas USAN con UMLS mejoran ligeramente la cobertura del sistema.This paper presents a system for drug name recognition and clasification in biomedical texts. The system combines information from UMLS Metathesaurus and nomenclatura rules for generic drugs, rec...

Research paper thumbnail of *Computational Linguistics Laboratory-Autónoma University Madrid

This paper tests two different strategies for medical term extraction in an Arabic Medical Corpus... more This paper tests two different strategies for medical term extraction in an Arabic Medical Corpus. The experiments and the corpus are developed within the framework of Multimedica project funded by the Spanish Ministry of Science and Innovation and aiming at developing multilingual resources and tools for processing of newswire texts in the Health domain. The first experiment uses a fixed list of medical terms, the second experiment uses a list of Arabic equivalents of very limited list of common Latin prefix and suffix used in medical terms. Results show that using equivalents of Latin suffix and prefix outperforms the fixed list. The paper starts with an introduction, followed by a description of the state-of-art in the field of Arabic Medical Language Resources (LRs). The third section describes the corpus and its characteristics. The fourth and the fifth sections explain the lists used and the results of the experiments carried out on a sub-corpus for evaluation. The last sectio...

Research paper thumbnail of UC3M: Classification of Semantic Relations between Nominals using Sequential Minimal Optimization

This paper presents a method for automatic classification of semantic relations between nominals ... more This paper presents a method for automatic classification of semantic relations between nominals using Sequential

Research paper thumbnail of Applicability of ICT-supported language teaching in contexts of social integration and international cooperation

Círculo de Lingüística Aplicada a la Comunicación, 2018

This paper offers two examples of the applicability of ICT-supported language teaching in two dif... more This paper offers two examples of the applicability of ICT-supported language teaching in two different contexts: social integration and international cooperation. A search for methodologically innovative approaches in this field inspires a series of collaborative international projects involving Egypt and several European countries. The introduction of best practices in this field brings an improvement in the language teaching and learning process in Egyptian universities, as well as meaningful insights on the advantages and shortcomings of the different options available. ICT facilitates the creation of open access materials available to disadvantaged groups (refugees or immigrants) that are outside conventional educational contexts but need tools and resources for a fast acquisition of foreign languages. An example of these tools is the development of a new multilingual smartphone app based on communication needs. The app is currently being developed within an international conso...

Research paper thumbnail of A proposal for an Arabic named entity tagger leveraging a parallel corpus

International Conference RANLP, …, 2005

... Parallel Corpus * Doaa Samy Laboratorio de Lingüística Informática Universidad Autónoma de Ma... more ... Parallel Corpus * Doaa Samy Laboratorio de Lingüística Informática Universidad Autónoma de Madrid doaa@maria.lllf.uam.es Antonio ... Retrieval. Since 1995, a lot of studies have ad-dressed NE recognition, tagging and clas-sification. ...

Research paper thumbnail of Of Temporal Expressions in

empirical approach to a preliminary successful identification and resolution

Research paper thumbnail of Subtitling for Intercultural Communication in Foreign Language Learning/Teaching: The case of Dhat, an Egyptian Series Subtitled in Spanish

Caracteres: Estudios Culturales y Críticos de la Esfera Digital, 2017

The objective of this paper is to present a teaching resource for Arabic as a foreign language, d... more The objective of this paper is to present a teaching resource for Arabic as a foreign language, developed within the context of the European project E-LENGUA. E-LENGUA is funded by Erasmus+ programme and it aims at developing new resources and sharing good practices in foreign language teaching using modern information and communication technologies. The resource consists of subtitling a set of ten episodes of the Egyptian TV series, Dhat. The subtitles are in Spanish and the series depicts the evolution of the Egyptian society, from the fifties until the Egyptian Revolution in 2011 through the biography of Dhat, an Egyptian middle class girl born in the fifties in Cairo. The paper is divided into four sections. The first is an introduction covering the theoretical framework. The second presents the methodology and the scope of work. The learning activities and its evaluation are addressed in the third section. Finally, the conclusions and future work are presented.

Research paper thumbnail of 1. The LLI-UAM Multilingual Parallel Corpus: A New Resource *

This paper presents the results (1st phase) of the on-going research in the Computational Linguis... more This paper presents the results (1st phase) of the on-going research in the Computational Linguistics Laboratory at Autónoma University of Madrid (LLI-UAM) aiming at the development of a multi-lingual parallel corpus (Arabic-Spanish-English) aligned on the sentence level and tagged on the POS level. A multilingual parallel corpus which brings together Arabic, Spanish and English is a new resource for the NLP community that completes the present panorama of parallel corpora. In the first part of this study, we introduce the novelty of our approach and the challenges encountered to create such a corpus. This introductory part highlights the main features of the corpus and the criteria applied during the selection process. The second part focuses on two main stages: basic processing (tokenization and segmentation) and alignment. Methodology of alignment is explained in detail and results obtained in the three different linguistic pairs are compared. POS tagging and tools used in this s...

Research paper thumbnail of Corpus Viewer: NLP and ML-based Platform for Public Policy Making and Implementation

Proces. del Leng. Natural, 2019

Corpus Viewer is a production service developed by the State Secretary for Digital Advancement (S... more Corpus Viewer is a production service developed by the State Secretary for Digital Advancement (SEAD) within the framework of the National Language Technologies Plan (Plan TL), promoted by the same State Secretary. Corpus Viewer relies on Natural Language Processing (NLP), Machine Learning (ML) and Machine Translation (MT) to analyze structured metadata and unstructured textual data in large document corpora. The platform allows the decision maker and the policy implementer the possibility of analyze R&D&i information space (mainly patents, scientific publications and public aids) for evidence and knowledge-based policy making and implementation. In this paper, we describe the main functionalities of the platform and enumerate the techniques it is based on, which include a variety of methods like document topic modeling and graph analysis.

Research paper thumbnail of Towards Resolving Morphological Ambiguity in Arabic Intelligent Language Tutoring Framework

The current paper deals with the relation between language resources and Computer Assisted Langua... more The current paper deals with the relation between language resources and Computer Assisted Language Learning (CALL) systems: language resources are essential in the development of CALL applications, during the development of the system resources are created, and finally the CALL system itself can be used to generate additional resources that are useful for research and development of new (CALL) systems. We focus on the system developed in the project DISCO (Development and Integration of Speech technology into COurseware for language learning): we describe the language resources employed for developing the DISCO system and present the DISCO system paying attention to the design, the automatic speech recognition modules, and the resources produced within the project. Finally, we discuss how additional language resources can be generated through the DISCO system.

Research paper thumbnail of An Online Tool for Enhancing NLP of a Biomedical Corpus

This work presents an online interface that allows the user to search words and medical terms in ... more This work presents an online interface that allows the user to search words and medical terms in the MultiMedica corpus, which gathers 51,476 texts in Spanish, Japanese and Arabic. In order to develop the tool, several natural language processing (NLP) techniques were applied: firstly, a number of corpora were processed and Part-of-Speech-tagged using morphological analysers for each language; then, the tagged texts were indexed to enhance online queries; thirdly, lists of medical terms were collected for each language. The online tool features word query system, a term query system, and a medical term extractor. The word query system makes it possible to look up items according to word form, lemma, category or string. The medical term query system features an autocomplete function to enhance the input of the query, which is based on the 5000 more frequent terms in the corpus. Finally, the term extractor detects candidate medical terms in an input text, and highlights them according...

Research paper thumbnail of An Empirical Approach to a Preliminary Successful Identification and Resolution of Temporal Expressions in Spanish News Corpora

Dating of contents is relevant to multiple advanced Natural Language Processing (NLP) application... more Dating of contents is relevant to multiple advanced Natural Language Processing (NLP) applications, such as Information Retrieval or Question Answering. These could be improved by using techniques that consider a temporal dimension in their processes. To achieve it, an accurate detection of temporal expressions in data sources must be firstly done, dealing with them in an appropriated standard format that captures the time value of the expressions once resolved, and allows reasoning without ambiguity, in order to increase the range of search and the quality of the results to be returned. These tasks are completely necessary for NLP applications if an efficient temporal reasoning is afterwards expected. This work presents a typology of time expressions based on an empirical inductive approach, both from a structural perspective and from the point of view of their resolution. Furthermore, a method for the automatic recognition and resolution of temporal expressions in Spanish contents...

Research paper thumbnail of Reconocimiento y clasificación de entidades nombradas en textos legales en español

Procesamiento Del Lenguaje Natural, 2021

El reconocimiento y la clasificacion de las entidades nombradas (NER/NERC) es una tarea principal... more El reconocimiento y la clasificacion de las entidades nombradas (NER/NERC) es una tarea principal en las areas del Procesamiento del Lenguaje Natural (PLN) y la Extraccion de la Informacion. El papel de NERC en el dominio legal es imprescindible en el desarrollo de sistemas legales inteligentes. El presente trabajo pretende dar un primer paso hacia establecer un "baseline" para la tarea NERC en el espanol juridico. El objetivo principal consiste en proporcionar un recurso linguistico anotando cinco tipos basicos de entidades nombradas en los textos legislativos en espanol peninsular. Los cinco tipos de entidades nombradas son: Personas, Organizaciones, Lugares, Fechas absolutas y Referencias a leyes, decretos, ordenes, normativas y articulos. Se adopta una metodologia hibrida que reune tres tecnicas principales: Patrones de expresiones regulares, listas de fuentes externas y el entrenamiento de tres modelos NERC utilizando la libreria abierta spaCy v3. De los tres modelos ...

Corpus Viewer: una plataforma basada en PLN y Aprendizaje Automático para diseño e implementación de política pública

Procesamiento Del Lenguaje Natural, 2019

Corpus Viewer is a production service developed by the State Secretariat for Digital Advancement (SEAD) within the framework of the Plan de Impulso de Tecnologías del Lenguaje (Plan TL). It relies on Natural Language Processing (NLP) and Machine Learning techniques to analyze structured and unstructured data in large document collections such as patents, open-access scientific publications, European projects, etc. The goal is to offer policy makers and managers the ability to navigate the information space with an overall view that helps them make decisions based on knowledge and evidence. This article describes the basic functionalities of the platform and lists the techniques used, which include, among others, topic modeling and graph analysis.

Caracterización del sector de Tecnologías del Lenguaje mediante modelado de tópicos y análisis de grafos: Visión general de la participación española

Procesamiento Del Lenguaje Natural, 2019

This paper applies topic modeling and graph analysis tools to characterize the Human Language Technologies (HLT) sector in Spain by studying the scientific literature in the ACL Anthology repository. The analysis takes into account the structured and unstructured data in that source in order to portray the current landscape in terms of underlying topics and their evolution in recent years, in comparison with the international community. The results are presented through an interactive visualization that allows navigating the HLT space over the period 1983-2018.

Marcadores discursivos en árabe y en español: un estudio computacional basado en corpus paralelos con anotación pragmática

The aim of this article is to analyze, from a computational perspective, how Arabic and Spanish discourse markers have been translated in the UN parallel corpus. The research is divided into three parts. The first presents the resources: it describes the most important characteristics of the UN corpus and explains the pragmatic annotation model (PRAGMATEXT) used to classify the discourse markers. The semantic-pragmatic phenomena covered by the annotation model are: emotional language, discourse relations, speech acts, modalization, evidentiality and deixis. The second part is devoted to discourse markers in the Spanish part of the corpus. It explains the discourse phenomena encoded through discourse markers, as well as the frequency of occurrence of these...

Construction of a Bilingual Arabic-Spanish Lexicon of Verbs Based on a Parallel Corpus

Parallel corpora are considered an important resource for the development of linguistic tools. In this paper our main goal is the development of a bilingual lexicon of verbs. The construction of this lexicon relies on two main resources: I) a parallel corpus (through alignment); II) the linguistic tools developed for Spanish (which serve as a starting point for developing tools for the Arabic language). As a result, aligned equivalent verbs are detected automatically from a Spanish-Arabic parallel corpus. To achieve this goal, we went through several preparatory stages: the assessment of the parallel corpus, the monolingual tokenization of each corpus, a preliminary sentence alignment and, finally, the application of the model for automatic extraction of equivalent verbs. Our method is hybrid, since it combines both statistical and linguistic approaches.

Detección de fármacos genéricos en textos biomédicos

Proces. del Leng. Natural, 2008

This paper presents a system for drug name recognition and classification in biomedical texts. The system combines information from the UMLS Metathesaurus with the nomenclature rules for generic drugs recommended by the United States Adopted Names (USAN) Council, which allow drugs to be classified into pharmacological families. The initial hypothesis is that the rules can detect candidate drug names that are not included in the UMLS database (version 2007AC), thereby increasing the coverage of the system. The system achieves 100% precision and 97% recall using UMLS only. Combining the USAN rules with UMLS slightly improves the coverage of the system.
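The combination of a lexicon lookup with stem rules described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `KNOWN_DRUGS` dictionary is a hypothetical stand-in for a UMLS Metathesaurus lookup, `classify_drug` is a name introduced here for illustration, and the stem table is a small set of commonly cited USAN stems rather than the rule set used in the paper.

```python
from typing import Dict, Optional

# Illustrative subset of commonly cited USAN stems (assumption: not the paper's rule set).
USAN_STEMS: Dict[str, str] = {
    "olol": "beta-adrenergic blockers",
    "pril": "ACE inhibitors",
    "vastatin": "HMG-CoA reductase inhibitors",
    "cillin": "penicillins",
    "azepam": "diazepam-type antianxiety agents",
}

# Hypothetical stand-in for a UMLS Metathesaurus lookup.
KNOWN_DRUGS: Dict[str, str] = {"atenolol": "beta-adrenergic blockers"}


def classify_drug(token: str, lexicon: Dict[str, str] = KNOWN_DRUGS) -> Optional[str]:
    """Return a pharmacological family for a token, or None if no rule applies."""
    word = token.lower()
    # Step 1: exact lexicon match (the role UMLS plays in the abstract).
    if word in lexicon:
        return lexicon[word]
    # Step 2: fall back to suffix rules, trying longer stems first.
    for stem in sorted(USAN_STEMS, key=len, reverse=True):
        # Require at least one character before the stem.
        if len(word) > len(stem) and word.endswith(stem):
            return USAN_STEMS[stem]
    return None
```

In this sketch the suffix rules only fire when the lexicon has no entry, which mirrors the idea of using rules to extend coverage. Pure suffix matching also over-generates (the word "April" ends in "pril"), so a practical system would need additional filtering.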

Landscaping Language Technologies using Topic Modeling and Graph Analysis: Overview of the Spanish Contribution

Proces. del Leng. Natural, 2019

This work has been carried out in the framework of the Spanish State Plan for Natural Language Technologies. The work of J. Arenas-Garcia has also been partly funded by MINECO projects TEC2014-52289-R and TEC2017-83838-R.