Miguel Angel Alonso Pardo - Profile on Academia.edu
Papers by Miguel Angel Alonso Pardo
Applied Sciences
Parsing is a core natural language processing technique that can be used to obtain the structure underlying sentences in human languages. Named entity recognition (NER) is the task of identifying the entities that appear in a text. NER is a challenging natural language processing task that is essential to extract knowledge from texts in multiple domains, ranging from financial to medical. It is intuitive that the structure of a text can be helpful to determine whether or not a certain portion of it is an entity and, if so, to establish its concrete limits. However, parsing has been a relatively little-used technique in NER systems, since most of them have chosen to consider shallow approaches to deal with text. In this work, we study the characteristics of NER, a task that is far from being solved despite its long history; we analyze the latest advances in parsing that make its use advisable in NER settings; we review the different approaches to NER that make use of syntactic informa...
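As an illustration of the general idea (not the survey's own method), the following sketch uses spaCy, assuming the en_core_web_sm model is installed, to compare parser-derived noun chunks with the spans proposed by a NER model:

```python
# Illustrative sketch only: parser-derived noun chunks used to check and,
# when possible, widen NER span boundaries. Requires spaCy + en_core_web_sm.
import spacy

nlp = spacy.load("en_core_web_sm")

def entities_with_syntactic_spans(text):
    doc = nlp(text)
    chunks = list(doc.noun_chunks)  # candidate spans coming from the syntactic analysis
    results = []
    for ent in doc.ents:
        # Prefer a noun chunk that covers the entity, if one exists.
        covering = [c for c in chunks if c.start <= ent.start and c.end >= ent.end]
        span = covering[0] if covering else ent
        results.append((span.text, ent.label_))
    return results

print(entities_with_syntactic_spans("The European Central Bank raised rates in Frankfurt."))
```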
We describe four systems that automatically generate bilingual dictionaries based on existing ones: three transitive systems differing only in the pivot language used, and a system based on a different approach which only needs monolingual corpora in both the source and target languages. All four methods make use of cross-lingual word embeddings trained on monolingual corpora and then mapped into a shared vector space. Experimental results confirm that our strategy has good coverage and recall, achieving a performance comparable to the best submitted systems on the TIAD 2019 gold standard set among the teams participating in the TIAD shared task.
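A minimal sketch of the retrieval step behind this idea, with toy vectors standing in for real mapped embeddings: once source and target embeddings live in a shared space, translation candidates are simply the nearest target words by cosine similarity.

```python
# Toy sketch of dictionary induction via cross-lingual embeddings.
# Real systems load large mapped embedding matrices; these vectors are illustrative.
import numpy as np

src_vecs = {"gato": np.array([0.9, 0.1, 0.0]), "perro": np.array([0.1, 0.9, 0.0])}
tgt_vecs = {"cat": np.array([0.85, 0.15, 0.0]), "dog": np.array([0.12, 0.88, 0.0]),
            "house": np.array([0.0, 0.1, 0.9])}

def translate(word, k=1):
    v = src_vecs[word]
    v = v / np.linalg.norm(v)
    scored = []
    for tgt, u in tgt_vecs.items():
        u = u / np.linalg.norm(u)
        scored.append((float(v @ u), tgt))  # cosine similarity in the shared space
    return [t for _, t in sorted(scored, reverse=True)[:k]]

print(translate("gato"))  # -> ['cat']
```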
Proces. del Leng. Natural, 2016
In this work we present a new strategy for creating treebanks for low-resource languages for syntactic parsing. The method consists of adapting and combining different treebanks annotated with Universal Dependencies from closely related linguistic varieties, with the goal of training a parser for the chosen language, in our case Galician. During the selection and adaptation of the source treebanks, we analyze the impact of properties at three different levels: (i) the distance between the source and target languages, (ii) the adaptation of lexical and orthographic features, and (iii) the annotation guidelines of the treebanks. Using the proposed strategy, we trained a statistical parser to label, with promising results and without any prior Galician data, a small corpus of this language. The manual correction of this corpus, used as a gold standard, allowed us to test the effectiveness of the proposed...
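Not the paper's exact pipeline, but a small sketch of the kind of combination-and-adaptation step it describes; the file layout, glob pattern and replacement rules below are illustrative assumptions.

```python
# Hedged sketch: build a training treebank for a low-resource language by
# concatenating CoNLL-U treebanks of close varieties and applying simple
# orthographic adaptation rules to the FORM column.
import glob

RULES = [("ção", "ción"), ("nh", "ñ")]  # toy Portuguese->Galician-style adaptations

def adapt_token(form):
    for src, tgt in RULES:
        form = form.replace(src, tgt)
    return form

def merge_treebanks(pattern="source_treebanks/*.conllu", out_path="combined.conllu"):
    with open(out_path, "w", encoding="utf-8") as out:
        for path in sorted(glob.glob(pattern)):
            with open(path, encoding="utf-8") as f:
                for line in f:
                    if line.strip() and not line.startswith("#"):
                        cols = line.rstrip("\n").split("\t")
                        if len(cols) == 10:                 # regular CoNLL-U token line
                            cols[1] = adapt_token(cols[1])  # FORM column
                            line = "\t".join(cols) + "\n"
                    out.write(line)

# merge_treebanks()  # produces a single adapted treebank to train a parser on
```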
Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, 2017
Lexicon-based methods using syntactic rules for polarity classification rely on parsers that are dependent on the language and on treebank guidelines. Thus, rules are also dependent and require adaptation, especially in multilingual scenarios. We tackle this challenge in the context of the Iberian Peninsula, releasing the first symbolic syntax-based Iberian system with rules shared across five official languages: Basque, Catalan, Galician, Portuguese and Spanish. The model is made available; the resources used in this work have been integrated as part of https://github.com/aghie/uuusa. The work builds on the observations that (2) training a single model for multilingual parsing is feasible (Ammar et al., 2016) and (3) universal rules can be defined for various phenomena, provided (1) is assured (Vilares et al., 2017). Based on those, we: (a) combine existing subjectivity lexica, (b) train an Iberian tagger and parser, and (c) define a set of Iberian syntax-based rules.
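As a rough illustration of how such shared syntax-based rules operate (a toy example, not the released uuusa system): a polarity lexicon plus a negation rule applied over a Universal Dependencies-style analysis, with lexicon entries and the example annotation invented for this sketch.

```python
# Toy lexicon-based polarity scoring with one syntax-based rule:
# a negator attached as advmod flips the polarity of the word it modifies.
LEXICON = {"buena": 1, "boa": 1, "good": 1, "mala": -1, "bad": -1}
NEGATORS = {"no", "non", "not", "ez", "não"}

def sentence_polarity(tokens):
    # tokens: list of (form, head, deprel); heads are 1-based, 0 marks the root
    score = 0
    for i, (form, head, deprel) in enumerate(tokens, start=1):
        pol = LEXICON.get(form.lower(), 0)
        if pol == 0:
            continue
        negated = any(f.lower() in NEGATORS and h == i and d == "advmod"
                      for (f, h, d) in tokens)
        score += -pol if negated else pol
    return score

# "la película no es buena" with a UD-style analysis
toy = [("la", 2, "det"), ("película", 5, "nsubj"), ("no", 5, "advmod"),
       ("es", 5, "cop"), ("buena", 0, "root")]
print(sentence_polarity(toy))  # -> -1: the negation flips the positive word
```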
Normalización de términos multipalabra mediante pares de dependencia sintáctica
Proces. del Leng. Natural, 2001
On Non-Termination in DCGs
Compilation methods of minimal acyclic automata for large dictionaries
Extracción de término índice mediante cascadas de expresiones regulares
The performance of Information Retrieval systems is limited by the phenomena of linguistic variation present in texts. Natural Language Processing techniques at the word level have proved useful for reducing such variation. In this article we propose extending this approach to variation at the phrase level; to this end, the syntactic dependencies present in the documents, obtained by means of a parser, are indexed. To reduce as far as possible the computational cost associated with the parsing process, we have chosen to use a shallow parser based on cascades of finite-state transducers. Although this article focuses on the case of Spanish, our approach can be extended to other languages by suitably adapting the grammar used by the parser. Keywords: shallow parsing, finite-state transducers.
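A toy version of the cascade idea (simplified tagset, pattern and example, not the grammar used in the paper): a regular-expression level over POS-tagged text extracts head-modifier pairs that can be indexed alongside single terms.

```python
# One level of a finite-state-style cascade over "form/TAG" tokens:
# noun followed by adjective -> (head, modifier) dependency pair to index.
import re

tagged = "variación/N lingüística/ADJ en/PREP textos/N extensos/ADJ"

NOUN_ADJ = re.compile(r"(\S+)/N (\S+)/ADJ")

pairs = NOUN_ADJ.findall(tagged)
print(pairs)  # -> [('variación', 'lingüística'), ('textos', 'extensos')]
```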
LyS at TASS 2014: A Prototype for Extracting and Analysing Aspects from Spanish tweets
An Operational Model for Parsing Definite Clause Grammars with Infinite Terms
Lecture Notes in Computer Science, 1999
Current Issues in Linguistic Theory, 2000
We present a Generalized LR parsing algorithm for extensions of context-free grammars. It differs from previous approaches in the use of dynamic programming techniques to cope with non-determinism, instead of a graph-structured stack. The steps for deriving the algorithm from the classical Earley parsing algorithm are shown.
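For readers unfamiliar with the tabular style of parsing this line of work builds on, the following is a compact Earley recognizer over a toy grammar; it is only a didactic sketch of dynamic-programming parsing, not the Generalized LR algorithm of the paper.

```python
# Minimal Earley recognizer: items (lhs, rhs, dot, origin) stored in a chart,
# processed with the usual predict / scan / complete steps. Toy grammar over POS tags.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["det", "n"], ["n"]],
    "VP": [["v", "NP"], ["v"]],
}
TERMINALS = {"det", "n", "v"}

def earley_recognize(words, start="S"):
    chart = [set() for _ in range(len(words) + 1)]
    for rhs in GRAMMAR[start]:
        chart[0].add((start, tuple(rhs), 0, 0))
    for i in range(len(words) + 1):
        added = True
        while added:
            added = False
            for (lhs, rhs, dot, origin) in list(chart[i]):
                if dot < len(rhs):
                    sym = rhs[dot]
                    if sym in TERMINALS:                      # scan
                        if i < len(words) and words[i] == sym:
                            chart[i + 1].add((lhs, rhs, dot + 1, origin))
                    else:                                     # predict
                        for prod in GRAMMAR[sym]:
                            item = (sym, tuple(prod), 0, i)
                            if item not in chart[i]:
                                chart[i].add(item); added = True
                else:                                         # complete
                    for (l2, r2, d2, o2) in list(chart[origin]):
                        if d2 < len(r2) and r2[d2] == lhs:
                            item = (l2, r2, d2 + 1, o2)
                            if item not in chart[i]:
                                chart[i].add(item); added = True
    return any(it == (start, tuple(rhs), len(rhs), 0)
               for rhs in GRAMMAR[start] for it in chart[len(words)])

print(earley_recognize(["det", "n", "v", "det", "n"]))  # -> True
```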
Journal of Information Science, 2014
The vast number of opinions and reviews provided on Twitter is helpful for making interesting findings about a given industry, but given the huge number of messages published every day, it is important to detect the relevant ones. In this respect, the Twitter search functionality is not a practical tool when we want to poll messages dealing with a given set of general topics. This article presents an approach to classifying Twitter messages into various topics. We tackle the problem from a linguistic angle, taking into account part-of-speech, syntactic and semantic information, and showing how language processing techniques should be adapted to deal with the informal language present in Twitter messages. The TASS 2013 General corpus, a collection of tweets that has been specifically annotated to perform text analytics tasks, is used as the dataset in our evaluation framework. We carry out a wide range of experiments to determine which kinds of linguistic information have the greatest...
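A minimal sketch of the general classification setup, with toy data and a plain bag-of-words baseline rather than the paper's linguistic feature set or the TASS 2013 corpus:

```python
# Toy topic classifier for short messages: tf-idf word n-grams + linear model.
# The paper's experiments replace/extend these features with POS, syntactic
# and semantic information; this is only a baseline sketch.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tweets = ["el partido acabó con goles", "nuevo disco de la banda",
          "sube la bolsa hoy", "concierto esta noche"]
topics = ["sports", "music", "economy", "music"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
clf.fit(tweets, topics)
print(clf.predict(["goles en el último minuto"]))  # toy prediction on unseen text
```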
Lecture Notes in Computer Science, 2002
Adjunction is a powerful operation that makes Tree Adjoining Grammar (TAG) useful for describing the syntactic structure of natural languages. In practice, a large part of wide-coverage grammars written following the TAG formalism is formed by trees that can be combined by means of the simpler kind of adjunction defined for Tree Insertion Grammar. In this paper, we describe a parsing algorithm that makes use of this characteristic to reduce the practical complexity of TAG parsing: the expensive standard adjunction operation is only considered in those cases in which the simpler cubic-time adjunction cannot be applied.
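The distinction the algorithm exploits can be illustrated with a toy check on auxiliary trees (the nested-list tree encoding and the simplified left/right criterion are assumptions made for this sketch, not the paper's data structures):

```python
# Auxiliary trees whose foot node sits at the left or right edge of the frontier
# can be handled with the cheaper Tree Insertion Grammar adjunction; only
# "wrapping" auxiliary trees need full TAG adjunction.
def frontier(tree):
    # A tree is (label, [children]); a leaf is just its label string; foot marked with '*'.
    if isinstance(tree, str):
        return [tree]
    _, children = tree
    leaves = []
    for c in children:
        leaves.extend(frontier(c))
    return leaves

def adjunction_kind(aux_tree):
    leaves = frontier(aux_tree)
    i = next(k for k, leaf in enumerate(leaves) if leaf.endswith("*"))  # foot node
    if i == 0:
        return "left auxiliary: simple (cubic-time) adjunction"
    if i == len(leaves) - 1:
        return "right auxiliary: simple (cubic-time) adjunction"
    return "wrapping auxiliary: full TAG adjunction"

# VP -> VP* Adv : foot is the leftmost leaf, so the simpler adjunction suffices
print(adjunction_kind(("VP", ["VP*", ("Adv", ["quickly"])])))
```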
A Bidirectional Bottom-Up Parser for TAG
El sistema ERIAL: LEIRA, un entorno para RI basado en PLN
ir.ii.uam.es, 2010
In this article we present the work that the LYS Group (Lengua y Sociedad de la Información) has been carrying out recently in the areas of error-tolerant information retrieval and cross-language information retrieval. The common thread between both lines of research is the use of character n-grams as the processing unit, instead of more conventional solutions based on words or phrases. The use of n-grams allows us ...
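A minimal sketch of the character n-gram unit mentioned above (the n value and padding are illustrative choices): overlapping n-grams are robust to spelling errors and are shared across related languages, unlike whole-word terms.

```python
# Character n-gram tokenization as an indexing unit.
def char_ngrams(text, n=4):
    text = f" {text.lower()} "          # pad so word boundaries yield n-grams too
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("recuperación"))
# -> [' rec', 'recu', 'ecup', 'cupe', 'uper', 'pera', 'erac', 'raci', 'ació', 'ción', 'ión ']
```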
Proceedings of Recent Advances in Natural Language Processing (RANLP 2007), 2007
We present a technique for the construction of efficient prototypes for natural language parsing based on the compilation of parsing schemata to executable implementations of their corresponding algorithms. Taking a simple description of a schema as input, Java code for the corresponding parsing algorithm is generated, including schema-specific indexing code in order to attain efficiency. Keywords: parsing schemata, context-free grammars, tree-adjoining grammars
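To give a flavour of what executing a parsing schema means, here is a toy deductive engine for a CYK-style schema; the actual system compiles schema descriptions to indexed Java code, so this Python sketch is only an illustration, with an invented grammar and item encoding.

```python
# Items are tuples; a schema rule derives new items from existing ones until fixpoint.
GRAMMAR = [("S", "NP", "VP"), ("NP", "det", "n"), ("VP", "v", "NP")]  # binary CFG rules

def cyk_schema(words):
    items = {("item", tag, i, i + 1) for i, tag in enumerate(words)}   # axioms
    changed = True
    while changed:
        changed = False
        for (A, B, C) in GRAMMAR:                                      # binary deduction step
            for (_, X, i, k) in list(items):
                for (_, Y, k2, j) in list(items):
                    if X == B and Y == C and k == k2:
                        new = ("item", A, i, j)
                        if new not in items:
                            items.add(new); changed = True
    return ("item", "S", 0, len(words)) in items

print(cyk_schema(["det", "n", "v", "det", "n"]))  # -> True
```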
A New Approach to the Construction of Generalized LR Parsing Algorithms
Análisis eficiente de Gramáticas de Inserción de Árboles
LyS en TASS 2013: Analizando tuits en castellano a través de análisis de dependencias, lexicones de opiniones y propiedades psicométricas del lenguaje