Mario Barcala - Profile on Academia.edu
Papers by Mario Barcala
O Corpus de Referencia do Galego Actual (CORGA): estado actual e perspectivas
Lingua, pobo e terra: estudos en homenaxe a Xesús Ferro Ruibal, 2016, ISBN 978-84-453-5236-6, pp. 445-473
Caplletra. Revista Internacional de Filologia, 2020
Designing an oral corpus, and recording, encoding, and processing the materials to build a resource useful for linguistic analysis, involves numerous theoretical and methodological decisions. This article deals with those stages of corpus construction that are most clearly conditioned by the computer processing needed to make the corpus functional. To reconcile initial expectations with the real possibilities when using the tool, each feature we intend to encode must be weighed in terms of the workload it entails and the means required to make it possible. It is therefore essential to take into account the resources available for processing and exploiting the corpus, since they have a fundamental impact on decisions about its construction. Based on the experience gained in building the ESLORA corpus, the article analyses some of the problems that arise in the process of design...
El proyecto Gari-Coter en el seno del proyecto RICOTERM2
sepln.org, 2007
In this work we analyse the most relevant aspects for defining a methodology that enables the construction of structured, XML-based text corpora.
Procesamiento del lenguaje natural, 2003
In this article we present a set of Natural Language Processing techniques applied to term normalization in Text Information Retrieval. The aim of these techniques is to handle morphological and lexical linguistic variation. Specifically, the article explores the use of lemmatization, its combination with stemming, and query expansion by means of synonymy thresholds.
Lecture Notes in Computer Science, 2002
We present a reflection on the evolution of the different methods for constructing minimal deterministic acyclic finite-state automata from a finite set of words. We outline the most important methods, including the traditional ones (which consist of the combination of two phases: insertion of words and minimization of the partial automaton) and the incremental algorithms (which add new words one by one and minimize the resulting automaton on-the-fly, being much faster and having significantly lower memory requirements). We analyze their main features in order to provide some improvements for incremental constructions, and a general architecture that is needed to implement large dictionaries in natural language processing (NLP) applications.
Lecture Notes in Computer Science, 2002
Lecture Notes in Computer Science, 2000
This article presents two new approaches for term indexing which are particularly appropriate for languages with a rich lexis and morphology, such as Spanish, and need few resources to be applied. At word level, productive derivational morphology is used to conflate semantically related words. At sentence level, an approximate grammar is used to conflate syntactic and morphosyntactic variants of a given multi-word term into a common base form. Experimental results show remarkable improvements with regard to classical indexing methods.
Codificación y anotación del habla en un contexto bilingüe: el corpus ESLORA de español en Galicia
Dialectología digital del español, 2020, ISBN 9788418445316, pp. 189-224
This article provides an overview of the design and composition of the ESLORA corpus and shows its usefulness in analysing social and situational variation. The corpus also contributes to the study of the processes of change related to the geographical variation of Spanish, since it records its use in a region with its own distinctive language. This aspect facilitates the recognition of the Spanish spoken in Galicia as a research object in the field of Hispanic dialectology, allowing for its comparison with other geographical varieties on an equal footing. The main focus of the paper is on some difficulties encountered during the transcription, codification, and annotation of spoken recordings, as well as on the arguments that justify the solutions adopted by the research team.
CHIMERA: Revista de Corpus de Lenguas Romances y Estudios Lingüísticos, Oct 10, 2018
ESLORA is a corpus of Spanish made up of semi-directed interviews and spontaneous conversations recorded in Galicia between 2007 and 2015. The design and construction of the corpus meet three objectives: to record the use of a variety of Spanish which to date has been scarcely documented, to gain additional insight into the methods for the construction of spoken corpora, and to develop computational tools for corpus search. The paper presents the main characteristics of ESLORA and the criteria followed in the corpus building process. It also includes a brief description of the tools used to build the corpus and how they work together to meet the project's needs. Moreover, it shows that the decisions taken at various stages of the compilation of the corpus are closely related to the wide range of possibilities for retrieving the lexical, grammatical and contextual information provided by the materials.
Compilation methods of minimal acyclic automata for large dictionaries
LNCS, 2002
Practical application of one-pass Viterbi algorithm in tokenization and part-of-speech tagging
Sentence word segmentation and Part-Of-Speech (POS) tagging are common preprocessing tasks for many Natural Language Processing (NLP) applications. This paper presents a practical application for POS tagging and segmentation disambiguation using ...
Computer Aided Systems Theory …, 2005
Lecture Notes in Computer Science, 2007
Lecture Notes in Computer Science, 2005
Construcción de sistemas de recuperación de información sobre córpora textuales estructurados de grandes dimensiones
sepln.org
Construcción de sistemas de recuperación de información sobre córpora textuales estructurados de grandes dimensiones. Fco. Mario Barcala, Centro Ramón Piñeiro, Santiago-Noia km. 3, A Barcia, 36900 Santiago de Compostela, barcala@freeresearch.org ...
Stochastic Parsing and Parallelism
CYK-like parsing algorithms are inherently parallel: there are many cells in the chart that can be calculated simultaneously. In this work, we present a study on the appropriate techniques of parallelism to obtain an optimal performance of the extended CYK algorithm, a stochastic parsing algorithm that preserves the same level of expressiveness as the one in the original grammar, and improves further ...