Mario Barcala - Profile on Academia.edu

Papers by Mario Barcala

O Corpus de Referencia do Galego Actual (CORGA): estado actual e perspectivas

Lingua, pobo e terra: estudos en homenaxe a Xesús Ferro Ruibal, 2016, ISBN 978-84-453-5236-6, pp. 445-473

Computational tools and spoken corpora design: an ongoing dialogue

Caplletra. Revista Internacional de Filologia, 2020

Designing a spoken corpus, and recording, encoding, and processing the materials to build a resource useful for linguistic analysis, involves numerous theoretical and methodological decisions. This article deals with those stages of corpus construction that are most clearly conditioned by the computational processing needed to make the corpus functional. To reconcile initial expectations with the real possibilities when using the tool, every feature we intend to encode must be weighed in terms of the workload it entails and the means required to make it possible. It is therefore essential to take into account the resources available for processing and exploiting the corpus, since they have a fundamental impact on decisions about its construction. Based on the experience gained in building the ESLORA corpus, the article analyses some of the problems that arise in the process of design...

El proyecto Gari-Coter en el seno del proyecto RICOTERM2

sepln.org, 2007

Metodología para la construcción de córpora textuales estructurados basados en XML

In this work we analyse the most relevant aspects for defining a methodology that enables the construction of structured textual corpora based on XML.

El sistema ERIAL: LEIRA, un entorno para RI basado en PLN

Manejando la variación morfológica y léxica en la recuperación de información textual

Procesamiento del lenguaje natural, 2003

In this article we present a series of Natural Language Processing techniques applied to term normalization in Text Information Retrieval. The aim of these techniques is to handle morphological and lexical linguistic variation. Specifically, the article explores the use of lemmatization, its combined use with stemming, and query expansion by means of synonymy thresholds.

Compilation Methods of Minimal Acyclic Finite-State Automata for Large Dictionaries

Lecture Notes in Computer Science, 2002

We present a reflection on the evolution of the different methods for constructing minimal deterministic acyclic finite-state automata from a finite set of words. We outline the most important methods, including the traditional ones (which consist of the combination of two phases: insertion of words and minimization of the partial automaton) and the incremental algorithms (which add new words one by one and minimize the resulting automaton on-the-fly, being much faster and having significantly lower memory requirements). We analyze their main features in order to provide some improvements for incremental constructions, and a general architecture that is needed to implement large dictionaries in natural language processing (NLP) applications.

Formal Methods of Tokenization for Part-of-Speech Tagging

Lecture Notes in Computer Science, 2002

Using Syntactic Dependency-Pairs Conflation to Improve Retrieval Performance in Spanish

Lecture Notes in Computer Science, 2000

This article presents two new approaches for term indexing which are particularly appropriate for languages with a rich lexis and morphology, such as Spanish, and need few resources to be applied. At word level, productive derivational morphology is used to conflate semantically related words. At sentence level, an approximate grammar is used to conflate syntactic and morphosyntactic variants of a given multi-word term into a common base form. Experimental results show remarkable improvements with regard to classical indexing methods.

Codificación y anotación del habla en un contexto bilingüe: el corpus ESLORA de español en Galicia

Dialectología digital del español, 2020, ISBN 9788418445316, pp. 189-224

This article provides an overview of the design and composition of the corpus ESLORA and shows its usefulness in analysing social and situational variation. The corpus also contributes to the study of the processes of change related to the geographical variation of Spanish, since it records its use in a region with its own distinctive language. This aspect facilitates the recognition of the Spanish spoken in Galicia as a research object in the field of Hispanic dialectology, allowing for its comparison with other geographical varieties on an equal footing. The main focus of the paper is on some difficulties encountered during transcription, codification, and annotation of spoken recordings, as well as on the arguments that justify the solutions adopted by the research team.

El corpus ESLORA de español oral: diseño, desarrollo y explotación

CHIMERA: Revista de Corpus de Lenguas Romances y Estudios Lingüísticos, Oct 10, 2018

ESLORA is a corpus of Spanish made up of semi-directed interviews and spontaneous conversations recorded in Galicia between 2007 and 2015. The design and construction of the corpus meet three objectives: to register the use of a variety of Spanish which to date has been scarcely documented, to gain additional insight into the methods for the construction of spoken corpora, and to develop computational tools for corpus search. The paper presents the main characteristics of ESLORA and the criteria followed in the corpus building process. It also includes a brief description of the tools used to build the corpus and how they work together to meet the project's needs. Moreover, it shows that the decisions taken at various stages of the compilation of the corpus are closely related to the wide range of possibilities for retrieving the lexical, grammatical and contextual information provided by the materials.

Compilation methods of minimal acyclic automata for large dictionaries

LNCS, 2002

Practical application of one-pass Viterbi algorithm in tokenization and part-of-speech tagging

Sentence word segmentation and Part-Of-Speech (POS) tagging are common preprocessing tasks for many Natural Language Processing (NLP) applications. This paper presents a practical application for POS tagging and segmentation disambiguation using ...

Information retrieval and large text structured corpora

Computer Aided Systems Theory– …, 2005

XML Rules for Enclitic Segmentation

Lecture Notes in Computer Science, 2007

Information Retrieval and Large Text Structured Corpora

Lecture Notes in Computer Science, 2005

Construcción de sistemas de recuperación de información sobre córpora textuales estructurados de grandes dimensiones

sepln.org

Construcción de sistemas de recuperación de información sobre córpora textuales estructurados de grandes dimensiones. Fco. Mario Barcala, Centro Ramón Piñeiro, Santiago-Noia km. 3, A Barcia, 36900 Santiago de Compostela, barcala@freeresearch.org ...

Stochastic Parsing and Parallelism

CYK-like parsing algorithms are inherently parallel: there are a lot of cells in the chart that can be calculated simultaneously. In this work, we present a study on the appropriate techniques of parallelism to obtain an optimal performance of the extended CYK algorithm, a stochastic parsing algorithm that preserves the same level of expressiveness as the one in the original grammar, and improves further ...
