Iria del Río Gayo | University of Lisbon
Papers by Iria del Río Gayo
Proces. del Leng. Natural, 2014
The great amount of text produced every day on the Web has made it one of the main sources for obtaining linguistic corpora, which are further analyzed with Natural Language Processing techniques. On a global scale, languages such as Portuguese (official in 9 countries) appear on the Web in several varieties, with lexical, morphological and syntactic (among others) differences. Besides, a unified spelling system for Portuguese has been recently approved, and its implementation process has already started in some countries. However, it will last several years, so different varieties and spelling systems coexist. Since PoS-taggers for Portuguese are specifically built for a particular variety, this work analyzes different training corpora and lexica combinations aimed at building a model with high-precision annotation in several varieties and spelling systems of this language. Moreover, this paper presents different dictionaries of the new orthography (Spelling Agreement) as well as...
We present LDM-PT, a lexicon of discourse markers for European Portuguese, composed of 252 pairs of discourse marker/rhetorical sense. The lexicon covers conjunctions, prepositions, adverbs, adverbial phrases and alternative lexicalizations with a connective function, as in the PDTB (Prasad et al., 2008; Prasad et al., 2010). For each discourse marker in the lexicon, there is information regarding its type, category, mood and tense restrictions over the sentence it introduces, rhetorical sense, following the PDTB 3.0 sense hierarchy (Webber et al., 2016), as well as a link to an English near-synonym and a corpus example. The lexicon is compiled in a single Excel spreadsheet that is later converted to an XML scheme compatible with the DiMLex format (Stede, 2002). We give a detailed description of the contents and format of the lexicon, and discuss possible applications of this resource for discourse studies and discourse processing tools for Portuguese.
In this paper we present NLI-PT, the first Portuguese dataset compiled for Native Language Identification (NLI), the task of identifying an author's first language based on their second language writing. The dataset includes 1,868 student essays written by learners of European Portuguese, native speakers of several first languages, among them Swedish. NLI-PT includes the original student text and four different types of annotation: POS, fine-grained POS, constituency parses, and dependency parses. NLI-PT can be used not only in NLI but also in research on several topics in the field of Second Language Acquisition and educational NLP. We discuss possible applications of this dataset and present the results obtained for the first lexical baseline system for Portuguese NLI.
We present the error tagging system of the COPLE2 corpus and the first results of its implementation. The system takes advantage of the corpus architecture and the possibilities of the TEITOK environment to reduce manual effort and produce a final standoff, multi-level annotation with position-based tags that account for the main error types observed in the corpus. The first step of the tagging process involves the manual annotation of errors at the token level. We have already annotated 47% of the corpus using this approach. In a further step, the token-based annotations will be automatically transformed (fully or partially) into position-based error tags. COPLE2 is the first Portuguese learner corpus with error annotation. We expect that this work will support new research in different fields connected with Portuguese as a second/foreign language, like Second Language Acquisition/Teaching or Computer Assisted Learning.
In this article, we present COPLE2, a new corpus of Portuguese that encompasses written and spoken data produced by foreign learners of Portuguese as a foreign or second language (FL/L2). Following the trend towards learner corpus research applied to less commonly taught languages, it is our aim to enhance the learning data of Portuguese L2. These data may be useful not only for educational purposes (design of learning materials, curricula, etc.) but also for the development of NLP tools to support students in their learning process. The corpus is available online through the TEITOK environment, a web-based framework for corpus treatment that provides several built-in NLP tools and a rich set of functionalities (multiple orthographic transcription layers, lemmatization and POS, normalization of the tokens, error annotation) to automatically process and annotate texts in XML format. A CQP-based search interface allows searching the corpus for different fields, such as words, lemmas, POS tags or error tags. We describe the work in progress regarding the constitution and linguistic annotation of this corpus, focusing particularly on error annotation.
This doctoral thesis carries out a linguistic study of questions in Spanish and describes SpQA, a syntactic-semantic parser, based on that study, developed to analyze questions in a Question Answering system. SpQA performs a deep linguistic analysis of questions, automatically extracting key information for the question analysis module of a Question Answering system.
The objective of this article is to present an automatic tool for detecting and classifying grammatical errors in written language as well as to describe the evaluation protocol we have carried out to measure its performance on learner corpora. The tool was designed to detect and analyse the linguistic errors found in text essays, assess the writing proficiency, and propose solutions with the aim of improving the linguistic skills of students. It makes use of natural language processing and knowledge-rich linguistic resources. So far, the tool has been implemented for the Galician language. The system has been evaluated on two learner corpora reaching 91% precision and 65% recall (76% F-score) for the task of detecting different types of grammatical errors, including spelling, lexical and syntactic ones.
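As a quick check of the figures reported in this abstract, the sketch below recomputes the F-score from the stated precision and recall. It assumes the reported value is the balanced F1 (the harmonic mean of precision and recall), which the abstract does not state explicitly; the function name `f_score` and the `beta` parameter are illustrative and do not come from the paper.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """Return the F-beta score; beta=1.0 gives the harmonic mean (F1)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when both inputs are zero
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Values taken from the abstract: 91% precision, 65% recall.
f1 = f_score(0.91, 0.65)
print(f"F1 = {f1:.4f}")
```

Under that assumption the result rounds to the 76% given in the abstract.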
Procesamiento del Lenguaje Natural, 53, Sep 1, 2014
Abstract: The great amount of text produced every day on the Web has made it one of the main sources for obtaining linguistic corpora, which are further analyzed with Natural Language Processing techniques. On a global scale, languages such as Portuguese (official in 9 countries) appear on the Web in several varieties, with lexical, morphological and syntactic (among others) differences. Besides, a unified spelling system for Portuguese has been recently approved, and its implementation process has already started in some countries. However, it will last several years, so different varieties and spelling systems coexist. Since PoS-taggers for Portuguese are specifically built for a particular variety, this work analyzes different training corpora and lexica combinations aimed at building a model with high-precision annotation in several varieties and spelling systems of this language. Moreover, this paper presents different dictionaries of the new orthography (Spelling Agreement) as well as a new freely available testing corpus, containing different varieties and textual typologies.
Keywords: PoS-tagging, Portuguese, Web as Corpus, Spelling Agreement.
The Eurocall Review, Mar 1, 2013
Procesamiento del Lenguaje Natural, 49, 2012
Abstract: In this paper we present a first approximation to the automatic detection of zero subjects and impersonal constructions in Brazilian Portuguese. To the best of our knowledge, this is the first attempt at approaching this task using machine learning in Portuguese. We compiled a corpus containing more than 5,600 instances annotated with the classes to be identified: explicit subjects, zero subjects or pronouns and impersonal constructions. We applied machine learning using linguistically motivated features to classify the instances. The results are modest but promising and provide guidance for future work.
Keywords: subject ellipsis, impersonal construction, zero pronoun, null subject, machine learning
Estudos de Lingüística Galega, 4, 2012
Automatic named entity recognition and classification are important tasks for many natural language processing applications, such as machine translation, information extraction or question-answering systems. This paper describes the adaptation and implementation of several open-source systems for the identification and classification of the following named entities in Galician: (i) dates, (ii) numerals, (iii) quantities and (iv) proper nouns. Analysis of the first three types of named entities is performed with the software FreeLing, using finite-state automata. For the proper noun recognition task, two methods were compared: (i) finite-state automata and (ii) machine learning models. Finally, the semantic classification of proper nouns was carried out with a rule-based system that takes advantage of automatically obtained resources. This paper shows some evaluations for each tool, all available under free licenses.
Abstract: This paper presents a comparable corpus of Portuguese and Spanish consisting of legal and health texts. We describe the annotation of zero subjects, impersonal constructions and explicit subjects in the corpus. We annotated 12,492 examples using a scheme that distinguishes between different linguistic levels (phonology, syntax, semantics, etc.) and present a taxonomy of instances on which annotators disagree. The high level of inter-annotator agreement (83%–95%) and the performance of learning algorithms trained on the corpus show that our corpus is a reliable and useful resource.
Question processing is a key step in Question Answering systems. For this task, it has been shown that a good syntactic analysis of questions helps to improve the results. However, general parsers seem to present some disadvantages in question analysis. We present a specific tool under development for Spanish question analysis in a QA context: SpQA. SpQA is a parser designed to deal with the special syntactic features of Spanish questions and to cover some needs of question analysis in QA systems such as target identification. The system has been evaluated together with three Spanish general parsers. In this comparative evaluation, SpQA shows the best results in Spanish question analysis.
Abstract: Syntactic analysis of questions is a crucial step of Question Answering (QA) systems. There are free Spanish parsers for this task. In this paper, we ask about the possibility of using these parsers for question analysis, a specific task of QA. For this reason, we propose an evaluation of three free Spanish parsers: two dependency parsers and one constituency parser. The evaluation accounts only for the syntactic analysis of questions. Our results show that the three parsers perform well in constituent identification, but worse in function labelling. We also show that, as has been shown for English, constituency parsers perform better than dependency parsers in question analysis.
Keywords: parser; Question Answering; question; syntactic analysis; parsing evaluation.
Abstract: Spanish and Portuguese are pro-drop languages. However, the differences between some varieties of Portuguese have led researchers to consider Brazilian Portuguese a partial pro-drop language. This paper explores the pro-drop phenomenon in Brazilian Portuguese and compares it to Iberian Spanish using comparable corpora. Our results discuss the differences found between these two languages with a special focus on the syntactic possibilities of subject realization.
Keywords: pro-drop; comparable corpora; explicit subject; null subject; impersonal construction; reflex passive.
Lingüística XL. El lingüista del siglo XXI. I. Why ellipsis?
Abstract: We present a syntactic study of a particular type of Spanish interrogative clause (the partial direct interrogative with an object), based mainly on corpus data. This work is part of a research project whose main goal is the formalization of these interrogative clauses for the construction of the module that accounts for them in the general-purpose formal grammar Avalon. We analyze three syntactic aspects of these clauses: order of constituents, interrogative particle and number of arguments. We have chosen these three aspects for different reasons: the first two because they are the most characteristic syntactic features of this type of clause; the last one, number of arguments, because it is a perfect example of the kind of linguistic data that we can obtain from a corpus and not from the theory.
Keywords: interrogative clause, corpus, order of constituents, interrogative particle, number of arguments