How to annotate morphologically rich learner language. Principles, problems and solutions
Related papers
Detailed Error Annotation for Morphologically Rich Languages: Latvian Use Case
2020
This paper presents a detailed error annotation for morphologically rich languages. The described approach is used to create the Latvian Language Learner corpus (LaVA), which is part of the ongoing project "Development of Learner corpus of Latvian: methods, tools and applications". There is no need for an advanced multi-token error annotation schema, because the error-annotated texts are written by beginner-level (A1 and A2) learners who use simple syntactic structures. This schema focuses on in-depth categorization of spelling and word formation errors. The annotation schema will work best for languages with relatively free word order and rich morphology.
Apples - Journal of Applied Language Studies, 2014
This paper introduces the Corpus of Advanced Learner Finnish (LAS2), one of the existing corpora of learner Finnish. The corpus was started at the University of Turku in 2007, and the initial motivation for its collection was to make it possible to deal with novel linguistic challenges posed by academic immigration and to contribute to corpus linguistics, Finnish linguistics and the study of second language acquisition. This paper describes the typological standpoint of the LAS2, its position with respect to other corpora of learner Finnish, the compilation criteria, the annotation applied and the workflow implemented. The corpus consists of three subcorpora of written academic texts by non-native speakers of Finnish. The subcorpora are 1) texts for examination purposes, 2) texts for publishing and graduating purposes, and 3) texts for studying and learning purposes. The informants either study or work in Finnish within academia in Finland. When available, the data has been collected longitudinally. A reference corpus written by native speakers has also been compiled for each subcorpus. Three query tools designed within the framework of the LAS2 are also introduced. These tools enable queries based on any combination of the linguistic annotation. They can also be used to analyse the typical inner or cotextual variation of any user-specified linguistic node or to create frequency lists of multiword units defined at any level of the annotation. The queries can be limited to a user-specified subset of the data.
2018
CzeSL (Hana et al. 2010, http://utkl.ff.cuni.cz/learncorp/) is a learner corpus of texts produced by non-native speakers of Czech. Such corpora are a great source of information about specific features of learners’ language, helping language teachers and researchers in the area of second language acquisition. Each sentence in the CzeSL corpus has an error annotation and a target hypothesis with its morphological and syntactic annotation. However, there is no linguistic annotation of the original text. This means we can see what grammatical constructions the authors should have used but not what they actually used, and we can analyze their grammar only indirectly via the error annotation. For these reasons, in our project, we have focused on syntactic annotation of the non-native text within the framework of Universal Dependencies (http://universaldependencies.org). As far as we know, this is the first project annotating a richly inflectional non-native language. Our ideal goal has been ...
Error Tagging in the Lithuanian Learner Corpus
Human Language Technologies – The Baltic Perspective, 2020
This paper is a work-in-progress report on error annotation in the Lithuanian Learner Corpus (LLC), which has been developed using the TEITOK environment. The LLC is the first electronic corpus of learner Lithuanian that represents learners of very diverse native language backgrounds and different proficiency levels. In this paper, we have a double aim: firstly, we present the structure of the corpus in its current state; and secondly, we describe the main principles, procedures, and challenges of error annotation in the LLC. The main types of errors that are tagged in this corpus and analysed in this paper are orthographic, lexical, and syntactic.
The SweLL Language Learner Corpus: From Design to Annotation
The Northern European Journal of Language Technology
The article presents a new language learner corpus for Swedish, SweLL, and its methodology, from collection and pseudonymisation (to protect the personal information of learners) to annotation adapted to second language learning. The main aim is to deliver a well-annotated corpus of essays written by second language learners of Swedish and make it available for research through a browsable environment. To that end, a new annotation tool and a new project management tool have been implemented, both with the main purpose of ensuring the reliability and quality of the final corpus. In the article we discuss the reasoning behind metadata selection and the principles of gold corpus compilation, and argue for the separation of normalization from correction annotation.