Russian Learner Corpus: Towards Error-Cause Annotation for L2 Russian
Related papers
Russian Error-Annotated Learner English Corpus: a Tool for Computer-Assisted Language Learning
Proceedings of the third workshop on NLP for computer-assisted language learning at SLTC 2014, Uppsala University, 2014
The paper describes a learner corpus composed of English essays written by native Russian speakers. REALEC (Russian Error-Annotated Learner English Corpus) is an error-annotated corpus, available online, now containing more than 200 thousand word tokens in almost 800 essays. It is one of the first Russian ESL corpora, dynamically developing and striving to grow both in size and in the features offered to users. We describe our perspective on the corpus and the data sources and tools used in compiling it. Our elaborate, purpose-built classification of learner error types is described in detail. The paper also presents a pilot experiment on creating test sets for particular learner problems using corpus data.
Automatic error detection in Russian learner language
2013
Learner corpora, also known as interlanguage (IL) or second language (L2) corpora, have become increasingly popular resources in language research over the past decade. Learner corpora provide a large volume of rich data for theoretical and applied language studies. Like native (or L1) corpora, learner corpora are particularly useful for research when they are tagged, and learner corpora often contain tags that are more intricate than those found in L1 corpora. Metalinguistic tags, for instance, often contain information relevant both to the author of the text (language background, level, etc.) and to the task (genre, format, time restriction, etc.). With regard to grammatical annotation, in addition to the usual lemmatisation and morphosyntactic tagging, L2 corpora may contain error tags that provide information on deviant language use. Error-tagging is known to be a resource-consuming and technologically challenging task, more so for highly inflectional languages such as Russian, with i...
Classifying Syntactic Errors in Learner Language
Proceedings of CoNLL, 2020
We present a method for classifying syntactic errors in learner language, namely errors whose correction alters the morphosyntactic structure of a sentence. The methodology builds on the established Universal Dependencies syntactic representation scheme, and provides complementary information to other error-classification systems. Unlike existing error classification methods, our method is applicable across languages, which we showcase by producing a detailed picture of syntactic errors in learner English and learner Russian. We further demonstrate the utility of the methodology for analyzing the outputs of leading Grammatical Error Correction (GEC) systems.
Corpus of Russian student texts: design and prospects
2015
The Corpus of Russian Student Texts (CoRST) is a computational and research project started in 2013 at the Linguistic Laboratory for Corpora Research Technologies at HSE. It comprises a collection of Russian texts written by students from various Russian universities. Its main research goal is to examine language deviations viewed as markers of language change. CoRST is supplied with metalinguistic, morphological, and error annotation that enables users to build custom subcorpora and search by various error types. Its error annotation is based on a modular classification — lexis, grammar, and discourse — within which the most frequent error phenomena are further distinguished. In total, the error classification encompasses 39 error tags (20 higher-level and 19 lower-level). A crucial characteristic of CoRST is that the error annotation is multi-layered: since an erroneous segment can often be corrected in several ways, it is typically annotated with several corresponding error tags. Moreover, the corpus provid...
Annotating ESL errors: Challenges and rewards
2010
In this paper, we present a corrected and error-tagged corpus of essays written by non-native speakers of English. The corpus contains 63,000 words and includes data from learners of English of nine first-language backgrounds. The annotation was performed at the sentence level and involved correcting all errors in the sentence. The error classification includes mistakes in preposition and article usage, as well as errors in grammar, word order, and word choice.