Distributional analysis of Russian lexical errors

Developing a Toolkit for Distributional Analysis of Abnormal Collocations in Russian

We propose a distributional approach to the automatic correction of abnormal collocations in a Russian text corpus containing different types of erroneous word combinations, in particular construction blending. We develop a toolkit that uses syntactic bigrams from RNC Sketches as training data, together with a Word2Vec semantic model. A corpus of Russian Student Texts annotated for erroneous word combinations and parsed morpho-syntactically with TreeTagger and MaltParser was used in the experiments. The annotated construction blending errors were analyzed in terms of error correction by automatically proposing substitution candidates. The correction algorithm involves a set of association metrics based on context selectional preferences and semantic modeling, which makes it possible to rank substitution candidates by their acceptability. Experimental results with nouns annotated as construction blending errors demonstrate the effectiveness of our toolkit. The results show that the co-occurrence and Word2Vec semantic models rank the candidates according to different principles: purely constructional and semantic, respectively. As a result, the use of Word2Vec semantic filtering improves the quality of error correction.
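The interplay between co-occurrence ranking and semantic filtering described in this abstract can be illustrated with a minimal sketch. It is not the authors' toolkit: the abstract does not name the association metrics, so pointwise mutual information (PMI) is assumed here, and cosine similarity over hand-made toy vectors stands in for a trained Word2Vec model. All counts, vectors, the example blend "играть значение", and the function names are hypothetical.

```python
import math

# Toy statistics standing in for syntactic bigram counts (all values invented).
TOTAL_BIGRAMS = 1000
HEAD_COUNTS = {"играть": 100}
NOUN_COUNTS = {"роль": 80, "гитара": 40, "смысл": 60}
BIGRAM_COUNTS = {
    ("играть", "роль"): 50,
    ("играть", "гитара"): 30,
    ("играть", "смысл"): 1,
}

# Toy 3-dimensional vectors standing in for Word2Vec embeddings (invented).
VECTORS = {
    "значение": [1.0, 0.2, 0.0],
    "роль": [0.8, 0.5, 0.1],
    "смысл": [0.9, 0.1, 0.1],
    "гитара": [0.0, 0.1, 1.0],
}


def pmi(head, noun):
    """Pointwise mutual information of a (head, noun) bigram; -inf if unseen."""
    joint = BIGRAM_COUNTS.get((head, noun), 0)
    if joint == 0:
        return float("-inf")
    return math.log2(joint * TOTAL_BIGRAMS / (HEAD_COUNTS[head] * NOUN_COUNTS[noun]))


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_candidates(head, wrong_noun, candidates, min_similarity=None):
    """Rank substitution candidates for wrong_noun by PMI with the head word.

    If min_similarity is given, candidates whose vector is less similar to
    the erroneous noun than that threshold are dropped before ranking.
    """
    if min_similarity is not None:
        candidates = [
            c for c in candidates
            if cosine(VECTORS[c], VECTORS[wrong_noun]) >= min_similarity
        ]
    return sorted(candidates, key=lambda c: pmi(head, c), reverse=True)


candidates = ["роль", "гитара", "смысл"]
by_cooccurrence = rank_candidates("играть", "значение", candidates)
with_filter = rank_candidates("играть", "значение", candidates, min_similarity=0.5)
```

With these toy numbers, ranking by co-occurrence alone puts "гитара" first, because it combines strongly with the verb regardless of the intended meaning. Adding the similarity threshold removes it and leaves "роль" on top, which is the kind of effect the abstract attributes to semantic filtering.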

Classification of Lexical Collocation Errors in the Writings of Learners of Spanish

2015

It is generally acknowledged that collocations in the sense of idiosyncratic word co-occurrences are a challenge in the context of second language learning. Advanced miscollocation correction is thus highly desirable. However, state-of-the-art “collocation checkers” are merely able to detect a possible miscollocation and then offer, as a correction suggestion, a list of collocations of the given keyword retrieved automatically from a corpus. No more targeted correction is possible, since state-of-the-art collocation checkers are not able to identify the type of the miscollocation. We suggest a classification of the main types of lexical miscollocations by US American learners of Spanish and demonstrate its performance.

Syntagmatic Relations in Russian Corpora and Dictionaries

Pragmantax II. Zum aktuellen Stand der Linguistik und ihren Teildisziplinen. The Present State of Linguistics and its Sub-Disciplines. Frankfurt a.M.: Peter Lang, 2014. S. 333-344.

The paper describes the notions of collocability and collocation, the statistical background for collocation extraction, and experiments in applying statistical tools to extract collocations from Russian texts.

Automatic error detection in Russian learner language

2013

Learner corpora, also known as interlanguage (IL) or second language (L2) corpora, have become increasingly popular resources in language research in the past decade. Learner corpora provide a large volume of rich data for theoretical and applied language studies. Just like native (or L1) corpora, learner corpora are particularly useful for research when they are tagged, and learner corpora often contain tags that are more intricate than those found in L1 corpora. Metalinguistic tags, for instance, often contain information relevant both to the author of the text (language background, level, etc.) and to the task (genre, format, time restriction, etc.). With regard to grammatical annotation, in addition to the usual lemmatisation and morphosyntactic tagging, L2 corpora may contain error tags that provide information on deviant language use. Error-tagging is known to be a resource-consuming and technologically challenging task, more so for highly inflectional languages such as Russian, with i...

Paronyms for Accelerated Correction of Semantic Errors

2003

The errors usually made by authors during text preparation are classified. The notion of semantic errors is elaborated, and malapropisms are singled out among them as words “similar” to the intended word but essentially distorting the meaning of the text. Whatever the method of malapropism correction, we propose compiling dictionaries of paronyms beforehand, i.e., of words similar to each other in letters, sounds or morphs.
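A precompiled paronym dictionary of the kind proposed here can be sketched as follows. The sketch covers only similarity in letters, which it approximates with Levenshtein edit distance; similarity in sounds or morphs is not modeled. The distance threshold, the tiny vocabulary, and the function names are assumptions made for illustration and are not taken from the paper.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete a character of a
                curr[j - 1] + 1,     # insert a character of b
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def build_paronym_dict(vocabulary, max_distance=1):
    """Map every word to the other vocabulary words within max_distance edits."""
    return {
        word: sorted(
            other for other in vocabulary
            if other != word and edit_distance(word, other) <= max_distance
        )
        for word in vocabulary
    }


# Hypothetical vocabulary; a real dictionary would cover a full lexicon.
paronyms = build_paronym_dict(["word", "world", "ward", "travel"])
```

Because the dictionary is built once, a correction step only needs a lookup per suspicious word instead of a scan of the whole lexicon. The pairwise comparison used to build it is quadratic in vocabulary size, so a realistic lexicon would need an indexing scheme rather than this brute-force loop.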

On correction of semantic errors in natural language texts with a dictionary of literal paronyms

2004

Due to the open nature of the Web, search engines must include means of meaningful processing of incorrect texts, including automatic error detection and correction. One widespread type of error in Internet texts is the malapropism, i.e., a semantic error that replaces a word with another existing word similar in letter composition and/or sound but semantically incompatible with the context. Methods for the detection and correction of malapropisms have been proposed recently.

On detection of malapropisms by multistage collocation testing

NLDB-2003, 8th Int. Conf. on Application of …, 2003

A malapropism is a (real-word) error in a text that consists in the unintended replacement of one content word by another existing content word, similar in sound but semantically incompatible with the context, which destroys text cohesion, e.g., they travel around the word. We present an algorithm of malapropism detection and correction based on evaluating text cohesion. As a measure of the semantic compatibility of words, we consider their ability to form syntactically linked and semantically admissible word combinations (collocations), e.g., travel (around the) world. Text cohesion at a content word is then measured as the number of collocations it forms with the words in its immediate context. We detect malapropisms as words forming no collocations in the context. To test whether two words can form a collocation, we consider two types of resources: a collocation DB and an Internet search engine, e.g., Google. We illustrate the proposed method by classifying, tracing, and evaluating several English malapropisms. * Work done under partial support of Mexican Government (CONACyT, SNI), CGEPI-IPN, Mexico, and RITOS-2. We thank Prof. G. Hirst for useful discussion and criticism.
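The detection criterion in this abstract (a content word that forms no collocations with its immediate context) can be sketched in a few lines. This is a simplified illustration, not the paper's multistage procedure: an in-memory set of word pairs stands in for the collocation DB, the search-engine test is omitted, and the window size, the paronym list, the example word sequence, and all function names are hypothetical.

```python
# Toy collocation database: unordered pairs of content words (invented).
COLLOCATIONS = {
    frozenset({"tourists", "travel"}),
    frozenset({"travel", "world"}),
}

# Toy list of similar-sounding replacement candidates (invented).
PARONYMS = {"word": ["ward", "world"]}


def cohesion(index, words, collocations, window=2):
    """Number of collocations words[index] forms with neighbours in the window."""
    target = words[index]
    lo = max(0, index - window)
    hi = min(len(words), index + window + 1)
    return sum(
        1
        for j in range(lo, hi)
        if j != index and frozenset({target, words[j]}) in collocations
    )


def detect_malapropisms(words, collocations, window=2):
    """Return positions of content words that form no collocation in context."""
    return [
        i for i in range(len(words))
        if cohesion(i, words, collocations, window) == 0
    ]


def suggest_corrections(index, words, collocations, paronyms, window=2):
    """Return paronyms of words[index] that restore cohesion at that position."""
    suggestions = []
    for candidate in paronyms.get(words[index], []):
        trial = words[:index] + [candidate] + words[index + 1:]
        if cohesion(index, trial, collocations, window) > 0:
            suggestions.append(candidate)
    return suggestions


# Content words only; function words such as "around" and "the" are dropped.
content_words = ["tourists", "travel", "word"]
suspects = detect_malapropisms(content_words, COLLOCATIONS)
fixes = {
    content_words[i]: suggest_corrections(i, content_words, COLLOCATIONS, PARONYMS)
    for i in suspects
}
```

Here only "word" is flagged, since "travel" still collocates with "tourists", and "world" is the only candidate that restores a collocation. In a two-word context such as "travel" and "word" alone, both words would be flagged, so the correction step is what identifies which of them is the actual error.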

A Language Model for Grammatical Error Correction in L2 Russian

arXiv (Cornell University), 2023

Grammatical error correction is one of the fundamental tasks in Natural Language Processing. For the Russian language, most of the spellcheckers available correct typos and other simple errors with high accuracy, but often fail when faced with non-native (L2) writing, since the latter contains errors that are not typical for native speakers. In this paper, we propose a pipeline involving a language model intended for correcting errors in L2 Russian writing. The language model proposed is trained on untagged texts of the Newspaper subcorpus of the Russian National Corpus, and the quality of the model is validated against the RULEC-GEC corpus.

Automatic Identification of English Collocation Errors based on Dependency Relations

2013

We present an English miscollocation identification system based on dependency relations drawn from the Stanford parser. We test our system against a subset of the error-tagged Chinese Learner English Corpus (CLEC) and obtain an overall precision of 0.75. We describe some applications and limitations of our system and suggest directions for future research.