Olga Lyashevskaya | National Research University Higher School of Economics
Papers by Olga Lyashevskaya
NSU Vestnik. Series: Linguistics and Intercultural Communication
Corpus-based language study is a widespread practice in modern linguistics. In our study, we address so-called mass literature (or paraliterature) via a corpus of texts. The standardization of mass literature allows us to describe its genres by applying literary formulas. In brief, a formula embodies cultural themes and stereotypes in a universal form. While mass literature is a common subject of literary and cultural studies, literary formulas have not been studied well enough from a linguistic point of view. We suggest that the microgenres of paraliterature may differ in syntax as well as in vocabulary. Our work is based on mass literature corpora and provides an analysis of verb constructions characteristic of microgenres (love story, detective story, science fiction novel, and fantasy). In order to identify the distinctive features of mass literature microgenres, we have conducted a series of machine learning experiments. As a dataset...
Universal Dependencies Consortium, Nov 15, 2020
Quantitative Approaches to the Russian Language, 2017
Communications in Computer and Information Science, 2018
Cross-tagset parsing is based on the substitution of one annotation layer for another while processing data within one language. As often as not, either the native tagger or the dependency parser used in (pre-)annotation of the Gold treebank is not available. The cross-tagset approach allows one to annotate new texts using freely available tools or tools optimized for the user's needs. We evaluate the robustness of Russian dependency parsing using different morphological and syntactic tagsets in input and output. Qualitative analysis of errors shows that the cross-substitution of three morphological tagsets and two syntactic tagsets causes only a mild drop in performance.
Communications in Computer and Information Science, 2019
Poetic texts pose a challenge to full morphological tagging and lemmatization since the authors seek to extend the vocabulary, employ morphologically and semantically deficient forms, go beyond standard syntactic templates, and use non-projective constructions and non-standard word order, among other techniques of the creative language game. In this paper we evaluate a number of probabilistic taggers based on decision trees, CRF and neural network algorithms as well as a state-of-the-art dictionary-based tagger. The taggers were trained on prosaic texts and tested on three poetic samples of different complexity. Firstly, we suggest a method to compile the gold standard datasets for Russian poetry. Secondly, we focus on the taggers’ performance in the identification of part-of-speech tags and lemmas. We reveal which POS classes, paradigm classes and syntactic patterns most affect the quality of processing.
Zaliznjak & Mikaèljan 2014 is a critique of the model of Russian aspect found in Janda 2012 and Janda et al. 2013. In this rebuttal I give a brief overview of my model of Russian aspect and then address the criticisms made by Zaliznjak & Mikaèljan. I begin by examining the four assumptions stated by Zaliznjak & Mikaèljan, which I find to be unnecessary and lacking in theoretical grounding. Their assumption that aspectual correlation is uniformly directed from perfective to imperfective is particularly problematic. I compare Zaliznjak & Mikaèljan’s assumption with the single assumption my work is based on, namely that linguistic cognition is not fundamentally different from general cognition, and present the entailments of this assumption and what they mean for an investigation of Russian aspect. I then present four further problems with Zaliznjak & Mikaèljan’s model of Russian aspect: the alleged transfer of meaning from perfective to imperfective, the criteria for identifying protot...
This release contains errors in several files. Please use http://hdl.handle.net/11234/1-1983 instead.
RU-EVAL is a biennial event organized in order to estimate the state of the art in Russian NLP resources, methods and toolkits and to compare various methods and principles implemented for Russian. Russian could be treated as an under-resourced language due to the lack of freely distributable gold standard corpora for different NLP tasks (each team tried to work out their own standards). Thus, our goal was to work out a uniform basis for comparing systems based on different theoretical and engineering approaches, to build evaluation resources, and to provide a flexible system of evaluation in order to differentiate between non-acceptable and linguistically “admissible” errors. The paper reports on three events devoted to morphological tagging, dependency parsing and anaphora resolution, respectively.
In this paper we focus on syntactic annotation consistency within Universal Dependencies (UD) treebanks for Russian: UD_Russian-SynTagRus, UD_Russian-GSD, UD_Russian-Taiga, and UD_Russian-PUD. We describe the four treebanks, their distinctive features and development. In order to test and improve consistency within the treebanks, we reconsidered the experiments by Martínez Alonso and Zeman; our parsing experiments were conducted using a state-of-the-art parser that took part in the CoNLL 2017 Shared Task. We analyze error classes in functional and content relations and discuss a method to separate the errors induced by annotation inconsistency and those caused by syntactic complexity and other factors.
This paper describes the distribution of colour adjectives in Russian poetry of the Silver Age and defines individual preferences with regard to poetic tradition, syllable structure, and metrical restrictions. The research method combines a lexico-semantic approach, formal literary analysis, and quantitative metrics obtained via the frequency database of the Russian Poetry Corpus (over 10 M words, incl. 1 M adjectives). The database allows the user to compare subcorpora and create graphs of timeline distribution, which demonstrate that the lexical diversity and relative frequencies of colour adjectives start to grow rapidly in the 1890s, as modernists employ colour adjectives to upgrade the poetic inventory. The adjectives referring to non-banal hues (e.g. fioletovyj ‘violet’, lazorevyj ‘azure’) belong to the middle part of the ranked wordlist. Correspondence analysis of the data reveals individual colour preferences and stylistic similarities among the most prominent poets of the Si...
The task of semantic role labeling usually focuses on identifying and classifying the core, obligatory arguments of the predicate. The adjuncts of Time, Location, etc. (non-core, modifier arguments) are considered to be on the periphery of the task [30] and even its easy part [44], despite the fact that they are highly integrated into the clause structure and may nontrivially interact with the meaning of the verb [4, 32]. In this paper, we present experiments on labeling the adjunct roles of LOCATION, TIME, MANNER, DEGREE, REASON, and PURPOSE, based on the manually annotated AdjunctsFrameBank data set. The results show an average F1-score of 0.94 on the gold adjunct phrase annotations using the word2vec representations of adjuncts, word2vec representations of predicates, and the morphosyntactic marking of adjuncts. Our findings generally corroborate the theoretical hypothesis on the structural and semantic autonomy and lexico-morphosyntactic specialization of adjuncts. Yet, a more complicated organization of their network is revealed, pointing to the diversity of adjuncts in terms of their distribution and behavior.
Russian Linguistics, 2020
This article provides a quantitative corpus-based investigation of the Russian verb rhyme and its change in the Russian poetic tradition from the beginning of the 19th century to the 1960s. Versologists have studied the rhyme primarily as a phonetic entity, whereas morphology also contributes to the rhyme euphony due to the regularity of grammatical affixes. The research focuses on a micro-diachronic analysis of verb rhymes, summarises the identified historical trends, and defines acceptable and clearly avoided verb forms. The article also analyses the morphological patterns of verb rhymes including the most common lexical pairings and combinations of particular grammatical forms with different parts of speech. The study analyses data from the Corpus of Russian Poetry (a part of the Russian National Corpus) and introduces research methods and a corpus-based tool that were designed specifically for the statistical analysis and computational modelling of poetic features. The results show that authors experimented with word rhyme in various ways during different periods. Despite the idea of non-aesthetic verbal rhyme, which has existed since the time of A. Kantemir, its use varies over time: there are periods of rise and fall. We distinguish two classes of rhyming pairs: combinations of two verb forms and morphologically dissimilar combinations of a verb form with a word of another part of speech. We conclude that restrictions on verbal rhyme apply mainly to combinations of past tense and infinitive forms.
SSRN Electronic Journal, 2018
A data analysis tool of the Corpus of Russian Poetry (a part of the Russian National Corpus) is designed for quantitative research in various areas of versology and on linguistic aspects of poetic texts. The core part, a frequency database of the corpus, includes annotation at the level of texts, verses, and words, as well as patterns of words, letters, and stress. The tool allows a user to study certain properties (e.g. rhyming patterns, lexical co-occurrence) taken alone and in their interaction, both in the whole corpus and in subcorpora. Besides that, it facilitates contrastive studies of two chosen subcorpora. The paper reports a few case studies demonstrating applicable descriptive and exploratory methods and potential for further research in the field of digital literary studies. JEL Classification: Z.
Communications in Computer and Information Science, 2015
Russian FrameBank is a bank of annotated samples from the Russian National Corpus which documents the use of lexical constructions (e.g. argument constructions of verbs and nouns). FrameBank belongs to FrameNet-oriented resources, but unlike Berkeley FrameNet it focuses more on the morphosyntactic and semantic features of individual lexemes than on generalized frames, following the theoretical approaches of Construction Grammar (Ch. Fillmore, A. Goldberg, etc.) and of the Moscow Semantic School (Ju. D. Apresjan, E. V. Paducheva, etc.).