Polina Panicheva | Saint-Petersburg State University
Papers by Polina Panicheva
EPJ Data Science
Despite recent achievements in predicting personality traits and some other human psychological features from digital traces, prediction of subjective well-being (SWB) appears to be a relatively new task with few solutions. The COVID-19 pandemic has added both a stronger need for rapid SWB screening and new opportunities for it, with online mental health applications gaining popularity and accumulating large and diverse user data. Nevertheless, only a few existing works have aimed at predicting SWB, and they have done so only in terms of Diener’s Satisfaction with Life Scale. None of them analyzes the scale developed by the World Health Organization, known as WHO-5 – a widely accepted tool for screening mental well-being and, specifically, for depression risk detection. Moreover, existing research is limited to English-speaking populations and tends to use text, network and app usage types of data separately. In the current work, we cover these gaps by predicting both mentioned SWB scale...
Our experiment is aimed at evaluating the performance of distributional semantic features in metaphor identification in Russian raw text. We apply two types of distributional features representing similarity between the metaphoric/literal verb and its syntactic or linear context. Our approach is evaluated on a dataset of contexts for nine Russian verbs, which is made available to the community. The results show that both sets of similarity features are useful for metaphor identification and do not replicate each other, as their combination systematically improves the performance for individual verb sense classification, reaching state-of-the-art results for verbal metaphor identification. A combined verb classification demonstrates that the suggested features effectively generalize over metaphoric usage in different verbs, and shows that linear coherence features perform as well as the combined feature approach. By analyzing the errors we conclude that syntactic parsing quality is still mode...
2016 IEEE Artificial Intelligence and Natural Language Conference (AINL), 2016
The presented project is intended to make use of growing amounts of textual data in social networks in the Russian language, in order to find linguistic correlates of the Dark Triad personality traits, comprising non-clinical Narcissism, Machiavellianism and Psychopathy. The background for the investigation includes, on the one hand, psychological research on these phenomena and their measurement instruments, and on the other hand, recent advances in computational stylometry and text-based author profiling. The measures for these psychological phenomena are provided by recognized self-report psychological surveys adapted to Russian. Morphological and semantic analysis are applied to investigate the relationship between the Dark traits and their linguistic manifestation in social network texts. Significant morphological and semantic correlates of Narcissism, Machiavellianism and Psychopathy are identified and compared to respective advances in English author profiling. In order to deepe...
In the paper we present distributed vector space models based on word embeddings and a specific association-oriented count-based distributional algorithm, which have been applied to measuring association strength in Russian syntagmatic relations (namely, between nouns and adjectives). We discuss the compositional properties of the vectors representing nouns, adjectives and adjective-noun compositions and propose two methods of detecting the syntactic association possibility. The accuracy of the proposed measures is evaluated by means of a pseudo-disambiguation test procedure, and all models achieve high results. The errors are manually annotated, and the model errors are classified in terms of their linguistic nature and compositionality features.
ExLing 2016: Proceedings of 7th Tutorial and Research Workshop on Experimental Linguistics, Dec 1, 2019
An algorithm for analyzing obscure lexical collocations is proposed. It is based on a co-occurrence model and distributional semantic filtering. We apply the proposed technique to lexical errors of construction blending, as annotated in the Corpus of Russian Student Texts. Results of error processing are analyzed and classified; reasons for different results in the paraphrasing experiment are discussed.
Computers in Human Behavior, 2017
The goal of this paper was to assess the connection between dark personality traits and engagement in harmful online behaviors in a sample of Russian Facebook users, and to describe the language they use in online communication. A total of 6724 individuals participated in the study (mean age = 44.96 years, age range: 18–85 years, 77.9% female). Data was collected via a purpose-built application, which served two purposes: to administer the survey and to download consenting users' public wall posts, gender and age from the Facebook profile. The survey included questions on engagement in harmful online behaviors and the Short Dark Triad scale; 15,281 wall posts from 1972 users were included in the dataset. These posts were subjected to morphological, lexical and semantic analyses. More than 25% of the sample reported engaging in harmful online behaviors. Males were more likely to send insulting or threatening messages and post aggressive comments; no gender differences were found for disseminating other people's private information. Psychopathy and male gender were the unique predictors of engagement in harmful online behaviors. A number of significant correlations were found between the dark traits and numeric, lexical, morphological and semantic characteristics of the participants' posts.
Communications in Computer and Information Science, 2017
In the paper, vector-space semantic models based on the Word2Vec word embedding algorithm and a count-based association-oriented algorithm are evaluated and compared by measuring association strength between Russian nouns and adjectives. A dataset of nouns and associated adjectives is used as the test set for a pseudo-disambiguation task. Models are trained on corpora of Russian fiction. A measure of lexical association anomaly is applied, evaluating similarity between the initial noun and the resulting attributive phrase. Results of association strength are reported for models characterized by different parameter values; the best parameter value combinations are proposed. The test exemplars producing the error rate are manually annotated, and the model errors are categorized in terms of their linguistic nature and compositionality features.
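The pseudo-disambiguation evaluation mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the cosine-similarity scoring of noun and adjective vectors, the function names, and the toy vectors and triples are all assumptions made for illustration, and the abstract's own measure (similarity between the noun and the resulting attributive phrase) is simplified here to a direct noun-adjective comparison.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pseudo_disambiguation_accuracy(vectors, triples):
    """Share of triples where the attested adjective outscores the confounder.

    Each triple is (noun, attested_adjective, confounder_adjective); the
    confounder is an adjective not observed with the noun.
    """
    correct = 0
    for noun, attested, confounder in triples:
        attested_score = cosine(vectors[noun], vectors[attested])
        confounder_score = cosine(vectors[noun], vectors[confounder])
        if attested_score > confounder_score:
            correct += 1
    return correct / len(triples)


# Hypothetical toy vectors, not taken from any trained model.
TOY_VECTORS = {
    "tea": [0.9, 0.1, 0.0],
    "hot": [0.8, 0.2, 0.1],
    "table": [0.1, 0.2, 0.9],
    "wooden": [0.0, 0.1, 0.9],
}
TOY_TRIPLES = [("tea", "hot", "wooden"), ("table", "wooden", "hot")]

accuracy = pseudo_disambiguation_accuracy(TOY_VECTORS, TOY_TRIPLES)
```

In a setup like this, the error cases are simply the triples where the confounder scores at least as high as the attested adjective, which is the set that would then be annotated manually.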
SAGE Open, 2020
Positive mental health is considered to be a significant predictor of health and longevity; however, our understanding of the ways in which this important characteristic is represented in users’ behavior on social networking sites is limited. The goal of this study was to explore associations between positive mental health and language used in online communication in a large sample of Russian Facebook users. The five-item World Health Organization Well-Being Index (WHO-5) was used as a self-report measure of well-being. Morphological, sentiment, and semantic analyses were performed for linguistic data. A total of 6,724 participants completed the questionnaire, and linguistic data were available for 1,972. Participants’ mean age was 45.7 years (SD = 11.6 years); 73.4% were female. The dataset included 15,281 posts, with an average of 7.67 (SD = 5.69) posts per participant. Mean WHO-5 score was 60.0 (SD = 19.1), with female participants exhibiting lower scores. Use of negative sen...
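For readers unfamiliar with the WHO-5 scale, the sketch below shows how a score on the 0 to 100 range is conventionally derived. It assumes the standard WHO-5 scoring (five items rated 0 to 5, raw sum multiplied by 4); the abstract does not state which scoring the study applied, and the function name is hypothetical.

```python
def who5_percentage_score(item_scores):
    """Convert five WHO-5 item ratings (each 0-5) to a 0-100 score.

    Assumes the conventional scoring: raw sum (0-25) multiplied by 4.
    """
    if len(item_scores) != 5:
        raise ValueError("WHO-5 has exactly five items")
    if any(not 0 <= score <= 5 for score in item_scores):
        raise ValueError("each item must be rated from 0 to 5")
    return sum(item_scores) * 4


# A respondent rating every item 3 scores 60 on the 0-100 scale.
example_score = who5_percentage_score([3, 3, 3, 3, 3])
```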
Identifying subjective statements in news titles using a personal sense annotation framework
Blogs are a very important part of the digital world; indeed, they can be viewed as a digital representation of the whole world. People share pictures and videos, describe their daily life, ask questions and, of course, give opinions. The blogosphere presents a unique opportunity to obtain huge statistics about what people like, feel, need – about their ‘private states’. The vast and ever-growing volumes of ‘bloggers’ and, thus, information demand an automated way of analyzing blog texts. This gives rise to a new research direction combining computing, linguistics and psychology: sentiment analysis, the computational treatment of (in alphabetical order) opinion, sentiment, and subjectivity in text. Objective characteristics of the writers can be analyzed based on their texts: their age, gender, social affiliation, character; subjective characteristics such as moods, negative or positive opinions (polarity), and emotions towards an object can also be investigated. In the thesis we argue that...
The goal of the current work is to evaluate semantic feature aggregation techniques in a task of gender classification of public social media texts in Russian. We collect Facebook posts of Russian-speaking users and use them as a dataset for two topic modelling techniques and a distributional clustering approach. The output of the algorithms is applied as a feature aggregation method in a task of gender classification based on a smaller Facebook sample. The classification performance of the best model compares favorably with the lemmas baseline and the state-of-the-art results reported for a different genre or language. The resulting successful features are exemplified, and the differences between the three techniques in terms of classification performance and feature contents are discussed, with the best technique clearly outperforming the others.
Subjectivity analysis and authorship attribution are very popular areas of research. However, work in these two areas has been done separately. We believe that by combining information about subjectivity in texts and authorship, the performance of both tasks can be improved. In the paper, a personalized approach to opinion mining is presented, in which the notions of personal sense and idiolect are introduced; the approach is applied to the polarity classification task. It is assumed that different authors express their private states in text individually, and opinion mining results could be improved by analyzing texts by different authors separately. The hypothesis is tested on a corpus of movie reviews by ten authors. The results of applying the personalized approach to opinion mining are presented, confirming that the approach increases the performance of the opinion mining task. Automatic authorship attribution is further applied to model the personalized approach, classifying do...
We propose a distributional approach to automatic correction of abnormal collocations in a Russian text corpus containing different types of erroneous word combinations, in particular, construction blending. We develop a toolkit which uses syntactic bigrams from RNC Sketches as training data and a Word2Vec semantic model. A corpus of Russian Student Texts with annotation of erroneous word combinations, parsed morpho-syntactically with TreeTagger and MaltParser, was used in experiments. The annotated construction blending errors have been analyzed in terms of error correction by automatically proposing substitution candidates. The correction algorithm involves a set of association metrics based on context selectional preferences and semantic modeling, which makes it possible to rank substitution candidates by their acceptability. Experimental results with nouns annotated as construction blending errors demonstrate the effectiveness of our toolkit. The results show that co-occurrence and Word2Vec sema...
Russ. Digit. Libr. J., 2015
The task of predicting demographics of social media users, bloggers and authors of other types of online texts is crucial for marketing, security, etc. However, most of the papers in authorship profiling deal with author gender prediction. In addition, most of the studies are performed on English-language corpora, and there is very little work in the area for the Russian language. Filling this gap will elaborate on the multi-lingual insights into age-specific linguistic features and will provide a crucial step towards online security management in social networks. We present the first age-annotated dataset in Russian. The dataset contains blogs of 1260 authors from LiveJournal and is balanced with respect to both age group and gender of the author. We perform age classification experiments (for age groups 20–30, 30–40, 40–50) with the presented data using basic linguistic features (lemmas, part-of-speech unigrams and bigrams, etc.) and obtain a considerable baseline in age classification for Russian. W...
EPJ Data Science
Despite recent achievements in predicting personality traits and some other human psychological f... more Despite recent achievements in predicting personality traits and some other human psychological features with digital traces, prediction of subjective well-being (SWB) appears to be a relatively new task with few solutions. COVID-19 pandemic has added both a stronger need for rapid SWB screening and new opportunities for it, with online mental health applications gaining popularity and accumulating large and diverse user data. Nevertheless, the few existing works so far have aimed at predicting SWB, and have done so only in terms of Diener’s Satisfaction with Life Scale. None of them analyzes the scale developed by the World Health Organization, known as WHO-5 – a widely accepted tool for screening mental well-being and, specifically, for depression risk detection. Moreover, existing research is limited to English-speaking populations, and tend to use text, network and app usage types of data separately. In the current work, we cover these gaps by predicting both mentioned SWB scale...
Our experiment is aimed at evaluating the performance of distributional semantic features in meta... more Our experiment is aimed at evaluating the performance of distributional semantic features in metaphor identification in Russian raw text. We apply two types of distributional features representing similarity between the metaphoric/literal verb and its syntactic or linear context. Our approach is evaluated on a dataset of nine Russian verb context, which is made available to the community. The results show that both sets of similarity features are useful for metaphor identification, and do not replicate each other, as their combination systematically improves the performance for individual verb sense classification, reaching state-of-the-art results for verbal metaphor identification. A combined verb classification demonstrates that the suggested features effectively generalize over metaphoric usage in different verbs, shows that linear coherence features perform as well as the combined feature approach. By analyzing the errors we conclude that syntactic parsing quality is still mode...
2016 IEEE Artificial Intelligence and Natural Language Conference (AINL), 2016
The presented project is intended to make use of growing amounts or textual data in social networ... more The presented project is intended to make use of growing amounts or textual data in social networks in the Russian language, In order to Hnd Ungulstlc correlates of the Dark Triad personality traits, comprising non-clinical Nareissism, Machiavellianism and Psychopathy. The baekgronnd for the ilwestigation includes, on the one haotl, psychological research on these phenomena and their measurement instruments, and on the other haod, recent advaoces In computational stylometry and text-based author profiling. The measures for these psychological phenomena are provided by recognized self-report psychological surveys adapted to Russian. Morphological and semantic analysis are applied to investigate the relationship between the Dark traits and their linguistic manifestation in social network texts. Slgnlflcant morphological and semantic correlates of Narcissism, MachlavelUanlsm and Psychopathy are ldentllled and compared to respective advaoces In Engltsh author proftUng. In order to deepe...
In the paper we present distributed vector space models based on word embeddings and a specific a... more In the paper we present distributed vector space models based on word embeddings and a specific association-oriented count-based distributional algorithm which have been applied to measuring association strength in Russian syntagmatic relations (namely, between nouns and adjectives). We discuss the compositional properties of the vectors representing nouns, adjectives and adjective-noun compositions and propose two methods of detecting the syntactic association possibility. The accuracy of the proposed measures is evaluated by means of a pseudo-disambiguation test procedure and all models show considerably high results. The errors are manually annotated, and the model errors are classified in terms of their linguistic nature and compositionality features.
ExLing 2016: Proceedings of 7th Tutorial and Research Workshop on Experimental Linguistics, Dec 1, 2019
An algorithm of analyzing obscure lexical collocations is proposed. It is based on a cooccurrence... more An algorithm of analyzing obscure lexical collocations is proposed. It is based on a cooccurrence model and distributional semantic filtering. We apply the proposed technique to lexical errors of construction blending, as annotated in the Corpus of Russian Student Texts. Results of error processing are analyzed and classified; reasons for different results in the paraphrasing experiment are discussed.
Computers in Human Behavior, 2017
The goal of this paper was to assess the connection between dark personality traits and engagemen... more The goal of this paper was to assess the connection between dark personality traits and engagement in harmful online behaviors in a sample of Russian Facebook users, and to describe the language they use in online communication. A total of 6724 individuals participated in the study (mean age ¼ 44.96 years, age range: 18e85 years, 77.9% d female). Data was collected via a purpose-built application, which served two purposes: administer the survey and download consenting user's public wall posts, gender and age from the Facebook profile. The survey included questions on engagement in harmful online behaviors and the Short Dark Triad scale; 15,281 wall posts from 1972 users were included in the dataset. These posts were subjected to morphological, lexical and semantic analyses. More than 25% of the sample reported engaging in harmful online behaviors. Males were more likely to send insulting or threatening messages and post aggressive comments; no gender differences were found for disseminating other people's private information. Psychopathy and male gender were the unique predictors of engagement in harmful online behaviors. A number of significant correlations were found between the dark traits and numeric, lexical, morphological and semantic characteristics of the participants' posts.
Communications in Computer and Information Science, 2017
In the paper vector-space semantic models based on Word2Vec word embeddings algorithm and a count... more In the paper vector-space semantic models based on Word2Vec word embeddings algorithm and a count-based association-oriented algorithm are evaluated and compared by measuring association strength between Russian nouns and adjectives. A dataset of nouns and associated adjectives is used as the test set for pseudodisambiguation task. Models are trained with corpora of Russian fiction. A measure of lexical association anomaly is applied evaluating similarity between the initial noun and the resulting attributive phrase. Results of association strength are reported for models characterized by different parameter values; the best parameter value combinations are proposed. The test exemplars producing the error rate are manually annotated, and the model errors are categorized in terms of their linguistic nature and compositionality features.
SAGE Open, 2020
Positive mental health is considered to be a significant predictor of health and longevity; howev... more Positive mental health is considered to be a significant predictor of health and longevity; however, our understanding of the ways in which this important characteristic is represented in users’ behavior on social networking sites is limited. The goal of this study was to explore associations between positive mental health and language used in online communication in a large sample of Russian Facebook users. The five-item World Health Organization Well-Being Index (WHO-5) was used as a self-report measure of well-being. Morphological, sentiment, and semantic analyses were performed for linguistic data. The total of 6,724 participants completed the questionnaire and linguistic data were available for 1,972. Participants’ mean age was 45.7 years ( SD = 11.6 years); 73.4% were female. The dataset included 15,281 posts, with an average of 7.67 ( SD = 5.69) posts per participant. Mean WHO-5 score was 60.0 ( SD = 19.1), with female participants exhibiting lower scores. Use of negative sen...
Identifying subjective statements in news titles using a personal sense annotation framework
Blogs are a very important part of the digital world, indeed they can be viewed as a digital repr... more Blogs are a very important part of the digital world, indeed they can be viewed as a digital representation of the whole world. People share pictures and videos, describe their daily life, ask questions and, of course, give opinions. The blogosphere presents a unique opportunity to obtain huge statistics about what people like, feel, need – about their ‘private states’. The vast and ever-growing volumes of ‘bloggers’ and thus, information, demand an automated way of analyzing blog texts. This gives rise to a new research direction combining computing, linguistics and psychology: sentiment analysis: the computational treatment of (in alphabetical order) opinion, sentiment, and subjectivity in text. Objective characteristics of the writers based on their texts can be analyzed: their age, gender, social affiliation, character; subjective characteristics as moods, negative or positive opinions – polarity, emotions towards an object – can also be investigated. In the thesis we argue that...
The goal of the current work is to evaluate semantic feature aggregation techniques in a task of ... more The goal of the current work is to evaluate semantic feature aggregation techniques in a task of gender classification of public social media texts in Russian. We collect Facebook posts of Russian-speaking users and apply them as a dataset for two topic modelling techniques and a distributional clustering approach. The output of the algorithms is applied as a feature aggregation method in a task of gender classification based on a smaller Facebook sample. The classification performance of the best model is favorably compared against the lemmas baseline and the state-of-the-art results reported for a different genre or language. The resulting successful features are exemplified, and the difference between the three techniques in terms of classification performance and feature contents are discussed, with the best technique clearly outperforming the others.
Identifying subjective statements in news titles using a personal sense annotation framework
Blogs are a very important part of the digital world, indeed they can be viewed as a digital repr... more Blogs are a very important part of the digital world, indeed they can be viewed as a digital representation of the whole world. People share pictures and videos, describe their daily life, ask questions and, of course, give opinions. The blogosphere presents a unique opportunity to obtain huge statistics about what people like, feel, need – about their ‘private states’. The vast and ever-growing volumes of ‘bloggers’ and thus, information, demand an automated way of analyzing blog texts. This gives rise to a new research direction combining computing, linguistics and psychology: sentiment analysis: the computational treatment of (in alphabetical order) opinion, sentiment, and subjectivity in text. Objective characteristics of the writers based on their texts can be analyzed: their age, gender, social affiliation, character; subjective characteristics as moods, negative or positive opinions – polarity, emotions towards an object – can also be investigated. In the thesis we argue that...
Subjectivity analysis and authorship attribution are very popular areas of research. However, wor... more Subjectivity analysis and authorship attribution are very popular areas of research. However, work in these two areas has been done separately. We believe that by combining information about subjectivity in texts and authorship, the performance of both tasks can be improved. In the paper a personalized approach to opinion mining is presented, in which the notions of personal sense and idiolect are introduced; the approach is applied to the polarity classification task. It is assumed that different authors express their private states in text individually, and opinion mining results could be improved by analyzing texts by different authors separately. The hypothesis is tested on a corpus of movie reviews by ten authors. The results of applying the personalized approach to opinion mining are presented, confirming that the approach increases the performance of the opinion mining task. Automatic authorship attribution is further applied to model the personalized approach, classifying do...
The goal of the current work is to evaluate semantic feature aggregation techniques in a task of ... more The goal of the current work is to evaluate semantic feature aggregation techniques in a task of gender classification of public social media texts in Russian. We collect Facebook posts of Russian-speaking users and apply them as a dataset for two topic modelling techniques and a distributional clustering approach. The output of the algorithms is applied as a feature aggregation method in a task of gender classification based on a smaller Facebook sample. The classification performance of the best model is favorably compared against the lemmas baseline and the state-of-the-art results reported for a different genre or language. The resulting successful features are exemplified, and the difference between the three techniques in terms of classification performance and feature contents are discussed, with the best technique clearly outperforming the others.
We propose a distributional approach to automatic correction of abnormal collocations in a Russian text corpus containing different types of erroneous word combinations, in particular, construction blending. We develop a toolkit which uses syntactic bigrams from RNC Sketches as training data and a Word2Vec semantic model. A corpus of Russian Student Texts with annotation of erroneous word combinations, parsed morpho-syntactically with TreeTagger and MaltParser, was used in the experiments. The annotated construction blending errors have been analyzed in terms of error correction by automatically proposing substitution candidates. The correction algorithm involves a set of association metrics based on context selectional preferences and semantic modeling, which allows substitution candidates to be ranked by their acceptability. Experimental results with nouns annotated as construction blending errors demonstrate the effectiveness of our toolkit. The results show that co-occurrence and Word2Vec sema...
Russ. Digit. Libr. J., 2015
The task of predicting demographics of social media users, bloggers and authors of other types of online texts is crucial for marketing, security, etc. However, most of the papers in authorship profiling deal with author gender prediction. In addition, most of the studies are performed on English-language corpora, and very little work in the area has been done for the Russian language. Filling this gap will elaborate on the multi-lingual insights into age-specific linguistic features and will provide a crucial step towards online security management in social networks. We present the first age-annotated dataset in Russian. The dataset contains blogs of 1260 authors from LiveJournal and is balanced with respect to both age group and gender of the author. We perform age classification experiments (for age groups 20–30, 30–40, 40–50) with the presented data using basic linguistic features (lemmas, part-of-speech unigrams and bigrams etc.) and obtain a considerable baseline in age classification for Russian. W...
In the paper, vector-space semantic models based on the Word2Vec word embedding algorithm and on a count-based association-oriented algorithm are evaluated and compared by measuring association strength between Russian nouns and adjectives. A dataset of nouns and associated adjectives is used as the test set for a pseudo-disambiguation task. The models are trained on corpora of Russian fiction. A measure of lexical association anomaly is applied, which evaluates similarity between the initial noun and the resulting attributive phrase. Results of association strength are reported for models characterized by different parameter values; the best parameter value combinations are proposed. The test items on which the models err are manually annotated, and the model errors are categorized in terms of their linguistic nature and compositionality features.
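A pseudo-disambiguation test of the kind mentioned above can be sketched roughly as follows. This is only an illustration under stated assumptions: the three-dimensional vectors, the word triples and the function names are invented for the example, and plain cosine similarity stands in for whatever association measures the paper actually uses. The model scores a point whenever the attested adjective is closer to the noun than a randomly chosen confounder.

```python
import math

# Toy vectors (invented for illustration; real models would be trained
# on a corpus, e.g. with Word2Vec).
VECTORS = {
    "чай": [0.9, 0.1, 0.0],
    "горячий": [0.8, 0.2, 0.1],
    "стол": [0.1, 0.0, 0.9],
    "деревянный": [0.0, 0.1, 0.9],
}

# Each triple: (noun, attested adjective, confounder adjective).
TRIPLES = [
    ("чай", "горячий", "деревянный"),
    ("стол", "деревянный", "горячий"),
]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pseudo_disambiguation_accuracy(triples, vectors):
    """Share of triples where the attested adjective beats the confounder."""
    correct = 0
    for noun, attested, confounder in triples:
        sim_attested = cosine(vectors[noun], vectors[attested])
        sim_confounder = cosine(vectors[noun], vectors[confounder])
        if sim_attested > sim_confounder:
            correct += 1
    return correct / len(triples)


accuracy = pseudo_disambiguation_accuracy(TRIPLES, VECTORS)
```

The triples that the model gets wrong are the natural input for the manual error annotation the abstract describes.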
In the paper we present distributed vector space models based on word embeddings and a specific association-oriented count-based distributional algorithm, which have been applied to measuring association strength in Russian syntagmatic relations (namely, between nouns and adjectives). We discuss the compositional properties of the vectors representing nouns, adjectives and adjective-noun compositions and propose two methods of detecting whether a syntactic association is possible. The accuracy of the proposed measures is evaluated by means of a pseudo-disambiguation test procedure, and all models show considerably high results. The errors are manually annotated and classified in terms of their linguistic nature and compositionality features.
An algorithm for analyzing obscure lexical collocations is proposed. It is based on a co-occurrence model and distributional semantic filtering. We apply the proposed technique to lexical errors of construction blending, as annotated in the Corpus of Russian Student Texts. Results of error processing are analyzed and classified; reasons for different results in the paraphrasing experiment are discussed.
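One way to combine a co-occurrence model with distributional semantic filtering is sketched below. Everything concrete here is an assumption made for illustration: the bigram counts, the two-dimensional vectors, the similarity threshold and the choice of pointwise mutual information (PMI) as the association metric are not taken from the paper. The example uses the blended collocation "играть значение", which mixes "играть роль" and "иметь значение".

```python
import math

# Toy syntactic bigram counts (head, dependent); invented numbers.
BIGRAMS = {
    ("играть", "роль"): 50,
    ("играть", "футбол"): 30,
    ("играть", "функция"): 2,
    ("иметь", "значение"): 60,
    ("иметь", "роль"): 5,
    ("выполнять", "функция"): 40,
}

# Toy semantic vectors; a real system would use a trained model.
VECTORS = {
    "значение": [0.9, 0.1],
    "роль": [0.8, 0.3],
    "функция": [0.7, 0.4],
    "футбол": [0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pmi(head, dep, bigrams):
    """Pointwise mutual information of an attested (head, dep) bigram."""
    total = sum(bigrams.values())
    head_total = sum(c for (h, _), c in bigrams.items() if h == head)
    dep_total = sum(c for (_, d), c in bigrams.items() if d == dep)
    return math.log2(bigrams[(head, dep)] * total / (head_total * dep_total))


def suggest(head, wrong, bigrams, vectors, min_sim=0.5):
    """Rank replacements for `wrong` among words attested with `head`."""
    scored = []
    for h, dep in bigrams:
        if h != head or dep == wrong or dep not in vectors:
            continue
        # Semantic filter: keep only candidates close to the erroneous word.
        if cosine(vectors[dep], vectors[wrong]) < min_sim:
            continue
        scored.append((pmi(head, dep, bigrams), dep))
    scored.sort(reverse=True)  # strongest association first
    return [dep for _, dep in scored]


candidates = suggest("играть", "значение", BIGRAMS, VECTORS)
```

With these toy numbers "футбол" co-occurs with the verb but is removed by the semantic filter, while "роль" outranks "функция" on association strength.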
Our project is aimed at developing a syntactic parser for Russian based on the NLTK toolkit for Python. NLTK provides a linguistic environment for building formal grammars. We describe a feature-based grammar which allows analysis of the most important syntactic groups within clauses occurring in Russian texts. Our parser operates with rules which include morphological information from the input sentences. The rules are based on the tagset accepted in the PyMorphy2 morphological tagger. In the near future we plan to enrich our parser so that it can process any well-formed Russian sentence.
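A feature-based grammar of this kind might look like the following minimal sketch, which covers only adjective-noun agreement in case, number and gender. The rules and the tiny lexicon are written by hand for illustration and are not the grammar described in the abstract; the feature values imitate OpenCorpora-style grammemes of the sort PyMorphy2 outputs, but no tagger is called here. It assumes NLTK is installed.

```python
from nltk.grammar import FeatureGrammar
from nltk.parse import FeatureChartParser

# Hand-written toy grammar: a noun phrase is an adjective plus a noun
# that agree in case, number and gender (shared ?c, ?n, ?g variables).
GRAMMAR_TEXT = """
% start NP
NP[CASE=?c, NUM=?n, GEN=?g] -> ADJF[CASE=?c, NUM=?n, GEN=?g] NOUN[CASE=?c, NUM=?n, GEN=?g]
ADJF[CASE=nomn, NUM=sing, GEN=femn] -> 'новая'
ADJF[CASE=nomn, NUM=sing, GEN=masc] -> 'новый'
ADJF[CASE=accs, NUM=sing, GEN=femn] -> 'новую'
NOUN[CASE=nomn, NUM=sing, GEN=femn] -> 'книга'
NOUN[CASE=accs, NUM=sing, GEN=femn] -> 'книгу'
NOUN[CASE=nomn, NUM=sing, GEN=masc] -> 'дом'
"""

grammar = FeatureGrammar.fromstring(GRAMMAR_TEXT)
parser = FeatureChartParser(grammar)


def count_parses(tokens):
    """Number of complete NP analyses the toy grammar finds for the tokens."""
    return len(list(parser.parse(tokens)))


agreeing = count_parses(["новая", "книга"])      # features unify
disagreeing = count_parses(["новый", "книга"])   # gender clash, no parse
```

In a fuller setup the lexical rules would presumably be generated from tagger output for each input sentence rather than listed by hand.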