Veronika Solopova | Université Paris III - Sorbonne Nouvelle

Papers by Veronika Solopova

Check News in One Click: NLP-Empowered Pro-Kremlin Propaganda Detection

arXiv (Cornell University), Jan 28, 2024

Many European citizens become targets of Kremlin propaganda campaigns that aim to minimise public support for Ukraine, foster a climate of mistrust and disunity, and shape elections (Meister, 2022). To address this challenge, we developed "Check News in 1 Click", the first NLP-empowered pro-Kremlin propaganda detection application available in 7 languages, which provides the lay user with feedback on their news, and explains manipulative linguistic features and keywords. We conducted a user study, analysed user entries and models' behaviour paired with questionnaire answers, and investigated the advantages and disadvantages of the proposed interpretative solution.

PapagAI: Automated Feedback for Reflective Essays

Lecture Notes in Computer Science, Dec 31, 2022

The Evolution of Pro-Kremlin Propaganda From a Machine Learning and Linguistics Perspective

In the Russo-Ukrainian war, propaganda is produced by Russian state-run news outlets for both international and domestic audiences. Its content and form evolve and change with time as the war continues. This constitutes a challenge to content moderation tools based on machine learning when the data used for training and the current news start to differ significantly. In this follow-up study, we evaluate our previous BERT and SVM models that classify Pro-Kremlin propaganda from a Pro-Western stance, trained on the data from news articles and Telegram posts at the start of 2022, on the new 2023 subset. We examine both classifiers' errors and perform a comparative analysis of these subsets to investigate which changes in narratives provoke drops in performance.

Verbreitungsmechanismen schädigender Sprache im Netz: Anatomie zweier Shitstorms

arXiv (Cornell University), Dec 11, 2023

Telegram chat corpus

PapagAI: Automated Feedback for Reflective Essays

arXiv (Cornell University), Jul 10, 2023

Written reflective practice is a regular exercise pre-service teachers perform during their higher education. Usually, their lecturers are expected to provide individual feedback, which can be a challenging task to perform on a regular basis. In this paper, we present the first open-source automated feedback tool based on didactic theory and implemented as a hybrid AI system. We describe the components and discuss the advantages and disadvantages of our system compared to the state-of-the-art generative large language models. The main objective of our work is to enable better learning outcomes for students and to complement the teaching activities of lecturers.

Automated Content Moderation Using Transparent Solutions and Linguistic Expertise

Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence

Since the dawn of Transformer-based models, the trade-off between transparency and accuracy has been a topical issue in the NLP community. Working towards ethical and transparent automated content moderation (ACM), my goal is to find where it is still relevant to implement linguistic expertise. I show that transparent statistical models based on linguistic knowledge can still be competitive, while linguistic features have many other useful applications.

Automated multilingual detection of Pro-Kremlin propaganda in newspapers and Telegram posts

arXiv (Cornell University), Jan 25, 2023

Automated Identification of Discourse Connectives in Ukrainian

Springer International Publishing eBooks, 2022

Automated Multilingual Detection of Pro-Kremlin Propaganda in Newspapers and Telegram Posts

Datenbank-Spektrum

The full-scale conflict between the Russian Federation and Ukraine generated an unprecedented amount of news articles and social media data reflecting opposing ideologies and narratives. These polarized campaigns have led to mutual accusations of misinformation and fake news, shaping an atmosphere of confusion and mistrust for readers worldwide. This study analyses how the media affected and mirrored public opinion during the first month of the war using news articles and Telegram news channels in Ukrainian, Russian, Romanian, French and English. We propose and compare two methods of multilingual automated pro-Kremlin propaganda identification, based on Transformers and linguistic features. We analyse the advantages and disadvantages of both methods, their adaptability to new genres and languages, and ethical considerations of their usage for content moderation. With this work, we aim to lay the foundation for further development of moderation tools tailored to the current conflict.

A Computational Lexicon of Ukrainian Discourse Connectives

We introduce a new lexicon of discourse connectives for the Ukrainian language. Discourse connectives like ‘because’ and ‘therefore’ are grammatical elements which link clauses and sentences semantically and play a crucial role in discourse structure. They have been shown to be useful for many tasks in natural language processing, from argumentation mining to authorship analysis. We introduce a semi-automatic method for inventorizing discourse connectives in under-resourced languages by leveraging existing lexicons from other languages. As a result, we provide the first computer-readable lexicon of 129 Ukrainian discourse connectives. We provide syntactic as well as semantic information for these items. Finally, we carry out a small pilot study using the lexicon for discourse-level corpus annotation, and report on the distribution of connectives in Ukrainian in two different types of media.

The Telegram Chronicles of Online Harm

Journal of Open Humanities Data

Harmful language is frequent in social media, in particular in spaces which are considered anonymous and/or allow free participation. In this paper, we analyze the language in a Telegram channel populated by followers of former US President Donald Trump. We seek to identify the ways in which harmful language is used to create a specific narrative in a group of mostly like-minded discussants. Our research has several aims. First, we create an extended taxonomy of potentially harmful language that includes not only hate speech and direct insults (which have been the focus of existing computational methods), but also other forms of harmful speech discussed in the literature. We manually apply this taxonomy to a large portion of the corpus, including the time period leading up to and the aftermath of the January 2021 US Capitol riot. Our data give empirical evidence for forms of harmful speech, such as in/outgroup divisive language and the use of codes within certain communities, that have not often been investigated before. Second, we compare our manual annotations of harmful speech to several automatic methods for classifying hate speech and offensive language, namely list-based and machine-learning-based approaches. We find that the Telegram data sets still pose particular challenges for these automatic methods. Finally, we argue for the value of studying such naturally occurring, coherent data sets for research on online harm and how to address it in linguistics and philosophy.

A Telegram Corpus for Hate Speech, Offensive Language, and Online Harm

Journal of Open Humanities Data, 2021

We provide a new text corpus from the social medium Telegram, which is rich in indirect forms of divisive speech. We scraped all messages from one channel of Donald Trump supporters, covering a large part of his presidency, from late 2016 until January 2021, including the January 6 Capitol riot. The discussion among the group members, over this long time period, includes the spread of disinformation, disparaging of out-group members, and other forms of harmful speech. To enable research into the role of harmful speech in political discourse, we added two types of annotations to the corpus: (i) automatic annotations of offensive language for all messages, and (ii) our own manual annotations of harmful language for a portion of the posts leading up to the January 2021 Capitol riot and its aftermath.

Implications of the New Regulation Proposed by the European Commission on Automatic Content Moderation

2021 ISCA Symposium on Security and Privacy in Speech Communication

Adapting Coreference Resolution to Twitter Conversations

Findings of the Association for Computational Linguistics: EMNLP 2020
