Sergei Koltsov | National Research University Higher School of Economics

Papers by Sergei Koltsov

Analyzing the Influence of Hyper-parameters and Regularizers of Topic Modeling in Terms of Renyi Entropy

Entropy, 2020

Topic modeling is a popular technique for clustering large collections of text documents. A variety of different types of regularization is implemented in topic modeling. In this paper, we propose a novel approach for analyzing the influence of different regularization types on results of topic modeling. Based on Renyi entropy, this approach is inspired by the concepts from statistical physics, where an inferred topical structure of a collection can be considered an information statistical system residing in a non-equilibrium state. By testing our approach on four models, namely Probabilistic Latent Semantic Analysis (pLSA), Additive Regularization of Topic Models (BigARTM), Latent Dirichlet Allocation (LDA) with Gibbs sampling, and LDA with variational inference (VLDA), we first show that the minimum of Renyi entropy coincides with the “true” number of topics, as determined in two labelled collections. Simultaneously, we find that Hierarchical Dirichlet Process (HDP) model as a well-kn...

Changes in the Topical Structure of Russian-Language LiveJournal: The Impact of Elections 2011

SSRN Electronic Journal, 2013

This study investigates the topical structure of the Russian-language blog-publishing service LiveJournal and the change in it that occurred in the course of the public activity after the State Duma elections in December 2011 as compared to a previous "control" period (November 27-December 27 and August 15-September 15 respectively). The data for both periods have been automatically obtained from 2000 top-rated blogs on the basis of ratings published by LiveJournal. Unsupervised topic modelling of the sampled posts was done using the Latent Dirichlet Allocation algorithm. In December 2011 we found considerable growth in weights of all the topics closely associated with the discussion of voting results and protests, accompanied by a more moderate decrease in the majority of other social topics. The number of users who started posting texts that may be conventionally qualified as political according to LDA in December 2011 considerably exceeds the number of those who ceased posting political items, which may indicate the existence of a blogger mobilization process in political topics.

Vox Populi Online: The Comparison of Posts' Structure and Topics Among the "Regular" and "Popular" Bloggers on LiveJournal

International Joint Conference on the Analysis of Images, Social Networks and Texts, 2014

Abstract. The article compares the topical structure and basic statistical parameters of posts by "regular" and "popular" LiveJournal bloggers. The study showed substantial topical similarity between the two samples, and the hypothesis that "top" bloggers are more interested in socio-political topics than regular bloggers was rejected. The difference between the two groups lies in the lower activity of "regular" users and the greater noisiness of their data.

ISMW FRUCT 2016 Summer School

A Thermodynamic Approach to problems in Topic Modeling

Application of Rényi and Tsallis entropies to topic modeling optimization

Physica A: Statistical Mechanics and its Applications, 2018

Stable Topic Modeling with Local Density Regularization

Lecture Notes in Computer Science, 2016

Stable topic modeling for web science

Proceedings of the 8th ACM Conference on Web Science - WebSci '16, 2016

Topic Modeling Stability and Granulated LDA

In this work, we investigate the instability of the LDA algorithm, proposing a new metric of similarity between topics. We show the limitations of LDA for the purposes of qualitative analysis in social sciences. We also propose a new way to improve the LDA model: the Granulated LDA (GLDA) extension that shows promising stability results.

Mapping the public agenda with topic modeling: The case of the Russian livejournal

Policy & Internet, 2013

ABSTRACT This article describes agendas as “packages” of topics of varying salience, set by the Russian Internet users on Russia's leading blog platform LiveJournal. The research involved modeling LiveJournal's topic structure, viewed as an important component of what is termed here self-generated public opinion. Topic modeling was performed automatically with the LDA algorithm, and complemented with hand labeling of topics. Data were collected by software created by the authors to generate a relational database storing all posts by the top 2,000 LiveJournal users from three one-month periods: two during the Russian parliamentary and presidential elections 2011–2012, and one control period. We find that LiveJournal top users share their attention evenly between “social/political” and “private/recreational” issues, the proportion being very stable. However, the substitution of diverse public affairs issues by the topics related to national street protests in the politicized periods compared to the control period was found both automatically and manually. The group of topics centered around social issues demonstrates the biggest volatility in terms of its composition and may serve as the foundation for monitoring self-generated public opinion by further application of sentiment/opinion mining methods to these topics.

Do ordinary bloggers really differ from blog celebrities?

Proceedings of the 2014 ACM conference on Web science - WebSci '14, 2014

ABSTRACT In this paper we describe structural and topical properties of "ordinary" blogs versus "popular" blogs. Using the complete directory of the Russian-language LiveJournal, we sample both groups and show that the main difference between them is in the volume of posting activity and of commenting feedback and in the skewness of the respective distributions. No substantial differences in topical structure obtained with the LDA algorithm are found, which suggests that ordinary bloggers do not hold a specific vision of topic salience and do not set their own "grassroots" agendas.

An AIDS-Denialist Online Community on a Russian Social Networking Service: Patterns of Interactions With Newcomers and Rhetorical Strategies of Persuasion

Journal of Medical Internet Research, 2014

Interval Semi-supervised LDA: Classifying Needles in a Haystack

Abstract. An important text mining problem is to find, in a large collection of texts, documents related to specific topics and then discern further structure among the found texts. This problem is especially important for social sciences, where the purpose is to find the most representative documents for subsequent qualitative interpretation. To solve this problem, we propose an interval semi-supervised LDA approach, in which certain predefined sets of keywords (that define the topics researchers are interested in) are restricted to specific intervals of topic assignments. We present a case study on a Russian LiveJournal dataset aimed at ethnicity discourse analysis.

When Internet Really Connects Across Space: Communities of Software Developers in Vkontakte Social Networking Site

Following the discussion on the role of the Internet in the formation of ties across space, this paper seeks to supplement recent findings on the prevalence of location-dependent preferential attachment online. We look at networks of online communities specifically aimed at development of location-independent ties. The paper focuses on the 25 largest communities of software developers in the leading Russian social networking site VKontakte, one of the communities being studied in depth. Evidence suggests that membership and friendship ties are overwhelmingly cross-city and even cross-country, while an in-depth analysis gives ground to assume that commenting and liking in such communities might also be location-independent. This group case study provides some insights into the nature of professional networking and shows independence of the three networks: the friendship network as a means of group identification, the commenting network as an advice-giving tool, and the liking network as a res...

Vox Populi Online: The Comparison of Posts' Structure and Topics Among the "Regular" and "Popular" Bloggers on LiveJournal

The paper compares the topical structure and basic statistical parameters of posts by “regular” and “popular” bloggers on LiveJournal. The study has shown a significant topical similarity between the two user groups. The hypothesis that “popular” bloggers are more interested in social and political topics than “regular” ones has been rejected. The difference discovered between the groups lies in the lower activity of “regular” users and the greater noise in their data.

Vox Populi Online: The Comparison of Posts' Structure and Topics Among the "Regular" and "Popular" Bloggers on LiveJournal

Information Retrieval: 9th Russian Summer School, RuSSIR 2015, Saint Petersburg, Russia, August 24-28, 2015, Revised Selected Papers

This book constitutes the thoroughly refereed proceedings of the 9th Russian Summer School on Information Retrieval, RuSSIR 2015, held in Saint Petersburg, Russia, in August 2015. The volume includes 5 tutorial papers, summarizing lectures given at the event, and 6 revised papers from the school participants. The papers focus on various aspects of information retrieval.

Information Retrieval

The course is focused on one of the most popular topics in network science: detection of communities in networks. Communities are usually conceived as subgraphs of a network, with a high density of links within the subgraphs and a comparatively lower density between them. I introduce the elements of the problem, e.g. definitions of community and partition, and delve into some of the most popular methods. Special attention is devoted to the optimization of global quality functions, like Newman-Girvan modularity, and to their limits. Finally we discuss the crucial issue of testing, both on artificial benchmark graphs with built-in community structure and on real networks.

Modeling Cascade Growth: Predicting Content Diffusion on VKontakte

Online social networks have become an essential communication channel for the broad and rapid sharing of information. Currently, the mechanics of such information-sharing is captured by the notion of cascades, which are tree-like networks comprised of (re)sharing actions. However, it is still unclear what factors drive cascade growth. Moreover, there is a lack of studies outside Western countries and platforms such as Facebook and Twitter. In this work, we aim to investigate what factors contribute to the scope of information cascading and how to predict this variation accurately. We examine six machine learning algorithms for their predictive and interpretative capabilities concerning cascades' structural metrics (width, mass, and depth). To do so, we use data from a leading Russian-language online social network VKontakte capturing cascades of 4,424 messages posted by 14 news outlets during a year. The results show that the best models in terms of predictive power are Gradient Boosting algorithm for width and depth, and Lasso Regression algorithm for the mass of a cascade, while depth is the least predictable. We find that the most potent factor associated with cascade size is the number of reposts on its origin level. We examine its role along with other factors such as content features and characteristics of sources and their audiences.

From hydrolysis to the formation of colloids: polymerization of tetravalent actinide ions

Wissenschaftliche …, 2008

Polymerization reactions of tetravalent metal ions in solution gained considerable renewed interest in recent years but are still not fully understood. The relevance of their complex chemistry spans from industrial applications in the case of zirconium salts to the nuclear fuel cycle but also ...

Interval Semi-Supervised LDA: Classifying Needles in a Haystack

An important text mining problem is to find, in a large collection of texts, documents related to specific topics and then discern further structure among the found texts. This problem is especially important for social sciences, where the purpose is to find the representative documents for subsequent qualitative interpretation. To solve this problem, we propose an interval semi-supervised LDA approach, in which certain predefined sets of keywords (that define the topics researchers are interested in) are restricted to specific intervals of topic assignments. We present a case study on a Russian LiveJournal dataset aimed at ethnicity discourse analysis.