Topic modeling Research Papers - Academia.edu

Objectives. Infodemics of false information on social media are a growing societal problem, aggravated by the COVID-19 pandemic. The development of infodemics bears characteristic resemblances to epidemics of infectious diseases. This paper presents several methodologies that aim to measure the extent and development of infodemics through the lens of epidemiology. Methods. The time-varying reproduction number (Rt) was used as a measure of the infectiousness of the infodemic; topic modeling was used to create topic clouds and topic similarity heat maps; and network analysis was used to create directed and undirected graphs to identify super-spreader and multiple-carrier communities on social media. Results. Forty-two (42) latent topics were discovered. Reproductive trends for a specific topic were observed to have significantly higher peaks (Rt 4-5) than general misinformation (Rt 1-3). From a sample of social media misinformation posts, a total of 385 groups and 804 connections were found w...
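
As a rough illustration of the epidemiological lens described above, the sketch below estimates a time-varying reproduction number from daily counts of misinformation posts, in the style of a Cori-type estimator. This is not the paper's implementation; the counts and the serial-interval weights are invented for the example.

```python
import numpy as np

def time_varying_r(daily_counts, si_weights):
    """Estimate Rt as new 'cases' divided by the expected infection
    pressure from recent days, weighted by a serial-interval profile."""
    counts = np.asarray(daily_counts, dtype=float)
    rt = np.full(len(counts), np.nan)
    for t in range(len(si_weights), len(counts)):
        # Weighted sum of recent post counts = current "force of infection"
        pressure = sum(counts[t - k] * si_weights[k - 1]
                       for k in range(1, len(si_weights) + 1))
        if pressure > 0:
            rt[t] = counts[t] / pressure
    return rt

# Hypothetical daily counts of misinformation posts on one topic
posts = [3, 5, 8, 13, 20, 34, 50, 61, 55, 40, 28, 19]
weights = [0.4, 0.3, 0.2, 0.1]  # assumed serial-interval distribution
print(np.round(time_varying_r(posts, weights), 2))
```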

Analysis and modeling of crime text report data has important applications, including refinement of crime classifications, clustering of documents, and feature extraction for spatiotemporal forecasts. Having better neural network representations of crime text data may facilitate all of these tasks. This paper evaluates the ability of generative adversarial network models to represent crime text data and generate realistic crime reports. We compare four state-of-the-art GAN algorithms quantitatively, using metrics such as coherence, embedding similarity, and negative log-likelihood, and qualitatively, based on inspection of generated text. We discuss current challenges with crime text representation and directions for future research.

The World Wide Web is now the primary source for information discovery. Users visit websites that provide information and browse for particular content according to their topic interests. During this navigation, visitors often have to jump through menus to find the right content. A recommendation system can help visitors find the right content immediately. In this study, we propose a two-level recommendation system based on association rules and topic similarity. We generate association rules by applying the Apriori algorithm. The dataset for association rule mining is a session of topics, built by combining the results of sessionization and topic modeling. The topic similarity, in turn, is computed by comparing the topic proportions of web articles, inferred with Latent Dirichlet Allocation (LDA). The results show that our dataset does not contain many interesting topic relations within a single session. This can be resolved by the second level of recommendation, which looks for articles with similar topics.
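
A minimal sketch of the two-level idea, assuming mlxtend for Apriori rule mining and gensim for LDA topic proportions; the sessions, documents, and thresholds are placeholders rather than the authors' data or code.

```python
import numpy as np
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from gensim import corpora, models

# Level 1: association rules over topic sessions (one-hot encoded)
sessions = pd.DataFrame([{"sports": 1, "politics": 1, "tech": 0},
                         {"sports": 1, "politics": 1, "tech": 1},
                         {"sports": 0, "politics": 0, "tech": 1}]).astype(bool)
itemsets = apriori(sessions, min_support=0.3, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.5)

# Level 2: fall back to topic similarity between article LDA proportions
docs = [["goal", "match", "league"], ["election", "vote", "party"],
        ["match", "team", "vote"]]
dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]
lda = models.LdaModel(bow, num_topics=2, id2word=dictionary, random_state=0)

def proportions(doc_bow, k=2):
    dense = np.zeros(k)
    for topic, p in lda.get_document_topics(doc_bow, minimum_probability=0):
        dense[topic] = p
    return dense

a, b = proportions(bow[0]), proportions(bow[2])
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(rules[["antecedents", "consequents", "confidence"]], cosine)
```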

The dynamic nature of cities, understood as complex systems with a variety of concurring factors, poses significant challenges to urban analysis for supporting planning processes. This particularly applies to large urban events because their characteristics often contradict daily planning routines. Due to the availability of large amounts of data, social media offer the possibility for fine-scale spatial and temporal analysis in this context, especially regarding public emotions related to varied topics. Thus, this article proposes a combined approach for analyzing large sports events considering event days vs comparison days (before or after the event) and different user groups (residents vs visitors), as well as integrating sentiment analysis and topic extraction. Our results based on various analyses of tweets demonstrate that different spatial and temporal patterns can be identified, clearly distinguishing both residents and visitors, along with positive or negative sentiment. Furthermore, we could assign tweets to specific urban events or extract topics related to the transportation infrastructure. Although the results are potentially able to support urban planning processes of large events, the approach still shows some limitations including well-known biases in social media or shortcomings in identifying the user groups and in the topic modeling approach.

The data deluge has created a great challenge for data mining applications, wherein the rare topics of interest are often buried in the flood of major headlines. We identify and formulate a novel problem: cross-channel anomaly detection from multiple data channels. Cross-channel anomalies are common among the individual channel anomalies and are often a portent of significant events. Central to this new problem is the development of a theoretical foundation and methodology. Using the spectral approach, we propose a two-stage detection method: anomaly detection at the single-channel level, followed by the detection of cross-channel anomalies from the amalgamation of single-channel anomalies. We also derive an extension of the proposed detection method to an online setting, which automatically adapts to changes in the data over time at low computational complexity using incremental algorithms. Our mathematical analysis shows that our method is likely to reduce the false alarm rate, by establishing theoretical results on the reduction of an impurity index. We demonstrate our method in two applications: document understanding with multiple text corpora and detection of repeated anomalies in large-scale video surveillance. The experimental results consistently demonstrate the superior performance of our method compared with related state-of-the-art methods, including the one-class SVM and principal component pursuit. In addition, our framework can be deployed in a decentralized manner, lending itself to large-scale data stream analysis.

The paper presents the topic modeling technique known as Latent Dirichlet Allocation (LDA), a form of text mining aimed at discovering the hidden (latent) thematic structure in large archives of documents. By applying LDA to the full text of the economics articles stored in the JSTOR database, we show how to construct a map of the discipline over time, and we illustrate the potential of the technique for the study of the shifting structure of economics in a time of (possible) fragmentation.
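
For readers new to the technique, here is a generic LDA run with scikit-learn (a toolkit chosen for convenience; the authors do not specify one), showing how latent topics are extracted from a small stand-in corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "monetary policy and inflation expectations",
    "labor markets and wage growth",
    "inflation, interest rates and central banks",
    "unemployment, wages and labor supply",
]
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)

# Fit a 2-topic model and print the top words per topic
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
terms = vec.get_feature_names_out()
for k, comp in enumerate(lda.components_):
    top = [terms[i] for i in comp.argsort()[-4:][::-1]]
    print(f"topic {k}: {top}")
```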

In this article we will apply various statistical analyses with the R programming language (word frequency, bigrams, word co-occurrence and Topic Modelling) to the bibliographical reviews of 1932-1933 from the Spanish journal Índice Literario (1932-1936) in order to bring us closer (i) to the vision of the collaborators on Índice Literario in the general context of the literature written in Spain in the 1930s and, particularly, in the literary work written by women and (ii) to the social and political situation in Spain during these years which is reflected in the work of the authors reviewed and in the journal reviews. Índice Literario was part of the section “Archivos de Literatura Española Contemporánea” (“Contemporary Spanish Literature Archives”) in the Centre for Historical Studies (1910-1939) and was managed by the Spanish poet Pedro Salinas (1891-1951) and a group of collaborators. The journal was a clear attempt by the Centre for Historical Studies to open itself up to the literary scene of the time and provides an indispensable source of information for getting to know the cultural reality during the years of the Second Spanish Republic. The article concludes that the quantitative analysis (the discovery of patterns, such as trends in the use of words or the identification of the most important subjects in the reviews included in the journal) complements and enriches the qualitative analysis of the publication in a satisfactory way.

The number of user-contributed comments is increasing exponentially. Such comments are found widely in social media sites, including internet discussion forums and news agency websites. In this paper, we summarize the current approaches to text analysis and the visualization tools which deal with opinion and topic mining of those comments. We then describe experiments for topic modeling on users' comments and examine the possible extensions of methods on visualization, sentiment analysis and opinion summarization systems.

The current study used structural topic modeling to investigate the ways in which news of the 2017 Quebec mosque shooting mobilized global public discourse on Twitter. The resulting globally generated Twitter conversations were divided into 9 relevant topics, whose prevalence was examined based on geographic and informational proximity to the location of the incident. Tweets posted from locations geographically closer to the shooting prevalently incorporated individual-oriented and conflict-focused storytelling. Conversely, tweets geographically farther from the incident prevalently featured macro-narratives that pointed to societal implications. This study also explored informational distance, which reflects the ability to access in-depth news sources. Results showed topical differences between journalist/institutional tweets and lay users' tweets. This study concludes that proximity influences global conversations related to hate crime news.

This paper focuses on one type of content, text, which in itself is complex and requires a significant understanding of language and human cognition. Traditionally, researchers used what is now known as the manual approach to carrying out a content analysis of text, in which humans manually code and analyze the text. In the 1980s, the computer-aided approach to content analysis of text was developed, allowing researchers to automate the coding and analysis at least partially. Software programs are used to manipulate text and compute word frequency lists, keyword-in-context lists, concordances, classifications of text in terms of content categories, category counts, and so on; human researchers then interpret the results (Deffner, 1986).

With the increasing implementation of Building Information Modeling (BIM), the quantity surveyor's fundamental responsibility for measuring and pricing is being challenged. To maintain their professional position within the industry, quantity surveyors need to move toward information- and knowledge-based services, which require the quantity surveyor's insight into and analysis of information from various sources. However, throughout the building process quantity surveyors make decisions based on subjective judgments about the value and quality of information, yet they rarely find the key information needed to get the task done. The value and quality of information are different in nature: information quality is context-independent, while information value is content-dependent. Both are inherently difficult to quantify, and so there is a lack of methodologies for assessing information value and quality. The research poses the following challenges to quantity surveyors: "How can we identify high-value information within quantity surveying firms? Is it possible to establish a filter mechanism that records high-value information for reuse and helps quantity surveyors judge the value of information?" This paper investigates the quantity surveyor's information life-cycle within a design centre environment. Investigation across people, process, and technology indicates that the existing workflow is hard to change. At the current stage of BIM, its contribution focuses on bills of quantities. Lacking a sufficient filter mechanism, quantity surveyors cannot provide an insight view of information already recorded throughout the construction project process, e.g. Life Cycle Costing (LCC). Technology is available to provide such a service, but development of such a system has stalled.

John Taylor’s five-volume documentary collection The Philippine Insurrection against the United States is an invaluable resource on the Philippine-American War. However, the criteria with which Taylor selected the documents for his collection have remained opaque. The present methodological paper explores the use of Topic Modeling, a tool from the Digital Humanities, to posthumously interrogate the archivist. By comparing the preliminary results of successive manual and automatic coding, its findings highlight the crucial role that materials external to a corpus play in helping the researcher to gauge whether or not particular strings of co-occurrent words are relevant to a research project.

Purpose: The purpose of this paper is to introduce, apply and compare how artificial intelligence (AI), and specifically the IBM Watson system, can be used for content analysis in marketing research relative to manual and computer-aided (non-AI) approaches to content analysis. Design/methodology/approach: To illustrate the use of AI-enabled content analysis, this paper examines the text of leadership speeches, content related to organizational brand. The process and results of using AI are compared to manual and computer-aided approaches by using three performance factors for content analysis: reliability, validity and efficiency. Findings: Relative to manual and computer-aided approaches, AI-enabled content analysis provides clear advantages with high reliability, high validity and moderate efficiency. Research limitations/implications: This paper offers three contributions. First, it highlights the continued importance of the content analysis research method, particularly with the explosive growth of natural language-based user-generated content. Second, it provides a road map of how to use AI-enabled content analysis. Third, it applies and compares AI-enabled content analysis to manual and computer-aided approaches, using leadership speeches. Practical implications: For each of the three approaches, nine steps are outlined and described to allow for replicability of this study. The advantages and disadvantages of using AI for content analysis are discussed. Together these are intended to motivate and guide researchers to apply and develop AI-enabled content analysis for research in marketing and other disciplines. Originality/value: To the best of the authors' knowledge, this paper is among the first to introduce, apply and compare how AI can be used for content analysis.

Increasingly, management researchers are using topic modeling, a new method borrowed from computer science, to reveal phenomenon-based constructs and grounded conceptual relationships in textual data. By conceptualizing topic modeling as the process of rendering constructs and conceptual relationships from textual data, we demonstrate how this new method can advance management scholarship without turning topic modeling into a black box of complex computer-driven algorithms. We begin by comparing features of topic modeling to related techniques (content analysis, grounded theorizing, and natural language processing). We then walk through the steps of rendering with topic modeling and apply rendering to management articles that draw on topic modeling. Doing so enables us to identify and discuss how topic modeling has advanced management theory in five areas: detecting novelty and emergence, developing inductive classification systems, understanding online audiences and products, analyzing frames and social movements, and understanding cultural dynamics. We conclude with a review of new topic modeling trends and revisit the role of researcher interpretation in a world of computer-driven textual analysis. We would like to thank the editors of the Academy of Management Annals for their support and helpful comments. We also thank the participants in our various topic modeling presentations and reviewers and division organizers (specifically Peer Fiss and Renate Meyer) at the Academy of Management meetings. In addition, we would like to recognize Marc-David Seidel and Christopher Steele from the Interpretive Data Science (IDeaS) group for their role in germinating these ideas, Mike Pfarrer for his comments on a later draft of the paper, and Kara Gehman for her fine-grained edits on next-to-final drafts. Finally, we wish to express our appreciation to our life partners for not only putting up with but also actively discussing this paper as it evolved.

Electronic Theses and Dissertations (ETDs) pose the challenge of managing and extracting appropriate knowledge for decision making. To tackle this, topic modeling was first applied to Library and Information Science (LIS) theses submitted to Shodhganga (an Indian ETD digital repository) to determine the five core topics/tags, and the performance of the model built on those topics/tags was then analyzed. Using a Latent Dirichlet Allocation based Topic-Modeling-Toolkit, the five core topics were found to be information literacy, user studies, scientometrics, library resources and library services for the period 2013-2017, and consequently all the theses were summarized with their respective topic proportions for these tags/topics. A Support Vector Machine (Linear) prediction model created using the RapidMiner toolbox showed 88.78% accuracy with a kappa value of 0.85.
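
The study itself used a Topic-Modeling-Toolkit and RapidMiner; a rough Python analogue of the pipeline (an assumption of convenience, not the authors' toolchain) feeds LDA topic proportions into a linear SVM and reports accuracy and Cohen's kappa:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Hypothetical thesis abstracts and core-topic tags
abstracts = ["information literacy skills of students",
             "user studies in academic libraries",
             "scientometric analysis of citations",
             "library resources and collection development",
             "library services for rural users",
             "citation analysis and research output"]
tags = ["information literacy", "user studies", "scientometrics",
        "library resources", "library services", "scientometrics"]

X = CountVectorizer().fit_transform(abstracts)
theta = LatentDirichletAllocation(n_components=5,
                                  random_state=0).fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(theta, tags, test_size=0.33,
                                          random_state=0)
clf = LinearSVC().fit(X_tr, y_tr)
pred = clf.predict(X_te)
print(accuracy_score(y_te, pred), cohen_kappa_score(y_te, pred))
```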

Most government organizations now have their own space on the web, permitting citizens to find information and progressively participate in e-government. There are only a few studies that address e-government adoption and the concerns of citizens about government services in developing countries like the Philippines. This study focuses on the development of a government portal that solicits public feedback and complaints with the help of different ICT tools: web crawling, tag clouds, topic modelling and social media networking sites. The system can help reveal citizens' needs and expectations for government services through an e-government platform. It uses web crawling technology to crawl the news, articles, updates and complaints from the websites of the ten most-complained-about government agencies. Users can give feedback and comments on selected websites, and these are extracted to produce text visualizations using tag clouds and topic modelling. The information gathered is sent to the respective government agencies so they become aware of citizens' complaints and can take the necessary actions. In addition, an agile approach was utilized for the software development. The system empowers citizens and informs them which government agencies need to improve their services. Moreover, it provides the government with more opportunities to better fulfil its responsibilities. Thus, it is recommended that the developed system be implemented as an innovative mode of communication which can improve the transparency of the government and encourage citizens to participate in the government's decision-making process.

In 2003, Gayatri Chakravorty Spivak published Death of a Discipline, an exhortation to create “an inclusive comparative literature,” one that “takes the languages of the Southern Hemisphere as active cultural media rather than as objects of cultural study.” To many literary scholars such a development seemed welcome and even likely. Instead, ten years later, an entirely different transformation has taken place via the development of the Digital Humanities (DH), in which the close study of literature and the languages in which it is embedded have themselves been demoted, in favor of “distant reading” and other forms of quantitative and large-scale analyses, and whose language politics have regressed rather than progressed from the state Spivak described. DH advertises itself as an unexceptionable application of computational techniques to literary scholarship, yet its advent has accompanied an almost complete reorientation of literary studies as a field—a virtual death of the vision described by Spivak. The advent of DH is quite unlike the ones accompanying the introduction of computers into other disciplines, whose basic precepts have remained largely intact in the face of digitization. DH’s paradoxical use of the adjective “digital” to describe only a fraction of research methods that engage with digital technology creates a tension that must be resolved: either by the DH label being reabsorbed into literary studies, or by literary scholarship itself being fundamentally altered, a goal which DH has already in part achieved.

What are we truly saying when we talk of JRPGs? Over the years, the field of game studies has adopted the term JRPG as a viable genre category to identify a certain corpus of video games originating in Japan that has penetrated the Western video game cultural landscape since the 1990s. However, a closer examination indicates that the spread of this term is deeply dependent on the global network providing the distribution and reception of these games, sometimes relying on journalistic and fan discourses to do so. Indeed, while the term is broadly used in North America and Europe, talk of a specific JRPG genre in Japan is virtually nonexistent. The notion of JRPG is a difficult object to handle in the context of academic enquiry, often leading writers to address these games only in terms of broad generalization and cultural determinism. So far, little academic work has focused on uncovering the circumstances of the rise of JRPG as a genre denomination in gamer parlance. Douglas Schules has explored modern JRPGs in the light of kawaii culture (2015) and has also raised academic interest in the definition of JRPG within fan communities as a negotiation between the elements of gameplay mechanics, fictional settings and Japanese exoticism, striving to pinpoint where exactly JRPGs begin to diverge from traditional RPGs understood to belong to the Western game design tradition (2012). More research is certainly warranted, but a major problem that holds back innovative work on this topic is the absence of a clear understanding of the historiography of the genre. Understanding the circumstances and the manner in which journalists and fans started to identify specific games as JRPGs, as well as how this discourse evolved, is crucial to properly evaluating how reliable and productive this term is when talking about Japanese game culture in a scholarly context. This paper is meant to provide the first step in a larger project of studying the genre by providing an outline of the evolution of the discourse surrounding Japanese role-playing

Historical scholarship is currently undergoing a digital turn. All historians have experienced this change in one way or another, by writing on word processors, applying quantitative methods on digitalized source materials, or using internet resources and digital tools.
Digital Histories showcases this emerging wave of digital history research. It presents work by historians who – on their own or through collaborations with e.g. information technology specialists – have uncovered new, empirical historical knowledge through digital and computational methods. The topics of the volume range from the medieval period to the present day, including various parts of Europe. The chapters apply an exemplary array of methods, such as digital metadata analysis, machine learning, network analysis, topic modelling, named entity recognition, collocation analysis, critical search, and text and data mining.
The volume argues that digital history is entering a mature phase, digital history ‘in action’, where its focus is shifting from the building of resources towards the making of new historical knowledge. This also involves novel challenges that digital methods pose to historical research, including awareness of the pitfalls and limitations of the digital tools and the necessity of new forms of digital source criticisms.
Through its combination of empirical, conceptual and contextual studies, Digital Histories is a timely and pioneering contribution taking stock of how digital research currently advances historical scholarship.

While there is currently a growing body of academic literature on Arabic rap and hip hop, former research, instead of investigating the content of this genre in a systematic way, has opted for select readings of chosen artists or distanced theorizations about Arabic rap as a general phenomenon. Presently, there is either too little engagement with lyrical content or too much focus on ethnomusicological and cultural studies approaches which, instead of content, consider musical intertextuality, an individual artist's biography, corpus, or their sociopolitical milieu. As a generic inquiry, this project conducts an Integrated Content Analysis to characterize the diversity of themes represented in Arabic rap lyrics as a means to assess some of the claims previous scholarship has made. A fully integrated content analysis consists of mutually informing qualitative, quantitative, computational and manual approaches to content analysis. As primary data, this project created a sample of 100 Arabic-language rap and hip-hop tracks released between 2008 and 2017, selected using a stratified random sampling strategy. Sixteen of these tracks were fully analyzed. In the conclusion, I present an adaptation of Simpson's Index of Diversity to introduce a means of measuring the thematic diversity of the genre of Arabic rap based on a given sample. This project seeks to revitalize the humanities with a bit of technical rigor and as a last attempt to resuscitate the undead, parochial, and ethically troubled discipline of Middle Eastern Studies.

This thesis was to be presented for discussion in the November 2021 graduation session of the master's programme in "Sociologia e ricerca sociale". Owing to the extraordinarily unfair conduct of Prof. Bracciale and Dr. Martella of Unipi, it was not presented, despite my having worked on it with great commitment for almost two years; I therefore had to choose a new subject and a new supervisor (fortunately the new work, devoted to the self-reduction movements of the 1970s, was very satisfying). The thesis is still slightly incomplete, but very far along in the process of completion. Despite the outcome, the work was very interesting.

We selected 25 papers for publishing out of the 52 submissions (an acceptance rate of 48%) that we received from students from a wide variety of countries. Three papers were presented orally in one of the parallel sessions of the main conference. The other 22 papers were shown as posters as part of the poster session of the main conference.

The rapid advancement of artificial intelligence (AI) offers exciting opportunities for marketing practice and academic research. In this study, through the application of natural language processing, machine learning, and statistical algorithms, we examine extant literature in terms of its dominant topics, diversity, evolution over time, and dynamics to map the existing knowledge base. Ten salient research themes emerge: (1) understanding consumer sentiments, (2) industrial opportunities of AI, (3) analyzing customer satisfaction, (4) electronic word-of-mouth-based insights, (5) improving market performance, (6) using AI for brand management, (7) measuring and enhancing customer loyalty and trust, (8) AI and novel services, (9) using AI to improve customer relationships, and (10) AI and strategic marketing. The scientometric analyses reveal key concepts, keyword co-occurrences, authorship networks, top research themes, landmark publications, and the evolution of the research field over time. With the insights as a foundation, this article closes with a proposed agenda for further research.

The present paper describes the importance and usage of metadata tagging and prediction modeling tools for researchers and librarians. 387 articles were downloaded from the DESIDOC Journal of Library and Information Technology (DJLIT) for the period 2008-17. The study was divided into two phases. The first phase determined the core topics of the research articles using the Topic-Modeling-Toolkit (TMT), which is based on latent Dirichlet allocation (LDA), whereas the second phase employed prediction analysis using the RapidMiner toolbox to annotate future research articles on the basis of the modeled topics. The core topics (tags) were found to be digital libraries, information literacy, scientometrics, open access, and library resources for the studied period. The study further annotated the scientific articles according to the modeled topics to provide a better searching experience to users. Sugimoto, Li, Russell, et al. (2011), Figuerola, Marco, and Pinto (2017), and Lamba and Madhusudhan (2018) have performed studies similar to the present paper, but with major modifications.

Naïve poetry, that is, poetry that has not passed through editorial filters, published by the authors on the Internet (on the website stihi.ru), provides unique material for the study of natural interpretations of important political events. The article analyzes poems in which the word "Crimea" is mentioned, from the website stihi.ru, for the period 2000–2019. These poems are processed using the technology of topic modeling. The essence of this technology is that in a large collection there are co-occurring words, usually semantically close within a certain unified context. A series of such words serve as representations of "topics", that is, what characterizes the text from a semantic point of view. The sample was divided into several stages and a model of 5 themes was built for each. For Crimea, one of the main topics is that of paradise on earth. It is the topic that unites most of the poems written before the events of 2014. After 2014, in works by amateur authors we observe the invasion of current topics into the established world of the resort; the latter does not disappear, but gives way to politics. Five years later, political topics remain, but landscape and love lyrics return as well. On the topical level, we do not trace a clear influence of "high" poetry on the authors of stihi.ru. Traditional in form (verse division, rhythm, rhyme), the response of naïve poets breaks sharply with the literary tradition of substantive embodiment of such a response. Topic modeling allows us to evaluate the transformation of the Crimean plot in that segment of public consciousness which is reflected in the production of naïve poets.

This presentation refers to the project done by Ms. Sidra Mehtab as part of her MSc (Data Science & Analytics) minor projects series. The project has two parts. In Part I of the project, we carried out a sentiment analysis on Twitter data based on reviews written by customers of six US airlines. The tweets are already classified into three categories: "positive", "negative", and "neutral". Using a supervised classification approach, we applied a Random Forest classifier model to the tweet data. We tested the model on the test data and evaluated it on various metrics such as precision, recall, and F1-score. In Part II of the project, we carried out another important text mining task known as topic modeling. We performed topic modeling using the Scikit-Learn library of Python. We used a food review dataset consisting of 50K text reviews on various food items and categorized the reviews into various topics using a method called Latent Dirichlet Allocation (LDA).
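
A compact sketch of the Part I setup, assuming TF-IDF features and scikit-learn's RandomForestClassifier; the tweets and labels here are invented placeholders, not the airline dataset:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

tweets = ["great flight and friendly crew", "delayed again, terrible service",
          "flight was on time", "lost my luggage, awful experience",
          "crew was kind and helpful", "gate change with no announcement"]
labels = ["positive", "negative", "neutral", "negative", "positive",
          "negative"]

# TF-IDF features, then a Random Forest classifier with a held-out test set
X = TfidfVectorizer().fit_transform(tweets)
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.33,
                                          random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te), zero_division=0))
```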

The classification of the emotions contained in social media is of great importance both for related fields such as media and for developing technology. In this study, Latent Dirichlet Allocation (LDA), a topic modeling algorithm, was used to determine which emotions tweets on Twitter express. The dataset consists of 4000 tweets labeled with five emotions: anger, fear, happiness, sadness and surprise. Zemberek, Snowball and first-five-letter root extraction methods were used to create the models. The generated models were tested with the n-stage LDA method we developed and compared with standard LDA. With five classes, the normal LDA method achieved at best 60.4% accuracy, versus 70.5% for 2-stage LDA and 76.4% for 3-stage LDA.

We provide a brief, non-technical introduction to the text mining methodology known as topic modeling. We summarize the theory and background of the method and discuss just what kinds of things are found by topic models. Using a text corpus comprised of the eight articles from the special issue of Poetics on the subject of topic models, we run a topic model on these papers both as a way to introduce the methodology and also to help summarize some of the ways in which social and cultural scientists are using topic models. We review some of the critiques and debates over the use of the method and finally, we link these developments back to some of the original innovations in the field of content analysis that were pioneered by Harold D. Lasswell and colleagues during and just after World War II.

Abstract: Internet finance has been accompanied by divergent media voices ever since it was first proposed as a concept. To characterize the development of sentiment toward internet finance scientifically, accurately and quantitatively, we draw on nearly 16 million full-text news articles and methods such as natural language processing and deep learning to compile a set of internet finance sentiment indices covering January 2013 to April 2017. The indices measure attention and positive/negative sentiment for internet finance as a whole and for 12 sub-categories, including P2P online lending and internet payments. The indices show that overall attention to internet finance exhibits a fluctuating upward trend, while overall positive/negative sentiment oscillates sharply. Across the sub-categories of internet finance, both attention and sentiment diverge considerably.
Keywords: internet finance, sentiment index, topic model, word vector model

Modeling the interests of researchers in academic social networks is a crucial step in the process of recommending scientific articles linked to their areas of competence and expertise. In this context, a researcher profile constructed from latent variables, on the basis of the articles that interest the researcher, using the LDA (Latent Dirichlet Allocation) topic modeling technique, allows the system to capture knowledge about his or her areas of competence and skills, in order to predict needs in terms of relevant research articles. In this article we are interested in the results produced by two different implementations of LDA, Gensim and Mallet, on the basis of information provided by the researchers (explicit information), in order to compare their interpretability and check whether they are reliable sources for modeling the areas of competence and expertise of scientists.
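
A minimal sketch of such a comparison, assuming gensim 3.x (where the Mallet wrapper still ships) and a locally installed Mallet binary whose path is a placeholder:

```python
from gensim import corpora, models
from gensim.models import CoherenceModel
from gensim.models.wrappers import LdaMallet  # available in gensim < 4.0

docs = [["deep", "learning", "networks"],
        ["information", "retrieval", "search"],
        ["neural", "networks", "training"],
        ["search", "ranking", "retrieval"]]
dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]

lda_gensim = models.LdaModel(bow, num_topics=2, id2word=dictionary,
                             random_state=0)
lda_mallet = LdaMallet("/path/to/mallet", corpus=bow, num_topics=2,
                       id2word=dictionary)  # placeholder Mallet binary path

# Compare the two implementations on the same coherence measure
for name, model in [("gensim", lda_gensim), ("mallet", lda_mallet)]:
    cm = CoherenceModel(model=model, texts=docs, dictionary=dictionary,
                        coherence="c_v")
    print(name, cm.get_coherence())
```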

The concept of "life" certainly is of some use to distinguish birds and beavers from water and stones. This pragmatic usefulness has led to its construal as a categorical predicate that can sift out living entities from non-living ones depending on their possessing specific properties (reproduction, metabolism, evolvability, etc.). In this paper, we argue against this binary construal of life. Using text-mining methods across over 30,000 scientific articles, we defend instead a degrees-of-life view and show how these methods can contribute to experimental philosophy of science and concept explication. We apply topic-modeling algorithms to identify which specific properties are attributed to a target set of entities (bacteria, archaea, viruses, prions, plasmids, phages and the molecule of adenine). Eight major clusters of properties were identified together with their relative relevance for each target entity (two that relate to metabolism and catalysis, one to genetics, one to evolvability, one to structure, and, rather unexpectedly, three that concern interactions with the environment broadly construed). While aligning with intuitions, for instance about viruses being less alive than bacteria, these quantitative results also reveal differential degrees of performance that have so far remained elusive or overlooked. Taken together, these analyses provide a conceptual "lifeness space" that makes it possible to move away from a categorical construal of life by empirically assessing the relative lifeness of more-or-less alive entities.

We explore the double-edged sword of recombination in generating breakthrough innovation: recombination of distant or diverse knowledge is needed because knowledge in a narrow domain might trigger myopia; but recombination can be counterproductive when local search is needed to identify anomalies. We take into account how creativity shapes both the cognitive novelty of the idea and the subsequent realization of economic value. We develop a text-based measure of novel ideas in patents using topic modeling to identify those patents that originate new topics in a body of knowledge. We find that, counter to theories of recombination, patents that originate new topics are more likely to be associated with local search, while economic value is the product of broader recombinations as well as novelty.

Contextual advertising is a type of online advertising in which the placement of commercial ads within a web page depends on the relevance of the ads to the page content. A common approach to determining relevance is to score the match between ads and the content of the viewed page, for example by simple keyword or syntactic matching. However, because of the sparseness of advertising language and the lack of context, this approach often leads to the selection of irrelevant ads. In this paper, we propose using topic modeling to improve the relevance of retrieved ads. Unlike existing methods that directly model the content of an ad as a distribution over topics, the proposed method uses a keyword-topic model that associates each keyword provided by the advertiser with a multinomial distribution over topics. An ad with multiple keywords is then represented as a mixture of the topic distributions associated with those keywords. We empirically evaluated the performance of the proposed method on a set of real ads and web pages. The results show that using the keyword-topic model gives improved accuracy over traditional keyword matching and over a topic modeling method that does not include information about keyword-topic association. Further, combining the keyword-topic model with other methods yields a further increase in ad recommendation accuracy.
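
A toy illustration of the keyword-topic idea: each advertiser keyword carries a topic distribution, an ad is the mixture of its keywords' distributions, and relevance is the cosine similarity with the page's topic proportions. The distributions below are made-up numbers, not the paper's learned model.

```python
import numpy as np

# Hypothetical keyword -> topic distributions (each row sums to 1)
keyword_topics = {
    "mortgage": np.array([0.7, 0.2, 0.1]),
    "loan":     np.array([0.6, 0.3, 0.1]),
    "travel":   np.array([0.1, 0.1, 0.8]),
}

def ad_distribution(keywords):
    """Represent an ad as a uniform mixture of its keywords' topics."""
    return np.mean([keyword_topics[k] for k in keywords], axis=0)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

page_topics = np.array([0.65, 0.25, 0.10])  # inferred page topic proportions
ad = ad_distribution(["mortgage", "loan"])
print("relevance score:", round(cosine(ad, page_topics), 3))
```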

Community Question Answering (CQA) websites provide a rapidly growing source of information in many areas. This rapid growth, while offering new opportunities, puts forward new challenges. In most CQA implementations there is little effort in directing new questions to the right group of experts. This means that experts are not provided with questions matching their expertise, and therefore new matching questions may be missed and not receive a proper answer. We focus on finding experts for a newly posted question. We investigate the suitability of two statistical topic models for solving this issue and compare these methods against more traditional Information Retrieval approaches. We show that for a dataset constructed from the Stackoverflow website, these topic models outperform other methods in retrieving a candidate set of best experts for a question. We also show that the Segmented Topic Model gives consistently better performance compared to the Latent Dirichlet Allocation Model.

Today we are living in the modern Internet era. We can get all our information from the internet anytime and from anywhere using a desktop PC or a smart phone. However, the underlying technology for relevant information retrieval from the internet is not trivial, as the internet is a huge repository of many different kinds of information. Retrieving the relevant information from the internet in the Big Data era is like finding a needle in a haystack. This paper explores information retrieval models, experimenting first with Latent Semantic Indexing (LSI) and then with the more efficient topic modeling algorithm of Latent Dirichlet Allocation (LDA). Comparisons between the two models are described clearly and concisely in terms of their effectiveness for topic modeling. Various applications of topic modeling from the literature are also reviewed in this paper.
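
A minimal gensim sketch contrasting the two models discussed above; the corpus and parameters are placeholders, not the paper's experimental setup:

```python
from gensim import corpora, models

docs = [["search", "engine", "query"], ["topic", "model", "inference"],
        ["query", "ranking", "search"], ["model", "latent", "topic"]]
dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]

# LSI: SVD over the (here TF-IDF weighted) term-document matrix
tfidf = models.TfidfModel(bow)
lsi = models.LsiModel(tfidf[bow], id2word=dictionary, num_topics=2)

# LDA: probabilistic generative model over raw counts
lda = models.LdaModel(bow, id2word=dictionary, num_topics=2, random_state=0)

print("LSI:", lsi.print_topics())
print("LDA:", lda.print_topics())
```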

Public diplomacy is a fast-growing area of study with little agreement on its boundaries. In support of the subject’s development as a field of academic inquiry, we present a content analysis of English-language peer-reviewed articles on public diplomacy since 1965 (N = 2,124). We begin with analysis of bibliographic data to establish the field’s institutional boundaries by highlighting trends in scholarship over time and identifying prominent disciplines and journals. We then sketch the field’s conceptual boundaries by analyzing the concepts and topics that appear most in the literature. This process allows us to characterize decades of scholarship on public diplomacy and offer recommendations for future work.
Keywords: public diplomacy, soft power, meta-analysis, topic modeling, text mining

This paper suggests the use of automatic topic modeling for large-scale corpora of privacy policies using unsupervised learning techniques. The advantages of using unsupervised learning for this task are numerous. The primary advantages include the ability to analyze any new corpus with a fraction of the effort required by supervised learning, the ability to study changes in topics of interest along time, and the ability to identify finer-grained topics of interest in these privacy policies. Based on general principles of document analysis we synthesize a cohesive framework for privacy policy topic modeling and apply it over a corpus of 4,982 privacy policies of mobile applications crawled from the Google Play Store. The results demonstrate that even with this relatively moderate-size corpus quite comprehensive insights can be attained regarding the focus and scope of current privacy policy documents. The topics extracted, their structure and the applicability of the unsupervised approach for that matter are validated through an extensive comparison to similar findings reported in prior work that uses supervised learning (which heavily depends on manual annotation of experts). The comparison suggests a substantial overlap between the topics found and those reported in prior work, and also unveils some new topics of interest.

On the microblogging site Twitter, users can forward any message they receive to all of their followers. This is called a retweet and is usually done when users find a message particularly interesting and worth sharing with others. Thus, retweets reflect what the Twitter community considers interesting on a global scale, and can be used as a function of interestingness to generate a model describing the content-based characteristics of retweets. In this paper, we analyze a set of high- and low-level content-based features on several large collections of Twitter messages. We train a prediction model to forecast, for a given tweet, its likelihood of being retweeted based on its contents. From the parameters learned by the model we deduce which influential content features contribute to the likelihood of a retweet. As a result we obtain insights into what makes a message on Twitter worth retweeting and, thus, interesting.
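
A small sketch of the general setup, not the authors' feature set: a handful of invented content features feeding a logistic-regression estimate of retweet likelihood, whose coefficients hint at which features matter.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical content features per tweet:
# [has_hashtag, has_url, num_mentions, sentiment_score, tweet_length]
X = np.array([[1, 1, 0, 0.8, 90],
              [0, 0, 1, -0.2, 40],
              [1, 0, 0, 0.5, 120],
              [0, 1, 2, 0.1, 60],
              [0, 0, 0, -0.6, 30],
              [1, 1, 1, 0.9, 100]])
y = np.array([1, 0, 1, 1, 0, 1])  # 1 = was retweeted

model = LogisticRegression().fit(X, y)
new_tweet = np.array([[1, 1, 0, 0.7, 95]])
print("retweet probability:", model.predict_proba(new_tweet)[0, 1])
# Learned weights per feature, as a rough indicator of influence
print(dict(zip(["hashtag", "url", "mentions", "sentiment", "length"],
               np.round(model.coef_[0], 3))))
```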

The European refugee crisis received heightened attention at the beginning of September 2015, when images of the drowned child, Aylan Kurdi, surfaced across mainstream and social media. While the flows of displaced persons, especially from the Middle East into Europe, had been ongoing before that date, this event and its coverage sparked a media firestorm. Mainstream-media content plays a major role in shaping discourse about events such as the refugee crisis, while social media's participatory affordances allow for the narratives to be perpetuated, challenged, and injected with new perspectives. In this study, the perspectives and narratives of the refugee crisis from the mainstream news and Twitter, in the days following Aylan's death, are compared and contrasted. Themes are extracted through topic modeling (LDA), and they reveal how news and Twitter converge and also diverge. We show that in the initial stages of the crisis, following the tragic death of Aylan, public discussion on Twitter was highly positive. Unlike the mainstream media, Twitter offered an alternative and multifaceted narrative, not bound by geo-politics, raising awareness and calling for solidarity and empathy towards those affected. This study demonstrates how mainstream and social media form a new and complementary media space, where narratives are created and transformed.

We are developing indicators for the emergence of science and technology (S&T) topics. We are targeting various S&T information resources, including metadata (i.e., bibliographic information) and full text. We explore alternative text analysis approaches, principal components analysis (PCA) and topic modeling, to extract technical topic information. We analyze the topical content to pursue potential applications and innovation pathways.

In response to the coronavirus pandemic, European Union (EU) governments developed policies to regulate exclusive health protection actions that consider societal needs, with an emphasis on elders. Given that the EU vaccination strategy uses a centralized ICT-based approach, there is little guidance on how seniors are included in national immunization programs (NIP). In this paper, we address a knowledge gap concerning the side effects of the e-governance of the NIP for the elderly. To fill this gap, we identified 40 side effects by analyzing online textual opinions (tweets, comments, articles) that express public perception of the results of the Polish NIP implementation for seniors' digital inclusion, categorized them into 8 categories, and assigned them to four e-governance functions. The main contribution of this paper is a better understanding of the digital divide, along with guidelines for government policy improvement.

Objectives: Machine learning based approaches for topic modeling are successful in extracting logical and semantic topics from a given collection of text. We experimented with topic modeling approaches for Urdu poetry text to show that these approaches perform equally well in any genre of text. Methods: Latent Dirichlet Allocation (LDA), Hierarchical Dirichlet Process (HDP), and Latent Semantic Indexing (LSI) were applied on three different datasets: (i) the CORPUS dataset for news, (ii) the poetry collection of Dr. Allama Iqbal, and (iii) a poetry collection of miscellaneous poets. Each poetry corpus includes more than five hundred poems, approximately equivalent to 1200 documents. Findings: Before forwarding the raw text to the aforementioned models, we did feature engineering comprising (i) tokenization and removal of special characters (if any), (ii) removal of stop words, (iii) lemmatization, and (iv) stemming. For comparison of the mentioned approaches on our test samples, we used coherence and dominance measures. Applications: Our experiment shows that LDA and LSI performed well on the CORPUS dataset, but none of the mentioned approaches performed well on poetry text. This brings us to the conclusion that we need to devise sequence-based models that allow users to define weights for poetry-specific text. This work opens a new direction for the domain of text generation and processing.
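
A minimal gensim sketch of the three-way comparison the authors describe, with a toy English corpus standing in for the Urdu datasets and only simple tokenization and stop-word removal as preprocessing:

```python
from gensim import corpora, models
from gensim.models import CoherenceModel

raw = ["the beloved waits in the garden of roses",
       "night and the moon speak of longing",
       "the garden blooms while the heart grieves",
       "longing for the beloved under the moon"]
stops = {"the", "in", "of", "and", "while", "for", "under"}
docs = [[w for w in line.split() if w not in stops] for line in raw]

dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]

# Fit the three candidate models on the same corpus
candidates = {
    "LDA": models.LdaModel(bow, num_topics=2, id2word=dictionary,
                           random_state=0),
    "HDP": models.HdpModel(bow, id2word=dictionary),
    "LSI": models.LsiModel(bow, num_topics=2, id2word=dictionary),
}
for name, model in candidates.items():
    cm = CoherenceModel(model=model, texts=docs, dictionary=dictionary,
                        coherence="c_v")
    print(name, round(cm.get_coherence(), 3))
```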