Alexander Nwala - Academia.edu
Papers by Alexander Nwala
arXiv (Cornell University), Dec 27, 2023
While social media are a key source of data for computational social science, their ease of manipulation by malicious actors threatens the integrity of online information exchanges and their analysis. In this chapter, we focus on malicious social bots, a prominent vehicle for such manipulation. We start by discussing recent studies about the presence and actions of social bots in various online discussions to show their real-world implications and the need for detection methods. Then we discuss the challenges of bot detection methods and use Botometer, a publicly available bot detection tool, as a case study to describe recent developments in this area. We close with a practical guide on how to handle social bots in social media research.
EPJ Data Science
Malicious actors exploit social media to inflate stock prices, sway elections, spread misinformation, and sow discord. To these ends, they employ tactics that include the use of inauthentic accounts and campaigns. Methods to detect these abuses currently rely on features specifically designed to target suspicious behaviors. However, the effectiveness of these methods decays as malicious behaviors evolve. To address this challenge, we propose a language framework for modeling social media account behaviors. Words in this framework, called BLOC, consist of symbols drawn from distinct alphabets representing user actions and content. Languages from the framework are highly flexible and can be applied to model a broad spectrum of legitimate and suspicious online behaviors without extensive fine-tuning. Using BLOC to represent the behaviors of Twitter accounts, we achieve performance comparable to or better than state-of-the-art methods in the detection of social bots and coordinated inauthentic behavior.
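To give a concrete feel for the BLOC idea described above, here is a minimal, illustrative sketch (not the authors' released implementation): a hypothetical timeline of actions is encoded into a behavioral string using an assumed, simplified alphabet, and two accounts are then compared with character n-gram TF-IDF and cosine similarity.

```python
# Hypothetical sketch of a BLOC-like behavioral language (simplified, assumed alphabet).
# Symbols: T=tweet, R=retweet, P=reply; pauses encoded as "." (short) or "-" (long).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def encode_actions(timeline):
    """timeline: list of (seconds_since_previous_action, action) -> behavioral string."""
    symbols = []
    for gap, action in timeline:
        symbols.append("." if gap < 60 else "-")  # pause symbol (thresholds are assumptions)
        symbols.append({"tweet": "T", "retweet": "R", "reply": "P"}.get(action, "?"))
    return "".join(symbols)

human_like = encode_actions([(5, "tweet"), (4000, "reply"), (86400, "retweet")])
bot_like = encode_actions([(2, "retweet"), (2, "retweet"), (2, "retweet"), (2, "retweet")])

# Represent each account as character n-grams of its behavioral string.
vec = TfidfVectorizer(analyzer="char", ngram_range=(1, 2))
X = vec.fit_transform([human_like, bot_like])
print(cosine_similarity(X[0], X[1]))  # low similarity -> very different behavioral profiles
```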
In a Web plagued by disappearing resources, Web archive collections provide a valuable means of preserving Web resources important to the study of past events ranging from elections to disease outbreaks. These archived collections start with seed URIs (Uniform Resource Identifiers) hand-selected by curators. Curators produce high-quality seeds by removing non-relevant URIs and adding URIs from credible and authoritative sources, but collecting these seeds is time consuming. Two main strategies adopted by curators for discovering seeds are scraping Web (e.g., Google) Search Engine Result Pages (SERPs) and social media (e.g., Twitter) SERPs. In this work, we studied three social media platforms in order to provide insight into the characteristics of seeds generated from different sources. First, we developed a simple vocabulary for describing social media posts across different platforms. Second, we introduced a novel source for generating seeds from URIs in the threaded conversations of social media posts created by single or multiple users. Users on social media sites routinely create and share posts about news events consisting of hand-selected URIs of news stories, tweets, videos, etc. In this work, we call these posts micro-collections, and we consider them an important source for seeds because the effort taken to create micro-collections is an indication of editorial activity and a demonstration of domain expertise. Third, we generated 23,112 seed collections with text and hashtag queries from 449,347 social media posts from Reddit, Twitter, and Scoop.it. We collected in total 120,444 URIs from the conventional scraped SERP posts and micro-collections. We characterized the resultant seed collections across multiple dimensions including the distribution of URIs, precision, ages, diversity of webpages, etc. We showed that seeds generated by scraping SERPs had a higher median probability (0.63) of producing relevant URIs than micro-collections (0.5). However, micro-collections were more likely to produce seeds with a higher precision than conventional SERP collections for Twitter collections generated with hashtags. Also, micro-collections were more likely to produce older webpages and more non-HTML documents.
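As a rough illustration of seed extraction from threaded conversations (the post structure, host filter, and thread below are assumptions, not the paper's pipeline), a sketch might look like this:

```python
# Minimal sketch: extract candidate seed URIs from a threaded conversation of posts.
import re
from urllib.parse import urlparse

URL_PATTERN = re.compile(r"https?://[^\s\"'>)]+")

def extract_seeds(thread_posts, skip_hosts=("twitter.com", "t.co")):
    """thread_posts: list of {'author': str, 'text': str} dicts for one conversation."""
    seeds, seen = [], set()
    for post in thread_posts:
        for uri in URL_PATTERN.findall(post["text"]):
            host = urlparse(uri).netloc.lower()
            if host in skip_hosts or uri in seen:
                continue  # skip platform-internal links and duplicates
            seen.add(uri)
            seeds.append({"uri": uri, "author": post["author"]})
    return seeds

thread = [
    {"author": "a", "text": "Live updates: https://example.com/story1 and https://example.com/story2"},
    {"author": "b", "text": "More context https://example.org/analysis"},
]
print([s["uri"] for s in extract_seeds(thread)])
```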
General-purpose Search Engines (SEs) crawl all domains (e.g., Sports, News, Entertainment) of the Web, but sometimes the informational need of a query is restricted to a particular domain (e.g., Medical). We leverage the work of SEs as part of our effort to route domain-specific queries to local Digital Libraries (DLs). SEs are often used even if they are not the "best" source for certain types of queries. Rather than tell users to "use this DL for this kind of query", we intend to automatically detect when a query could be better served by a local DL (such as a private, access-controlled DL that is not crawlable via SEs). This is not an easy task because Web queries are short and ambiguous, and quality labeled training data is lacking (or expensive to create). To detect queries that should be routed to local, specialized DLs, we first send the queries to Google, examine the features in the resulting Search Engine Result Pages (SERPs), and then classify each query as belonging to either the scholar or non-scholar domain. Using 400,000 AOL queries for the non-scholar domain and 400,000 queries from the NASA Technical Report Server (NTRS) for the scholar domain, our classifier achieved a precision of 0.809 and an F-measure of 0.805.
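The sketch below is an illustrative (and heavily simplified) version of this routing idea: a query is classified as scholar or non-scholar from a handful of SERP-derived features. The feature names, values, and classifier choice are assumptions for illustration, not the paper's exact feature set.

```python
# Sketch: route a query by classifying features extracted from its Google SERP.
from sklearn.linear_model import LogisticRegression

# Each row: [results linking to PDFs, results on .edu/.gov hosts,
#            knowledge panel present, news results] for one query's SERP (made-up values).
X = [
    [4, 3, 0, 0],   # e.g., "plasma actuator flow control"  -> scholar
    [0, 0, 1, 5],   # e.g., "celebrity gossip today"        -> non-scholar
    [5, 4, 0, 1],   # scholar
    [0, 1, 1, 6],   # non-scholar
]
y = [1, 0, 1, 0]    # 1 = route to the local Digital Library, 0 = leave with the general SE

clf = LogisticRegression().fit(X, y)
print(clf.predict([[3, 2, 0, 1]]))  # -> [1]: likely better served by the local DL
```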
arXiv (Cornell University), Jun 24, 2018
News websites make editorial decisions about what stories to include on their homepages and what stories to emphasize (e.g., large font size for the main story). The emphasized stories on a news website are often highly similar to those on many other news websites (e.g., a terrorist event story). The selective emphasis of a top news story and the similarity of news across different news organizations are well-known phenomena but not well-measured. We provide a method for identifying the top news story for a select set of U.S.-based news websites and then quantify the similarity across them. To achieve this, we first developed a headline and link extractor that parses select websites, and then examined ten U.S.-based news website homepages during a three-month period, November 2016 to January 2017. Using archived copies retrieved from the Internet Archive (IA), we discuss the methods and difficulties for parsing these websites, and how events such as a presidential election can lead news websites to alter their document representation just for these events. We use our parser to extract a maximum of k = 1, 3, and 10 stories for each news site. Second, we used the cosine similarity measure to calculate news similarity at 8 PM Eastern Time for each day in the three months. The similarity scores show a buildup (0.335) before Election Day, a decline (0.328) on Election Day, and an increase (0.354) after Election Day. Our method shows that we can effectively identify top stories and quantify news similarity.
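For readers unfamiliar with the cosine-similarity step, here is a minimal sketch: each site's extracted top headline is vectorized with TF-IDF and pairwise cosine similarity is computed. The headlines are invented, and the paper's actual pipeline (headline/link extraction from archived homepages) is not reproduced here.

```python
# Sketch: pairwise cosine similarity of top homepage headlines (headlines are made up).
from itertools import combinations
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

site_headlines = {
    "site_a": "Presidential election results contested in key states",
    "site_b": "Key states contest presidential election results",
    "site_c": "Local team wins championship after dramatic final",
}

names = list(site_headlines)
X = TfidfVectorizer().fit_transform([site_headlines[n] for n in names])
sims = cosine_similarity(X)
for i, j in combinations(range(len(names)), 2):
    print(names[i], names[j], round(sims[i, j], 3))  # a_b high, a_c and b_c low
```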
Bull. IEEE Tech. Comm. Digit. Libr., 2019
Archive-It collections of archived web pages provide a critical source of information for studying important historical events ranging from social movements to terrorist events. The ever-changing nature of the web means that web archive collections preserve some of our collective digital heritage, and thus provide the means of studying events no longer present on the live web. There are many methods for collection building. Some adapt manual efforts, such as seed nomination on Google Docs, to begin the collection building process. The seeds are subsequently crawled (e.g., with focused crawlers) to discover more URIs. The different methods for generating collections solve some aspects of the collection building problem, but, irrespective of the method, most begin with seeds: an initial representative list of URIs (Uniform Resource Identifiers) for the collection topic. Consequently, the discovery of seeds is a critical aspect of collection building. The traditional method of seed discovery requires manual effort and is an arduous process that often requires domain knowledge about the collection topic, such as disasters and popular uprisings. This potentially limits the number of collections generated for important newsworthy events. We propose a seed generation method that extracts seeds from user-generated collections (micro-collections) in social media such as Twitter, Wikipedia references, and Reddit. The discovered seeds may augment existing collections or bootstrap new collections, thus accelerating the collection building process that largely relies on a few curators to start. Additionally, we propose a Collection Characterization Suite (CCS) to characterize and evaluate the collections generated. An important part of collection generation is the characterization or description of the collections that are generated, not just the generation of collections. The CCS provides a means of characterizing individual collections and serves as a baseline for comparing multiple collections.
arXiv (Cornell University), Mar 22, 2020
We investigate the overlap of topics of online news articles from a variety of sources. To do thi... more We investigate the overlap of topics of online news articles from a variety of sources. To do this, we provide a platform for studying the news by measuring this overlap and scoring news stories according to the degree of attention in near-real time. This can enable multiple studies, including identifying topics that receive the most attention from news organizations and identifying slow news days versus major news days. Our application, StoryGraph, periodically (10-minute intervals) extracts the first five news articles from the RSS feeds of 17 US news media organizations across the partisanship spectrum (left, center, and right). From these articles, StoryGraph extracts named entities (PEOPLE, LOCATIONS, ORGANIZATIONS, etc.) and then represents each news article with its set of extracted named entities. Finally, StoryGraph generates a news similarity graph where the nodes represent news articles, and an edge between a pair of nodes represents a high degree of similarity between the nodes (similar news stories). Each news story within the news similarity graph is assigned an attention score which quantifies the amount of attention the topics in the news story receive collectively from the news media organizations. The StoryGraph service has been running since August 2017, and using this method, we determined that the top news story of 2018 was the Kavanaugh hearings with attention score of 25.85 on September 27, 2018. Similarly, the top news story for 2019 so far (2019-12-12) is AG William Barr's release of his principal conclusions of the Mueller Report, with an attention score of 22.93 on March 24, 2019.
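The sketch below illustrates the graph-building idea only: articles become nodes, and an edge is added when their named-entity sets overlap enough. The similarity threshold, entity sets, and the "attention proxy" at the end are assumptions for illustration, not StoryGraph's exact formulas.

```python
# Sketch of the StoryGraph idea: articles as nodes, edges when entity sets overlap enough.
import networkx as nx

articles = {
    "cnn":  {"Kavanaugh", "Senate", "Ford"},
    "fox":  {"Kavanaugh", "Senate", "FBI"},
    "npr":  {"Kavanaugh", "Ford", "Hearing"},
    "espn": {"Yankees", "Red Sox"},
}

G = nx.Graph()
G.add_nodes_from(articles)
names = list(articles)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        jaccard = len(articles[a] & articles[b]) / len(articles[a] | articles[b])
        if jaccard >= 0.3:  # similarity threshold (assumed)
            G.add_edge(a, b)

# Attention proxy: the largest cluster of near-duplicate stories gets the highest score.
biggest = max(nx.connected_components(G), key=len)
print(sorted(biggest), "attention proxy:", len(biggest))
```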
News websites make editorial decisions about what stories to include on their homepages and what stories to emphasize (e.g., large font size for the main story). The emphasized stories on a news website are often highly similar to those on many other news websites (e.g., a terrorist event story). The selective emphasis of a top news story and the similarity of news across different news organizations are well-known phenomena but not well-measured. We provide a method for identifying the top news story for a select set of U.S.-based news websites and then quantify the similarity across them. To achieve this, we first developed a headline and link extractor that parses select websites, and then examined ten U.S.-based news website homepages during a three-month period, November 2016 to January 2017. Using archived copies retrieved from the Internet Archive (IA), we discuss the methods and difficulties for parsing these websites, and how events such as a presidential election can lead news websites to alter their document representation just for these events. We use our parser to extract a maximum of k = 1, 3, and 10 stories for each news site. Second, we used the cosine similarity measure to calculate news similarity at 8 PM Eastern Time for each day in the three months. The similarity scores show a buildup (0.335) before Election Day, a decline (0.328) on Election Day, and an increase (0.354) after Election Day. Our method shows that we can effectively identify top stories and quantify news similarity.
arXiv (Cornell University), May 23, 2018
Event-based collections are often started with a web search, but the search results you find on Day 1 may not be the same as those you find on Day 7. In this paper, we consider collections that originate from extracting URIs (Uniform Resource Identifiers) from Search Engine Result Pages (SERPs). Specifically, we seek to provide insight about the retrievability of URIs of news stories found on Google, and to answer two main questions: first, can one "refind" the same URI of a news story (for the same query) from Google after a given time? Second, what is the probability of finding a story on Google over a given period of time? To answer these questions, we issued seven queries to Google every day for over seven months (2017-05-25 to 2018-01-12) and collected links from the first five SERPs to generate seven collections for each query. The queries represent public interest stories: "healthcare bill," "manchester bombing," "london terrorism," "trump russia," "travel ban," "hurricane harvey," and "hurricane irma." We tracked each URI in all collections over time to estimate the discoverability of URIs from the first five SERPs. Our results showed that the daily average rate at which stories were replaced on the default Google SERP ranged from 0.21 to 0.54, and the weekly rate ranged from 0.39 to 0.79, suggesting the fast replacement of older stories by newer stories. The probability of finding the same URI of a news story one day after its initial appearance on the SERP ranged from 0.34 to 0.44. After a week, the probability of finding the same news stories diminishes rapidly to 0.01-0.11. In addition to reporting these probabilities, we also provide two predictive models for estimating the probability of finding the URI of an arbitrary news story on SERPs as a function of time. The web archiving community considers link rot and content drift important reasons for collection building. Similarly, our findings suggest that, due to the difficulty of retrieving the URIs of news stories from Google, collection building that originates from search engines should begin as soon as possible in order to capture the first stages of events, and should persist in order to capture the evolution of the events, because it becomes more difficult to find the same news stories with the same queries on Google as time progresses.
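As one way to picture "probability of refinding as a function of time," the sketch below fits a simple exponential decay to illustrative refind rates. The model form and numbers are assumptions; the paper's two predictive models may be different.

```python
# Sketch: fit a simple exponential decay to observed "refind" probabilities over time.
import numpy as np
from scipy.optimize import curve_fit

days = np.array([1, 2, 3, 7, 14, 30])
p_refind = np.array([0.40, 0.30, 0.22, 0.08, 0.03, 0.01])  # illustrative values only

def decay(t, lam):
    return np.exp(-lam * t)

(lam,), _ = curve_fit(decay, days, p_refind)
print(f"fitted decay rate: {lam:.3f}; predicted P(refind) after 5 days: {decay(5, lam):.3f}")
```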
ACM/IEEE Joint Conference on Digital Libraries, Jun 19, 2017
The national (non-local) news media has different priorities than the local news media. If one seeks to build a collection of stories about local events, the national news media may be insufficient, with the exception of local news which "bubbles up" to the national news media. If we rely exclusively on national media, or build collections exclusively on their reports, we could be late to the important milestones which precipitate major local events and thus run the risk of losing important stories due to link rot and content drift. Consequently, it is important to consult local sources affected by local events. Our goal is to provide a suite of tools (beginning with two) under the umbrella of the Local Memory Project (LMP) to help users and small communities discover, collect, build, archive, and share collections of stories for important local events by leveraging local news sources. The first service (Geo) returns a list of local news sources (newspaper, TV, and radio stations) in order of proximity to a user-supplied zip code. The second service (Local Stories Collection Generator) discovers, collects, and archives a collection of news stories about a story or event represented by a user-supplied query and zip code pair. We evaluated 20 pairs of collections, Local (generated by our system) and non-Local, by measuring archival coverage, tweet index rate, temporal range, precision, and sub-collection overlap. Our experimental results showed Local and non-Local collections with archive rates of 0.63 and 0.83, respectively, and tweet index rates of 0.59 and 0.80, respectively. Local collections produced older stories than non-Local collections, at a higher precision (relevance) of 0.84 compared to a non-Local precision of 0.72. These results indicate that Local collections are less exposed, and thus less popular, than their non-Local counterparts.
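To illustrate the Geo idea of ordering sources by proximity, here is a minimal sketch that ranks a few news sources by haversine distance to a user location. The source list and coordinates are made up; a real service would geocode the supplied zip code and maintain a much larger directory.

```python
# Sketch: rank news sources by distance to a user-supplied location.
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

sources = [
    {"name": "The Virginian-Pilot", "lat": 36.85, "lon": -76.29},
    {"name": "Richmond Times-Dispatch", "lat": 37.54, "lon": -77.44},
    {"name": "The Washington Post", "lat": 38.90, "lon": -77.03},
]
user_lat, user_lon = 36.89, -76.26  # approximate coordinates for a Norfolk, VA zip code

ranked = sorted(sources, key=lambda s: haversine_km(user_lat, user_lon, s["lat"], s["lon"]))
print([s["name"] for s in ranked])  # nearest (most local) source first
```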
arXiv (Cornell University), Nov 1, 2022
Malicious actors exploit social media to inflate stock prices, sway elections, spread misinformation, and sow discord. To these ends, they employ tactics that include the use of inauthentic accounts and campaigns. Methods to detect these abuses currently rely on features specifically designed to target suspicious behaviors. However, the effectiveness of these methods decays as malicious behaviors evolve. To address this challenge, we propose a general language for modeling social media account behavior. Words in this language, called BLOC, consist of symbols drawn from distinct alphabets representing user actions and content. The language is highly flexible and can be applied to model a broad spectrum of legitimate and suspicious online behaviors without extensive fine-tuning. Using BLOC to represent the behaviors of Twitter accounts, we achieve performance comparable to or better than state-of-the-art methods in the detection of social bots and coordinated inauthentic behavior.
Proceedings of the International AAAI Conference on Web and Social Media
Research into influence campaigns on Twitter has mostly relied on identifying malicious activities from tweets obtained via public APIs. By design, these approaches ignore deleted tweets. However, bad actors can delete content strategically to manipulate the system. Here, we provide the first exhaustive, large-scale analysis of anomalous deletion patterns involving more than a billion deletions by over 11 million accounts. Estimates based on publicly available Twitter data underestimate the true deletion volume. A small fraction of accounts delete a large number of tweets daily. We uncover two abusive behaviors that exploit deletions. First, limits on tweet volume are circumvented, allowing certain accounts to flood the network with over 26 thousand daily tweets. Second, coordinated networks of accounts engage in repetitive likes and unlikes of content that is eventually deleted, which can manipulate ranking algorithms. These kinds of abuse can be exploited to amplify content and in...
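A toy sketch of the first abuse pattern (flooding hidden by deletions) is shown below: an account's true daily volume is the observed tweets plus the deleted ones, and accounts exceeding a platform limit get flagged. The limit value and records are illustrative, not the paper's dataset or thresholds.

```python
# Sketch: flag accounts whose daily posting volume (including later-deleted tweets)
# exceeds a platform limit.
from collections import defaultdict

DAILY_TWEET_LIMIT = 2400  # assumed value of Twitter's daily tweet limit at the time

def flag_flooders(records):
    """records: iterable of (account_id, day, tweets_observed, tweets_deleted)."""
    daily = defaultdict(int)
    for account, day, observed, deleted in records:
        daily[(account, day)] += observed + deleted
    return {key for key, total in daily.items() if total > DAILY_TWEET_LIMIT}

records = [
    ("acct_1", "2021-03-01", 2300, 24000),  # deletions hide a flood of tweets
    ("acct_2", "2021-03-01", 35, 2),        # ordinary activity
]
print(flag_flooders(records))
```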
Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries
Event-based collections are often started with a web search, but the search results you find on Day 1 may not be the same as those you find on Day 7. In this paper, we consider collections that originate from extracting URIs (Uniform Resource Identifiers) from Search Engine Result Pages (SERPs). Specifically, we seek to provide insight about the retrievability of URIs of news stories found on Google, and to answer two main questions: first, can one "refind" the same URI of a news story (for the same query) from Google after a given time? Second, what is the probability of finding a story on Google over a given period of time? To answer these questions, we issued seven queries to Google every day for over seven months (2017-05-25 to 2018-01-12) and collected links from the first five SERPs to generate seven collections for each query. The queries represent public interest stories: "healthcare bill," "manchester bombing," "london terrorism," "trump russia," "travel ban," "hurricane harvey," and "hurricane irma." We tracked each URI in all collections over time to estimate the discoverability of URIs from the first five SERPs. Our results showed that the daily average rate at which stories were replaced on the default Google SERP ranged from 0.21 to 0.54, and the weekly rate ranged from 0.39 to 0.79, suggesting the fast replacement of older stories by newer stories. The probability of finding the same URI of a news story one day after its initial appearance on the SERP ranged from 0.34 to 0.44. After a week, the probability of finding the same news stories diminishes rapidly to 0.01-0.11. In addition to reporting these probabilities, we also provide two predictive models for estimating the probability of finding the URI of an arbitrary news story on SERPs as a function of time. The web archiving community considers link rot and content drift important reasons for collection building. Similarly, our findings suggest that, due to the difficulty of retrieving the URIs of news stories from Google, collection building that originates from search engines should begin as soon as possible in order to capture the first stages of events, and should persist in order to capture the evolution of the events, because it becomes more difficult to find the same news stories with the same queries on Google as time progresses.
2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL)
In a Web plagued by disappearing resources, Web archive collections provide a valuable means of preserving Web resources important to the study of past events ranging from elections to disease outbreaks. These archived collections start with seed URIs (Uniform Resource Identifiers) hand-selected by curators. Curators produce high-quality seeds by removing non-relevant URIs and adding URIs from credible and authoritative sources, but collecting these seeds is time consuming. Two main strategies adopted by curators for discovering seeds are scraping Web (e.g., Google) Search Engine Result Pages (SERPs) and social media (e.g., Twitter) SERPs. In this work, we studied three social media platforms in order to provide insight into the characteristics of seeds generated from different sources. First, we developed a simple vocabulary for describing social media posts across different platforms. Second, we introduced a novel source for generating seeds from URIs in the threaded conversations of social media posts created by single or multiple users. Users on social media sites routinely create and share posts about news events consisting of hand-selected URIs of news stories, tweets, videos, etc. In this work, we call these posts micro-collections, and we consider them an important source for seeds because the effort taken to create micro-collections is an indication of editorial activity and a demonstration of domain expertise. Third, we generated 23,112 seed collections with text and hashtag queries from 449,347 social media posts from Reddit, Twitter, and Scoop.it. We collected in total 120,444 URIs from the conventional scraped SERP posts and micro-collections. We characterized the resultant seed collections across multiple dimensions including the distribution of URIs, precision, ages, diversity of webpages, etc. We showed that seeds generated by scraping SERPs had a higher median probability (0.63) of producing relevant URIs than micro-collections (0.5). However, micro-collections were more likely to produce seeds with a higher precision than conventional SERP collections for Twitter collections generated with hashtags. Also, micro-collections were more likely to produce older webpages and more non-HTML documents.
2020 IEEE International Conference on Big Data (Big Data), 2020
The vastness of the web imposes a prohibitive cost on building large-scale search engines with limited resources. Crawl frontiers thus need to be optimized to improve the coverage and freshness of crawled content. In this paper, we propose an approach for modeling the dynamics of change in the web using archived copies of webpages. To evaluate its utility, we conduct a preliminary study on the scholarly web using 19,977 seed URLs of authors' homepages obtained from their Google Scholar profiles. We first obtain archived copies of these webpages from the Internet Archive (IA) and estimate when their actual updates occurred. Next, we apply maximum likelihood to estimate their mean update frequency (λ) values. Our evaluation shows that λ values derived from a short history of archived data provide a good estimate of the true update frequency in the short term, and that our method provides better estimates of updates at a fraction of the resources required by the baseline models. Based on this, we demonstrate the utility of archived data for optimizing the crawling strategy of web crawlers, and uncover important challenges that inspire future research directions.
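For intuition, the maximum-likelihood estimate of a Poisson update rate from observed change points is simply the number of updates divided by the observation window. The sketch below shows this under that assumption; the timestamps are made up, and the paper's estimator may additionally account for the irregular sampling of archived snapshots.

```python
# Sketch: MLE of a page's mean update rate from archived snapshots, assuming a Poisson process.
from datetime import datetime

snapshots_with_change = [  # memento dates at which the page content changed (illustrative)
    "2019-01-03", "2019-02-11", "2019-03-27", "2019-05-02", "2019-06-15",
]
observation_start, observation_end = "2019-01-01", "2019-07-01"

fmt = "%Y-%m-%d"
T_days = (datetime.strptime(observation_end, fmt) - datetime.strptime(observation_start, fmt)).days
n_updates = len(snapshots_with_change)

lam = n_updates / T_days  # MLE for a Poisson rate: updates per day
print(f"estimated update rate λ: {lam:.4f} updates/day (~1 update every {1/lam:.1f} days)")
```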
2021 ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2021
From popular uprisings to pandemics, the Web is an essential source consulted by scientists and historians for reconstructing and studying past events. Unfortunately, the Web is plagued by link rot and content drift (reference rot), which cause important Web resources to disappear. Web archive collections help reduce the costly effects of reference rot by saving Web resources that chronicle important stories and events before they disappear. These collections often begin with URLs called seeds, hand-selected by experts or scraped from social media posts. The quality of social media content varies widely; therefore, we propose a framework for assigning multidimensional quality scores to social media seeds for Web archive collections about stories and events. We leveraged contributions from social media research for attributing quality to social media content and users based on credibility, reputation, and influence. We combined these with additional contributions from Web archive research that emphasize the importance of considering geographical and temporal constraints when selecting seeds. Next, we developed the Quality Proxies (QP) framework, which assigns seeds extracted from social media a quality score across 10 major dimensions, among them subject expertise. We instantiated the framework and showed that seeds can be scored across multiple QP classes that map to different policies for ranking seeds, such as prioritizing seeds from local news, reputable and/or popular sources, etc. The QP framework is extensible and robust; seeds can be scored even when a subset of the QP dimensions is absent. Most importantly, scores assigned by Quality Proxies are explainable, providing the opportunity to critique them. Our results showed that Quality Proxies resulted in the selection of quality seeds with increased precision (by ∼0.13) both when novelty is and is not prioritized. These contributions provide an explainable score applicable to ranking and selecting quality seeds for Web archive collections and other domains that select seeds from social media.
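The sketch below only illustrates the general idea of combining per-dimension seed scores into one quality score while tolerating missing dimensions. The dimension names (other than subject expertise), the weights, and the weighted-average combination are assumptions, not the Quality Proxies definition from the paper.

```python
# Sketch: combine per-dimension seed scores into one quality score; missing dimensions are skipped.
def quality_score(seed_scores, weights=None):
    """seed_scores: dict of dimension -> score in [0, 1]."""
    weights = weights or {}
    total, weight_sum = 0.0, 0.0
    for dim, score in seed_scores.items():
        w = weights.get(dim, 1.0)
        total += w * score
        weight_sum += w
    return total / weight_sum if weight_sum else 0.0

seed = {"popularity": 0.7, "subject_expert": 1.0, "geo": 0.4}  # other dimensions absent -> still scorable
print(round(quality_score(seed, weights={"subject_expert": 2.0}), 3))
```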
2021 ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2021
Popular web pages are archived frequently, which makes it difficult to visualize the progression of a site through the years at web archives. The What Did It Look Like (WDILL) Twitter bot shows web page transitions by creating a timelapse of a given website using one archived copy from each calendar year. WDILL was originally implemented in 2015; we recently added new features such as date range requests, diversified memento selection, updated visualizations, and sharing visualizations to Instagram. This allows scholars and the general public to explore the temporal nature of web archives.
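A minimal sketch of the "one archived copy per calendar year" step, using the Internet Archive's public CDX API, is shown below. WDILL's actual memento selection logic (and its diversified selection) may differ; this is just one simple way to get a yearly sample.

```python
# Sketch: pick one archived copy per calendar year from the Internet Archive CDX API.
import requests

def one_memento_per_year(url):
    resp = requests.get(
        "http://web.archive.org/cdx/search/cdx",
        params={
            "url": url,
            "output": "json",
            "fl": "timestamp,original",
            "collapse": "timestamp:4",  # collapse on the year portion of the 14-digit timestamp
        },
        timeout=30,
    )
    rows = resp.json()[1:]  # first row is the field-name header
    return [f"https://web.archive.org/web/{ts}/{orig}" for ts, orig in rows]

for memento in one_memento_per_year("example.com"):
    print(memento)
```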
2017 ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2017
The national (non-local) news media has different priorities than the local news media. If one seeks to build a collection of stories about local events, the national news media may be insufficient, with the exception of local news which "bubbles up" to the national news media. If we rely exclusively on national media, or build collections exclusively on their reports, we could be late to the important milestones which precipitate major local events and thus run the risk of losing important stories due to link rot and content drift. Consequently, it is important to consult local sources affected by local events. Our goal is to provide a suite of tools (beginning with two) under the umbrella of the Local Memory Project (LMP) to help users and small communities discover, collect, build, archive, and share collections of stories for important local events by leveraging local news sources. The first service (Geo) returns a list of local news sources (newspaper, TV, and radio stations) in order of proximity to a user-supplied zip code. The second service (Local Stories Collection Generator) discovers, collects, and archives a collection of news stories about a story or event represented by a user-supplied query and zip code pair. We evaluated 20 pairs of collections, Local (generated by our system) and non-Local, by measuring archival coverage, tweet index rate, temporal range, precision, and sub-collection overlap. Our experimental results showed Local and non-Local collections with archive rates of 0.63 and 0.83, respectively, and tweet index rates of 0.59 and 0.80, respectively. Local collections produced older stories than non-Local collections, at a higher precision (relevance) of 0.84 compared to a non-Local precision of 0.72. These results indicate that Local collections are less exposed, and thus less popular, than their non-Local counterparts.
arXiv, 2020
We investigate the overlap of topics of online news articles from a variety of sources. To do this, we provide a platform for studying the news by measuring this overlap and scoring news stories according to the degree of attention in near-real time. This can enable multiple studies, including identifying topics that receive the most attention from news organizations and identifying slow news days versus major news days. Our application, StoryGraph, periodically (at 10-minute intervals) extracts the first five news articles from the RSS feeds of 17 US news media organizations across the partisanship spectrum (left, center, and right). From these articles, StoryGraph extracts named entities (PEOPLE, LOCATIONS, ORGANIZATIONS, etc.) and then represents each news article with its set of extracted named entities. Finally, StoryGraph generates a news similarity graph where the nodes represent news articles, and an edge between a pair of nodes represents a high degree of similarity between the nodes (similar news stories).
arXiv (Cornell University), Dec 27, 2023
While social media are a key source of data for computational social science, their ease of manip... more While social media are a key source of data for computational social science, their ease of manipulation by malicious actors threatens the integrity of online information exchanges and their analysis. In this Chapter, we focus on malicious social bots, a prominent vehicle for such manipulation. We start by discussing recent studies about the presence and actions of social bots in various online discussions to show their realworld implications and the need for detection methods. Then we discuss the challenges of bot detection methods and use Botometer, a publicly available bot detection tool, as a case study to describe recent developments in this area. We close with a practical guide on how to handle social bots in social media research.
EPJ Data Science
Malicious actors exploit social media to inflate stock prices, sway elections, spread misinformat... more Malicious actors exploit social media to inflate stock prices, sway elections, spread misinformation, and sow discord. To these ends, they employ tactics that include the use of inauthentic accounts and campaigns. Methods to detect these abuses currently rely on features specifically designed to target suspicious behaviors. However, the effectiveness of these methods decays as malicious behaviors evolve. To address this challenge, we propose a language framework for modeling social media account behaviors. Words in this framework, called BLOC, consist of symbols drawn from distinct alphabets representing user actions and content. Languages from the framework are highly flexible and can be applied to model a broad spectrum of legitimate and suspicious online behaviors without extensive fine-tuning. Using BLOC to represent the behaviors of Twitter accounts, we achieve performance comparable to or better than state-of-the-art methods in the detection of social bots and coordinated inau...
In a Web plagued by disappearing resources, Web archive collections provide a valuable means of p... more In a Web plagued by disappearing resources, Web archive collections provide a valuable means of preserving Web resources important to the study of past events ranging from elections to disease outbreaks. These archived collections start with seed URIs (Uniform Resource Identifiers) hand-selected by curators. Curators produce high quality seeds by removing non-relevant URIs and adding URIs from credible and authoritative sources, but it is time consuming to collect these seeds. Two main strategies adopted by curators for discovering seeds include scraping Web (e.g., Google) Search Engine Result Pages (SERPs) and social media (e.g., Twitter) SERPs. In this work, we studied three social media platforms in order to provide insight on the characteristics of seeds generated from different sources. First, we developed a simple vocabulary for describing social media posts across different platforms. Second, we introduced a novel source for generating seeds from URIs in the threaded conversations of social media posts created by single or multiple users. Users on social media sites routinely create and share posts about news events consisting of hand-selected URIs of news stories, tweets, videos, etc. In this work, we call these posts micro-collections, and we consider them as an important source for seeds because the effort taken to create micro-collections is an indication of editorial activity, and a demonstration of domain expertise. Third, we generated 23,112 seed collections with text and hashtag queries from 449,347 social media posts from Reddit, Twitter, and Scoop.it. We collected in total 120,444 URIs from the conventional scraped SERP posts and micro-collections. We characterized the resultant seed collections across multiple dimensions including the distribution of URIs, precision, ages, diversity of webpages, etc. We showed that seeds generated by scraping SERPs had a higher median probability (0.63) of producing relevant URIs than micro-collections (0.5). However, micro-collections were more likely to produce seeds with a higher precision than conventional SERP collections for Twitter collections generated with hashtags. Also, micro-collections were more likely to produce older webpages and more non-HTML documents.
General purpose Search Engines (SEs) crawl all domains (e.g., Sports, News, Entertainment) of the... more General purpose Search Engines (SEs) crawl all domains (e.g., Sports, News, Entertainment) of the Web, but sometimes the informational need of a query is restricted to a particular domain (e.g., Medical). We leverage the work of SEs as part of our effort to route domain specific queries to local Digital Libraries (DLs). SEs are often used even if they are not the "best" source for certain types of queries. Rather than tell users to "use this DL for this kind of query", we intend to automatically detect when a query could be better served by a local DL (such as a private, access-controlled DL that is not crawlable via SEs). This is not an easy task because Web queries are short, ambiguous, and there is lack of quality labeled training data (or it is expensive to create). To detect queries that should be routed to local, specialized DLs, we first send the queries to Google and then examine the features in the resulting Search Engine Result Pages (SERPs), and then classify the query as belonging to either the scholar or non-scholar domain. Using 400,000 AOL queries for the non-scholar domain and 400,000 queries from the NASA Technical Report Server (NTRS) for the scholar domain, our classifier achieved a precision of 0.809 and F-measure of 0.805.
arXiv (Cornell University), Jun 24, 2018
News websites make editorial decisions about what stories to include on their website homepages a... more News websites make editorial decisions about what stories to include on their website homepages and what stories to emphasize (e.g., large font size for main story). The emphasized stories on a news website are often highly similar to many other news websites (e.g, a terrorist event story). The selective emphasis of a top news story and the similarity of news across different news organizations are well-known phenomena but not well-measured. We provide a method for identifying the top news story for a select set of U.S.based news websites and then quantify the similarity across them. To achieve this, we first developed a headline and link extractor that parses select websites, and then examined ten United States based news website homepages during a three month period, November 2016 to January 2017. Using archived copies, retrieved from the Internet Archive (IA), we discuss the methods and difficulties for parsing these websites, and how events such as a presidential election can lead news websites to alter their document representation just for these events. We use our parser to extract k = 1, 3, 10 maximum number of stories for each news site. Second, we used the cosine similarity measure to calculate news similarity at 8PM Eastern Time for each day in the three months. The similarity scores show a buildup (0.335) before Election Day, with a declining value (0.328) on Election Day, and an increase (0.354) after Election Day. Our method shows that we can effectively identity top stories and quantify news similarity.
Bull. IEEE Tech. Comm. Digit. Libr., 2019
Archive-It collections of archived web pages provide a critical source of information for studyin... more Archive-It collections of archived web pages provide a critical source of information for studying important historical events ranging from social movements to terrorist events. The ever-changing nature of the web means that web archive collections preserve some of our collective digital heritage, and thus provide the means of studying events no longer present on the live web. There are many methods for collection building. Some adapt manual efforts, such as seed nomination on Google Docs, to begin the collection building process. The seeds are subsequently crawled (e.g., with focused crawlers) to discover more URIs. The different methods for generating collections solve some aspects of the collection building problem, but, irrespective of the method of collection building, most methods begin with seeds-an initial representative list of URIs (Uniform Resource Identifier) for the collection topic. Consequently, the discovery of seeds is a critical aspect of collection building. The traditional method of seed discovery requires manual effort and is an arduous process that often requires some domain knowledge about the collection topic, such as disasters and popular uprisings. This potentially limits the number of collections generated for important newsworthy events. We propose a seed generation method that extracts seeds from user-generated collections (microcollections) in social media such as Twitter, Wikipedia references, and Reddit. The discovered seeds may augment existing collections or bootstrap new collections, thus accelerating the collection building process that largely relies on a few curators to start. Additionally, we propose a Collection Characterization Suite (CCS) to characterize and evaluate the collections generated. An important part of collection generation is the characterization or description of the collections that are generated and not just the generation of collections. The CCS provides a means of characterizing individual collections and serves as a baseline for comparing multiple collections. CCS CONCEPTS •Information systems →Digital libraries and archives;
arXiv (Cornell University), Mar 22, 2020
We investigate the overlap of topics of online news articles from a variety of sources. To do thi... more We investigate the overlap of topics of online news articles from a variety of sources. To do this, we provide a platform for studying the news by measuring this overlap and scoring news stories according to the degree of attention in near-real time. This can enable multiple studies, including identifying topics that receive the most attention from news organizations and identifying slow news days versus major news days. Our application, StoryGraph, periodically (10-minute intervals) extracts the first five news articles from the RSS feeds of 17 US news media organizations across the partisanship spectrum (left, center, and right). From these articles, StoryGraph extracts named entities (PEOPLE, LOCATIONS, ORGANIZATIONS, etc.) and then represents each news article with its set of extracted named entities. Finally, StoryGraph generates a news similarity graph where the nodes represent news articles, and an edge between a pair of nodes represents a high degree of similarity between the nodes (similar news stories). Each news story within the news similarity graph is assigned an attention score which quantifies the amount of attention the topics in the news story receive collectively from the news media organizations. The StoryGraph service has been running since August 2017, and using this method, we determined that the top news story of 2018 was the Kavanaugh hearings with attention score of 25.85 on September 27, 2018. Similarly, the top news story for 2019 so far (2019-12-12) is AG William Barr's release of his principal conclusions of the Mueller Report, with an attention score of 22.93 on March 24, 2019.
News websites make editorial decisions about what stories to include on their website homepages a... more News websites make editorial decisions about what stories to include on their website homepages and what stories to emphasize (e.g., large font size for main story). The emphasized stories on a news website are often highly similar to many other news websites (e.g, a terrorist event story). The selective emphasis of a top news story and the similarity of news across different news organizations are well-known phenomena but not well-measured. We provide a method for identifying the top news story for a select set of U.S.based news websites and then quantify the similarity across them. To achieve this, we first developed a headline and link extractor that parses select websites, and then examined ten United States based news website homepages during a three month period, November 2016 to January 2017. Using archived copies, retrieved from the Internet Archive (IA), we discuss the methods and difficulties for parsing these websites, and how events such as a presidential election can lead news websites to alter their document representation just for these events. We use our parser to extract k = 1, 3, 10 maximum number of stories for each news site. Second, we used the cosine similarity measure to calculate news similarity at 8PM Eastern Time for each day in the three months. The similarity scores show a buildup (0.335) before Election Day, with a declining value (0.328) on Election Day, and an increase (0.354) after Election Day. Our method shows that we can effectively identity top stories and quantify news similarity.
arXiv (Cornell University), May 23, 2018
Event-based collections are often started with a web search, but the search results you find on D... more Event-based collections are often started with a web search, but the search results you find on Day 1 may not be the same as those you find on Day 7. In this paper 1 , we consider collections that originate from extracting URIs (Uniform Resource Identifiers) from Search Engine Result Pages (SERPs). Specifically, we seek to provide insight about the retrievability of URIs of news stories found on Google, and to answer two main questions: first, can one "refind" the same URI of a news story (for the same query) from Google after a given time? Second, what is the probability of finding a story on Google over a given period of time? To answer these questions, we issued seven queries to Google every day for over seven months (2017-05-25 to 2018-01-12) and collected links from the first five SERPs to generate seven collections for each query. The queries represent public interest stories: "healthcare bill," "manchester bombing, " "london terrorism, " "trump russia, " "travel ban, " "hurricane harvey," and "hurricane irma." We tracked each URI in all collections over time to estimate the discoverability of URIs from the first five SERPs. Our results showed that the daily average rate at which stories were replaced on the default Google SERP ranged from 0.21-0.54, and a weekly rate of 0.39-0.79, suggesting the fast replacement of older stories by newer stories. The probability of finding the same URI of a news story after one day from the initial appearance on the SERP ranged from 0.34-0.44. After a week, the probability of finding the same news stories diminishes rapidly to 0.01-0.11. In addition to the reporting of these probabilities, we also provide two predictive models for estimating the probability of finding the URI of an arbitrary news story on SERPs as a function of time. The web archiving community considers link rot and content drift important reasons for collection building. Similarly, our findings suggest that due to the difficulty in retrieving the URIs of news stories from Google, collection building that originates from search engines should begin as soon as possible in order to capture the first stages of events, and should persist in order to capture the evolution of the events, because it becomes more difficult to find the same news stories with the same queries on Google, as time progresses.
ACM/IEEE Joint Conference on Digital Libraries, Jun 19, 2017
The national (non-local) news media has different priorities than the local news media. If one se... more The national (non-local) news media has different priorities than the local news media. If one seeks to build a collection of stories about local events, the national news media may be insufficient, with the exception of local news which "bubbles" up to the national news media. If we rely exclusively on national media, or build collections exclusively on their reports, we could be late to the important milestones which precipitate major local events, thus, run the risk of losing important stories due to link rot and content drift. Consequently, it is important to consult local sources affected by local events. Our goal is to provide a suite of tools (beginning with two) under the umbrella of the Local Memory Project (LMP) to help users and small communities discover, collect, build, archive, and share collections of stories for important local events by leveraging local news sources. The first service (Geo) returns a list of local news sources (newspaper, TV and radio stations) in order of proximity to a user-supplied zip code. The second service (Local Stories Collection Generator) discovers, collects and archives a collection of news stories about a story or event represented by a user-supplied query and zip code pair. We evaluated 20 pairs of collections, Local (generated by our system) and non-Local, by measuring archival coverage, tweet index rate, temporal range, precision, and sub-collection overlap. Our experimental results showed Local and non-Local collections with archive rates of 0.63 and 0.83, respectively, and tweet index rates of 0.59 and 0.80, respectively. Local collections produced older stories than non-Local collections, at a higher precision (relevance) of 0.84 compared to a non-Local precision of 0.72. These results indicate that Local collections are less exposed, thus less popular than their non-Local counterpart.
arXiv (Cornell University), Nov 1, 2022
Malicious actors exploit social media to inflate stock prices, sway elections, spread misinformat... more Malicious actors exploit social media to inflate stock prices, sway elections, spread misinformation, and sow discord. To these ends, they employ tactics that include the use of inauthentic accounts and campaigns. Methods to detect these abuses currently rely on features specifically designed to target suspicious behaviors. However, the effectiveness of these methods decays as malicious behaviors evolve. To address this challenge, we propose a general language for modeling social media account behavior. Words in this language, called BLOC, consist of symbols drawn from distinct alphabets representing user actions and content. The language is highly flexible and can be applied to model a broad spectrum of legitimate and suspicious online behaviors without extensive fine-tuning. Using BLOC to represent the behaviors of Twitter accounts, we achieve performance comparable to or better than state-of-the-art methods in the detection of social bots and coordinated inauthentic behavior.
Proceedings of the International AAAI Conference on Web and Social Media
Research into influence campaigns on Twitter has mostly relied on identifying malicious activitie... more Research into influence campaigns on Twitter has mostly relied on identifying malicious activities from tweets obtained via public APIs. By design, these approaches ignore deleted tweets. However, bad actors can delete content strategically to manipulate the system. Here, we provide the first exhaustive, large-scale analysis of anomalous deletion patterns involving more than a billion deletions by over 11 million accounts. Estimates based on publicly available Twitter data underestimate the true deletion volume. A small fraction of accounts delete a large number of tweets daily. We uncover two abusive behaviors that exploit deletions. First, limits on tweet volume are circumvented, allowing certain accounts to flood the network with over 26 thousand daily tweets. Second, coordinated networks of accounts engage in repetitive likes and unlikes of content that is eventually deleted, which can manipulate ranking algorithms. These kinds of abuse can be exploited to amplify content and in...
Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries
Event-based collections are often started with a web search, but the search results you find on D... more Event-based collections are often started with a web search, but the search results you find on Day 1 may not be the same as those you find on Day 7. In this paper 1 , we consider collections that originate from extracting URIs (Uniform Resource Identifiers) from Search Engine Result Pages (SERPs). Specifically, we seek to provide insight about the retrievability of URIs of news stories found on Google, and to answer two main questions: first, can one "refind" the same URI of a news story (for the same query) from Google after a given time? Second, what is the probability of finding a story on Google over a given period of time? To answer these questions, we issued seven queries to Google every day for over seven months (2017-05-25 to 2018-01-12) and collected links from the first five SERPs to generate seven collections for each query. The queries represent public interest stories: "healthcare bill," "manchester bombing, " "london terrorism, " "trump russia, " "travel ban, " "hurricane harvey," and "hurricane irma." We tracked each URI in all collections over time to estimate the discoverability of URIs from the first five SERPs. Our results showed that the daily average rate at which stories were replaced on the default Google SERP ranged from 0.21-0.54, and a weekly rate of 0.39-0.79, suggesting the fast replacement of older stories by newer stories. The probability of finding the same URI of a news story after one day from the initial appearance on the SERP ranged from 0.34-0.44. After a week, the probability of finding the same news stories diminishes rapidly to 0.01-0.11. In addition to the reporting of these probabilities, we also provide two predictive models for estimating the probability of finding the URI of an arbitrary news story on SERPs as a function of time. The web archiving community considers link rot and content drift important reasons for collection building. Similarly, our findings suggest that due to the difficulty in retrieving the URIs of news stories from Google, collection building that originates from search engines should begin as soon as possible in order to capture the first stages of events, and should persist in order to capture the evolution of the events, because it becomes more difficult to find the same news stories with the same queries on Google, as time progresses.
2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL)
In a Web plagued by disappearing resources, Web archive collections provide a valuable means of p... more In a Web plagued by disappearing resources, Web archive collections provide a valuable means of preserving Web resources important to the study of past events ranging from elections to disease outbreaks. These archived collections start with seed URIs (Uniform Resource Identifiers) hand-selected by curators. Curators produce high quality seeds by removing non-relevant URIs and adding URIs from credible and authoritative sources, but it is time consuming to collect these seeds. Two main strategies adopted by curators for discovering seeds include scraping Web (e.g., Google) Search Engine Result Pages (SERPs) and social media (e.g., Twitter) SERPs. In this work, we studied three social media platforms in order to provide insight on the characteristics of seeds generated from different sources. First, we developed a simple vocabulary for describing social media posts across different platforms. Second, we introduced a novel source for generating seeds from URIs in the threaded conversations of social media posts created by single or multiple users. Users on social media sites routinely create and share posts about news events consisting of hand-selected URIs of news stories, tweets, videos, etc. In this work, we call these posts micro-collections, and we consider them as an important source for seeds because the effort taken to create micro-collections is an indication of editorial activity, and a demonstration of domain expertise. Third, we generated 23,112 seed collections with text and hashtag queries from 449,347 social media posts from Reddit, Twitter, and Scoop.it. We collected in total 120,444 URIs from the conventional scraped SERP posts and micro-collections. We characterized the resultant seed collections across multiple dimensions including the distribution of URIs, precision, ages, diversity of webpages, etc. We showed that seeds generated by scraping SERPs had a higher median probability (0.63) of producing relevant URIs than micro-collections (0.5). However, micro-collections were more likely to produce seeds with a higher precision than conventional SERP collections for Twitter collections generated with hashtags. Also, micro-collections were more likely to produce older webpages and more non-HTML documents.
2020 IEEE International Conference on Big Data (Big Data), 2020
The vastness of the web imposes a prohibitive cost on building large-scale search engines with limited resources. Crawl frontiers thus need to be optimized to improve the coverage and freshness of crawled content. In this paper, we propose an approach for modeling the dynamics of change in the web using archived copies of webpages. To evaluate its utility, we conduct a preliminary study on the scholarly web using 19,977 seed URLs of authors' homepages obtained from their Google Scholar profiles. We first obtain archived copies of these webpages from the Internet Archive (IA), and estimate when their actual updates occurred. Next, we apply maximum likelihood to estimate their mean update frequency (λ) values. Our evaluation shows that λ values derived from a short history of archived data provide a good estimate of the true update frequency in the short term, and that our method provides better estimations of updates at a fraction of the resources required by baseline models. Based on this, we demonstrate the utility of archived data for optimizing the crawling strategy of web crawlers, and uncover important challenges that inspire future research directions.
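As a simplified, hypothetical sketch of the λ estimation step (the paper's estimator for incomplete, archive-based observations may differ), the snippet below computes the maximum likelihood update rate under a homogeneous Poisson model, assuming the update times have already been estimated from archived copies.

```python
# Illustrative sketch only: MLE of the mean update frequency (lambda) under a
# homogeneous Poisson change model, given estimated update timestamps.
from datetime import datetime

def update_rate_mle(update_times, window_start, window_end):
    """Updates per day: number of observed updates divided by the window length."""
    span_days = (window_end - window_start).total_seconds() / 86400.0
    updates = [t for t in update_times if window_start <= t <= window_end]
    return len(updates) / span_days if span_days > 0 else float("nan")

# Assumed toy data: estimated update times of one author homepage.
updates = [datetime(2020, 1, 10), datetime(2020, 2, 2), datetime(2020, 3, 15)]
lam = update_rate_mle(updates, datetime(2020, 1, 1), datetime(2020, 4, 1))
print(f"estimated lambda: {lam:.3f} updates/day")  # ~0.033
```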
2021 ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2021
From popular uprisings to pandemics, the Web is an essential source consulted by scientists and historians for reconstructing and studying past events. Unfortunately, the Web is plagued by link rot and content drift (reference rot), which cause important Web resources to disappear. Web archive collections help reduce the costly effects of reference rot by saving Web resources that chronicle important stories and events before they disappear. These collections often begin with URLs called seeds, hand-selected by experts or scraped from social media posts. The quality of social media content varies widely; therefore, we propose a framework for assigning multidimensional quality scores to social media seeds for Web archive collections about stories and events. We leveraged contributions from social media research for attributing quality to social media content and users based on credibility, reputation, and influence. We combined these with additional contributions from Web archive research that emphasize the importance of considering geographical and temporal constraints when selecting seeds. Next, we developed the Quality Proxies (QP) framework, which assigns seeds extracted from social media a quality score across 10 major dimensions (e.g., subject expert). We instantiated the framework and showed that seeds can be scored across multiple QP classes that map to different policies for ranking seeds, such as prioritizing seeds from local news, reputable and/or popular sources, etc. The QP framework is extensible and robust; seeds can be scored even when a subset of the QP dimensions is absent. Most importantly, scores assigned by Quality Proxies are explainable, providing the opportunity to critique them. Our results showed that Quality Proxies resulted in the selection of quality seeds with increased precision (by ∼0.13) both when novelty is and is not prioritized. These contributions provide an explainable score applicable to ranking and selecting quality seeds for Web archive collections and other domains that select seeds from social media.
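The sketch below is a hypothetical illustration of the scoring idea rather than the paper's formula: it combines whatever dimension scores are available for a seed and ignores absent ones; the dimension names other than subject expert are placeholders.

```python
# Hypothetical sketch (not the paper's exact formula): combine the available
# Quality Proxy dimension scores for a seed into a single, explainable score.

def quality_score(dimension_scores, weights=None):
    """Weighted mean of the available dimension scores (each in [0, 1]).
    Absent dimensions (value None) are skipped; `weights` is optional."""
    available = {k: v for k, v in dimension_scores.items() if v is not None}
    if not available:
        return None
    weights = weights or {}
    total_w = sum(weights.get(k, 1.0) for k in available)
    return sum(weights.get(k, 1.0) * v for k, v in available.items()) / total_w

# Assumed example seed with only a subset of dimensions scored.
seed = {"popularity": 0.8, "subject_expert": 0.6, "geo": None, "reputation": 0.9}
print(quality_score(seed))                               # unweighted mean
print(quality_score(seed, weights={"reputation": 2.0}))  # prioritize reputation
```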
2021 ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2021
Popular web pages are archived frequently, which makes it difficult to visualize the progression of a site through the years at web archives. The What Did It Look Like (WDILL) Twitter bot shows web page transitions by creating a timelapse of a given website using one archived copy from each calendar year. WDILL was originally implemented in 2015; we recently added new features such as date range requests, diversified memento selection, updated visualizations, and sharing visualizations to Instagram. These additions allow scholars and the general public to explore the temporal nature of web archives.
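As one way to obtain the yearly snapshots behind such a timelapse (not necessarily how WDILL does it), the sketch below queries the Internet Archive's CDX API and collapses captures to one per calendar year.

```python
# Illustrative sketch: fetch one archived copy per calendar year for a URL
# from the Internet Archive's CDX API.
import requests

def one_memento_per_year(url):
    params = {
        "url": url,
        "output": "json",
        "fl": "timestamp,original",
        "collapse": "timestamp:4",  # first 4 digits of the timestamp = year
    }
    resp = requests.get("https://web.archive.org/cdx/search/cdx",
                        params=params, timeout=30)
    rows = resp.json() if resp.text.strip() else []
    # First row is the field header; build replay URLs for each yearly capture.
    return [f"https://web.archive.org/web/{ts}/{orig}" for ts, orig in rows[1:]]

for memento in one_memento_per_year("example.com"):
    print(memento)
```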
2017 ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2017
The national (non-local) news media has different priorities than the local news media. If one seeks to build a collection of stories about local events, the national news media may be insufficient, with the exception of local news which "bubbles" up to the national news media. If we rely exclusively on national media, or build collections exclusively from their reports, we could be late to the important milestones that precipitate major local events, and thus run the risk of losing important stories to link rot and content drift. Consequently, it is important to consult local sources affected by local events. Our goal is to provide a suite of tools (beginning with two) under the umbrella of the Local Memory Project (LMP) to help users and small communities discover, collect, build, archive, and share collections of stories for important local events by leveraging local news sources. The first service (Geo) returns a list of local news sources (newspaper, TV, and radio stations) in order of proximity to a user-supplied zip code. The second service (Local Stories Collection Generator) discovers, collects, and archives a collection of news stories about a story or event represented by a user-supplied query and zip code pair. We evaluated 20 pairs of collections, Local (generated by our system) and non-Local, by measuring archival coverage, tweet index rate, temporal range, precision, and sub-collection overlap. Our experimental results showed Local and non-Local collections with archive rates of 0.63 and 0.83, respectively, and tweet index rates of 0.59 and 0.80, respectively. Local collections produced older stories than non-Local collections, at a higher precision (relevance) of 0.84 compared to a non-Local precision of 0.72. These results indicate that Local collections are less exposed, and thus less popular, than their non-Local counterparts.
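A minimal sketch of the proximity-ranking idea behind the Geo service (not the LMP implementation) is shown below; the zip-code and media-outlet coordinates are made-up, approximate examples.

```python
# Hypothetical sketch: rank local news sources by great-circle distance to a
# user-supplied zip code, using assumed toy coordinates.
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometers."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

ZIP_COORDS = {"23185": (37.2707, -76.7075)}  # Williamsburg, VA (approximate)
SOURCES = [
    {"name": "Daily Press", "lat": 37.0433, "lon": -76.3855},
    {"name": "Richmond Times-Dispatch", "lat": 37.5407, "lon": -77.4360},
]

def sources_by_proximity(zip_code):
    lat, lon = ZIP_COORDS[zip_code]
    return sorted(SOURCES, key=lambda s: haversine_km(lat, lon, s["lat"], s["lon"]))

for source in sources_by_proximity("23185"):
    print(source["name"])  # nearest source first
```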
arXiv, 2020
We investigate the overlap of topics of online news articles from a variety of sources. To do this, we provide a platform for studying the news by measuring this overlap and scoring news stories according to the degree of attention they receive in near-real time. This can enable multiple studies, including identifying topics that receive the most attention from news organizations and identifying slow news days versus major news days. Our application, StoryGraph, periodically (at 10-minute intervals) extracts the first five news articles from the RSS feeds of 17 US news media organizations across the partisanship spectrum (left, center, and right). From these articles, StoryGraph extracts named entities (PEOPLE, LOCATIONS, ORGANIZATIONS, etc.) and then represents each news article with its set of extracted named entities. Finally, StoryGraph generates a news similarity graph where the nodes represent news articles, and an edge between a pair of nodes indicates a high degree of similarity between the pair of articles.
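As an illustrative sketch of the similarity-graph step (not StoryGraph's code), the snippet below links articles whose named-entity sets have a Jaccard similarity above a threshold; the articles and entities are assumed examples.

```python
# Hypothetical sketch: connect articles whose extracted named-entity sets are
# sufficiently similar, using Jaccard similarity over toy data.
from itertools import combinations

articles = {
    "cnn_1": {"HURRICANE HARVEY", "TEXAS", "FEMA"},
    "fox_3": {"HURRICANE HARVEY", "TEXAS", "HOUSTON"},
    "npr_2": {"TRAVEL BAN", "SUPREME COURT"},
}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def similarity_graph(entity_sets, threshold=0.3):
    """Return edges (article_i, article_j, similarity) at or above the threshold."""
    return [
        (i, j, round(jaccard(entity_sets[i], entity_sets[j]), 2))
        for i, j in combinations(entity_sets, 2)
        if jaccard(entity_sets[i], entity_sets[j]) >= threshold
    ]

print(similarity_graph(articles))  # [('cnn_1', 'fox_3', 0.5)]
```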