A Pipeline for Rapid Post-Crisis Twitter Data Acquisition, Filtering and Visualization

Technologies

Due to the instant availability of data on social media platforms like Twitter, and advances in machine learning and data management technology, real-time crisis informatics has emerged as a prolific research area in the last decade. Although several benchmarks are now available, especially on portals like CrisisLex, an important practical problem that has not been addressed thus far is the rapid acquisition, benchmarking and visual exploration of data from free, publicly available streams like the Twitter API in the immediate aftermath of a crisis. In this paper, we present such a pipeline for facilitating immediate post-crisis data collection, curation and relevance filtering from the Twitter API. The pipeline is minimally supervised and alleviates the need for feature engineering through a judicious mix of data preprocessing and fast text embeddings, combined with an active learning framework. We illustrate the utility of the pipeline by describing a recent case study wherein it was...
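To make the described stages concrete, below is a minimal sketch of a preprocessing-plus-relevance-filtering step of this kind. It is not the authors' implementation: the `preprocess` helper, the hashed bag-of-n-grams standing in for fast text embeddings, and all example tweets are assumptions for illustration.

```python
import re

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

def preprocess(tweet: str) -> str:
    """Light normalization: strip URLs, @-mentions, and the RT prefix."""
    tweet = re.sub(r"http\S+", " ", tweet)   # remove URLs
    tweet = re.sub(r"@\w+", " ", tweet)      # remove user mentions
    tweet = re.sub(r"^RT\b", " ", tweet)     # drop the retweet marker
    return re.sub(r"\s+", " ", tweet).strip().lower()

# Hashed bag-of-n-grams as a cheap stand-in for fast text embeddings.
vectorizer = HashingVectorizer(ngram_range=(1, 2), n_features=2**18)

# Tiny hypothetical seed set (1 = crisis-relevant, 0 = not).
seed = ["RT @user Bridge collapsed near downtown, avoid the area http://t.co/x",
        "Shelter open at Lincoln High for flood evacuees",
        "Can't wait for the game tonight!",
        "New phone, who dis"]
labels = [1, 1, 0, 0]

clf = SGDClassifier(loss="log_loss", random_state=0)
clf.fit(vectorizer.transform([preprocess(t) for t in seed]), labels)

# Relevance filtering: keep only tweets the model flags as relevant.
stream = ["Power lines down on 5th Ave, sparks everywhere", "lol great meme"]
flags = clf.predict(vectorizer.transform([preprocess(t) for t in stream]))
print([t for t, keep in zip(stream, flags) if keep == 1])
```

The hashing trick keeps the vectorizer stateless, which matters when new vocabulary keeps arriving from a live stream.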

CrisisBench: Benchmarking Crisis-related Social Media Datasets for Humanitarian Information Processing

Proceedings of the International AAAI Conference on Web and Social Media

The time-critical analysis of social media streams is important for humanitarian organizations to plan rapid response during disasters. The crisis informatics research community has developed several techniques and systems to process and classify big crisis-related data posted on social media. However, due to the dispersed nature of the datasets used in the literature, it is not possible to compare the results and measure the progress made towards better models for crisis informatics. In this work, we attempt to bridge this gap by combining various existing crisis-related datasets. We consolidate eight annotated data sources and provide 166.1k and 141.5k tweets for informativeness and humanitarian classification tasks, respectively. The consolidation results in a larger dataset that affords the ability to train more sophisticated models. To that end, we provide binary and multiclass classification results using CNN, FastText, and transformer-based models to address informativeness a...
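As a deliberately tiny illustration of the kind of linear baseline such a benchmark supports, the sketch below trains a FastText-style uni/bi-gram classifier with scikit-learn. The category names and example tweets are hypothetical, not CrisisBench's actual label set, which ships with the released data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy tweets with hypothetical humanitarian labels.
texts = ["Donate blood at the community center to help earthquake victims",
         "Three injured after the levee broke last night",
         "Road to the harbor blocked by debris, seek alternate routes"]
labels = ["donations_and_volunteering",
          "injured_or_dead_people",
          "infrastructure_damage"]

# Linear model over word uni/bi-grams, in the spirit of a FastText baseline.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(texts, labels)
print(model.predict(["Volunteers needed to distribute water bottles"]))
```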

Active Learning for Social Media Analysis in Crisis Situations

2020

Social media has become an important open communication medium during crises. This has motivated much work on analyzing social media data in crisis situations with machine learning, but that work has mostly relied on traditional techniques. Those methods have shown mixed results and are criticized for being unable to generalize beyond the scope of the study they were designed for. Since every crisis is unique, such retrospective models have little value. In contrast, active learning shows very promising results when learning in noisy environments such as image classification and game playing. It therefore has great potential to play a significant role in future social media analysis in noisy crisis situations. This position paper proposes an approach to improve social media analysis in crisis situations to achieve better understanding and decision support during a crisis. In this approach, we aim to use active learning to extract features and patterns related to the text and conce...
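The core active-learning idea is a loop: train on a small labeled seed, ask a human to label only the examples the model is least sure about, and retrain. Below is a minimal uncertainty-sampling sketch; the pool, seed labels, and `oracle` stand-in for a human annotator are all hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def oracle(text: str) -> int:
    """Stand-in for a human annotator (1 = crisis-relevant)."""
    return int(any(w in text for w in ("flood", "rescue", "evacuation")))

# Hypothetical unlabeled pool and a two-example labeled seed.
pool = ["flooding on main st", "happy birthday!", "need rescue boat now",
        "best pizza in town", "evacuation ordered for zone b"]
labeled_x, labeled_y = ["water rising fast", "lol cats"], [1, 0]

vec = TfidfVectorizer().fit(pool + labeled_x)
clf = LogisticRegression()

for _ in range(3):  # a few active-learning rounds
    clf.fit(vec.transform(labeled_x), labeled_y)
    probs = clf.predict_proba(vec.transform(pool))[:, 1]
    # Uncertainty sampling: query the tweet closest to the decision boundary.
    i = int(np.argmin(np.abs(probs - 0.5)))
    labeled_x.append(pool.pop(i))
    labeled_y.append(oracle(labeled_x[-1]))
```

Each round spends annotation effort where the classifier is most confused, which is what makes the approach attractive in fast-moving, noisy crisis settings.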

Standardizing and Benchmarking Crisis-related Social Media Datasets for Humanitarian Information Processing

arXiv preprint, 2020

Time-critical analysis of social media streams is important for humanitarian organizations to plan rapid response during disasters. The crisis informatics research community has developed several techniques and systems to process and classify big crisis-related data posted on social media. However, due to the dispersed nature of the datasets used in the literature, it is not possible to compare the results and measure the progress made towards better models for crisis informatics. In this work, we attempt to bridge this gap by standardizing various existing crisis-related datasets. We consolidate labels of eight annotated data sources and provide 166.1k and 141.5k tweets for informativeness and humanitarian classification tasks, respectively. The consolidation results in a larger dataset that affords the ability to train more sophisticated models. To that end, we provide baseline results using CNN and BERT models. We make the dataset available at https://crisisnlp.qcri.org/crisis_datasets_benchmarks.html.
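For reference, a CNN text classifier of the kind used as a baseline here can be very compact. The sketch below is a minimal Kim-style text CNN in Keras on toy data; the examples are hypothetical and this is not the authors' configuration.

```python
import tensorflow as tf

# Toy binary informativeness data (hypothetical examples).
texts = tf.constant(["bridge out on route 9, avoid the area",
                     "great concert last night",
                     "shelter open at the gym on oak street",
                     "new season drops on friday"])
labels = tf.constant([1, 0, 1, 0])

vectorize = tf.keras.layers.TextVectorization(max_tokens=10_000,
                                              output_sequence_length=32)
vectorize.adapt(texts)

# Minimal text CNN: embed, convolve, max-pool, classify.
model = tf.keras.Sequential([
    vectorize,
    tf.keras.layers.Embedding(10_000, 64),
    tf.keras.layers.Conv1D(128, 5, activation="relu"),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(texts, labels, epochs=3, verbose=0)
```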

Tweet4act: Using Incident-Specific Profiles for Classifying Crisis-Related Messages

In Proceedings of the 10th International Conference on Information Systems for Crisis Response and Management (ISCRAM), Baden-Baden, Germany, 2013

We present Tweet4act, a system to detect and classify crisis-related messages communicated over a microblogging platform. Our system relies on extracting content features from each message. These features, together with an incident-specific dictionary, allow us to determine the incident period that each message belongs to. The period types are: pre-incident (messages about prevention, mitigation, and preparedness), during-incident (messages sent while the incident is taking place), and post-incident (messages related to response, recovery, and reconstruction). We show that our detection method can effectively identify incident-related messages with high precision and recall, and that our incident-period classification method outperforms standard machine learning classification methods.
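A dictionary-based period classifier of this shape can be sketched in a few lines. The lexicons below are invented for illustration; Tweet4act's actual incident-specific dictionaries are not reproduced in this excerpt.

```python
# Hypothetical incident-specific lexicons, one per period type.
PERIOD_LEXICONS = {
    "pre-incident":    {"warning", "prepare", "forecast", "mitigation"},
    "during-incident": {"happening", "right now", "shaking", "flooding"},
    "post-incident":   {"recovery", "rebuild", "donate", "cleanup"},
}

def classify_period(message: str) -> str:
    """Assign the period whose lexicon matches the message most often."""
    text = message.lower()
    scores = {period: sum(term in text for term in terms)
              for period, terms in PERIOD_LEXICONS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unrelated"

print(classify_period("Streets are flooding right now near the station"))
print(classify_period("Join the cleanup crew this weekend to rebuild homes"))
```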

Extracting Valuable Information from Twitter during Natural Disasters

Social media is a vital source of information during any major event, especially natural disasters. However, as the volume of social media data increases exponentially, so does the volume of conversational data that provides no valuable information, especially in the context of disaster events, diminishing people's ability to find the information they need to organize relief efforts, find help, and potentially save lives. This project focuses on the development of a Bayesian approach to classifying tweets (posts on Twitter) during Hurricane Sandy in order to distinguish "informational" from "conversational" tweets. We designed an effective set of features and used them as input to Naïve Bayes classifiers. In comparison to a "bag of words" approach, the new feature set provides similar results in the classification of tweets. However, the designed feature set contains only 9 features, compared with more than 3,000 features for "bag of words." When the feature set is combined with "bag of words", accuracy reaches 85.2914%. If integrated into disaster-related systems, our approach can serve as a boon to any person or organization seeking to extract useful information in the midst of a natural disaster.
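The appeal of a small hand-designed feature set is that each tweet reduces to a handful of cheap binary signals before hitting a Naive Bayes classifier. The sketch below uses five illustrative features and toy labels; the paper's exact 9-feature set is not listed in this excerpt.

```python
import re
from sklearn.naive_bayes import BernoulliNB

def features(tweet: str) -> list[int]:
    """A few illustrative binary features (hypothetical, not the paper's 9)."""
    return [
        int(bool(re.search(r"http\S+", tweet))),  # contains a URL
        int("@" in tweet),                        # mentions a user
        int(tweet.startswith("RT")),              # is a retweet
        int(bool(re.search(r"\d", tweet))),       # contains a number
        int("?" in tweet),                        # asks a question
    ]

texts = ["RT @redcross shelter at 120 Main St http://t.co/x",
         "omg did you see that??", "Road closed at exit 4", "so bored today"]
y = [1, 0, 1, 0]  # 1 = informational, 0 = conversational (toy labels)

clf = BernoulliNB().fit([features(t) for t in texts], y)
print(clf.predict([features("Water station open at the park, bring ID")]))
```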

Crisis Event Extraction Service (CREES) - Automatic Detection and Classification of Crisis-related Content on Social Media

2018

Social media posts tend to provide valuable reports during crises. However, this information can be hidden in large amounts of unrelated documents. Providing tools that automatically identify relevant posts, event types (e.g., hurricanes, floods, etc.) and information categories (e.g., reports on affected individuals, donations and volunteering, etc.) in social media posts is vital for their efficient handling and consumption. We introduce the Crisis Event Extraction Service (CREES), an open-source web API that automatically classifies posts during crisis situations. The API provides annotations for crisis-related documents, event types and information categories through an easily deployable and accessible web API that can be integrated into multiple platforms and tools. The annotation service is backed by Convolutional Neural Networks (CNNs) and validated against traditional machine learning models. Results show that the CNN-based API can be relied upon when dealing with spec...
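An annotation service of this shape is easy to picture as a small web endpoint. The sketch below is not CREES itself: the `/annotate` route, payload shape, and keyword-based `classify` stub (standing in for the CNN models) are all hypothetical.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def classify(text: str) -> dict:
    """Keyword stub standing in for the CNN classifiers behind the service."""
    lowered = text.lower()
    event = "flood" if "flood" in lowered else "unknown"
    category = "donations" if "donat" in lowered else "other"
    return {"crisis_related": event != "unknown",
            "event_type": event, "info_category": category}

@app.post("/annotate")
def annotate():
    text = request.get_json(force=True).get("text", "")
    return jsonify(classify(text))

if __name__ == "__main__":
    app.run(port=5000)
```

A client would POST JSON such as {"text": "..."} and get back the three annotations, which is the kind of integration point the abstract describes.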

A language-agnostic approach to extract informative tweets during emergency situations

2017 IEEE International Conference on Big Data (Big Data), 2017

In this paper, we propose a machine learning approach to automatically classify non-informative and informative content shared on Twitter during disasters caused by natural hazards. In particular, we leverage previously sampled and labeled datasets of messages posted on Twitter during or in the aftermath of natural disasters. Starting from results obtained in previous studies, we propose a language-agnostic model. We define a base feature set considering only the Twitter-specific metadata of each tweet, using classification results from this set as a reference. We introduce an additional feature, called the Source Feature, which is computed from the device or platform used to post a tweet, and we evaluate its contribution to improving the classifier's accuracy. Index Terms: disaster relief; social media analysis; classification; machine learning; real-world traces.
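The posting client is available in tweet metadata: the classic Twitter API (v1.1) returns a `source` field as an HTML anchor tag naming the app used. A minimal extraction sketch (the feature-engineering details around it are not specified in this excerpt):

```python
import re

def source_feature(source_html: str) -> str:
    """Extract the posting client from a tweet's `source` metadata field,
    e.g. '<a href="...">Twitter for Android</a>' -> 'Twitter for Android'."""
    match = re.search(r">([^<]+)<", source_html)
    return match.group(1) if match else "unknown"

print(source_feature('<a href="http://twitter.com/download/android">'
                     'Twitter for Android</a>'))  # Twitter for Android
```

The resulting categorical value can then be one-hot encoded and appended to the metadata-only base feature set, keeping the model language-agnostic since no tweet text is used.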

HumAID: Human-Annotated Disaster Incidents Data from Twitter with Deep Learning Benchmarks

2021

Social networks are widely used for information consumption and dissemination, especially during time-critical events such as natural disasters. Despite its significantly large volume, social media content is often too noisy for direct use in any application. Therefore, it is important to filter, categorize, and concisely summarize the available content to facilitate effective consumption and decision-making. To address such issues, automatic classification systems have been developed using supervised modeling approaches, thanks to earlier efforts on creating labeled datasets. However, existing datasets are limited in different aspects (e.g., size, presence of duplicates) and less suitable for supporting more advanced and data-hungry deep learning models. In this paper, we present a new large-scale dataset with ∼77K human-labeled tweets, sampled from a pool of ∼24 million tweets across 19 disaster events that happened between 2016 and 2019. Moreover, we propose a data collection and sam...
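Since the abstract calls out duplicates as a weakness of earlier datasets, one concrete curation step is near-duplicate removal via text normalization, sketched below. The normalization rules and examples are assumptions; HumAID's actual sampling pipeline is truncated in this excerpt.

```python
import re

def normalize(tweet: str) -> str:
    """Canonical form for duplicate detection: strip URLs, mentions,
    the RT marker, punctuation, and case."""
    tweet = re.sub(r"http\S+|@\w+|^RT\b", " ", tweet)
    return re.sub(r"[^a-z0-9 ]", "", tweet.lower()).strip()

def deduplicate(tweets):
    seen, unique = set(), []
    for t in tweets:
        key = normalize(t)
        if key not in seen:
            seen.add(key)
            unique.append(t)
    return unique

print(deduplicate([
    "RT @a Flood waters rising in zone 3 http://t.co/x",
    "Flood waters rising in zone 3",
    "Shelter open at city hall",
]))  # the retweet and its original collapse to one entry
```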

User-Assisted Information Extraction from Twitter During Emergencies

2017

Disasters and emergencies bring uncertain situations. People involved in such situations seek quick answers to their urgent queries. Moreover, humanitarian organizations look for situational-awareness information to launch relief operations. Existing studies show the usefulness of social media content during crisis situations. However, despite advances in information retrieval and text processing techniques, access to relevant information on Twitter is still a challenging task. In this paper, we propose a novel approach that provides timely access to relevant information on Twitter. Specifically, we employ Word2vec embeddings to expand users' initial queries, and, based on a relevance feedback mechanism, we retrieve relevant messages on Twitter in real time. Initial experiments and user studies performed using a real-world disaster dataset show the significance of the proposed approach.
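Query expansion with Word2vec amounts to adding each query term's nearest embedding neighbours before retrieval. A toy sketch with gensim follows; the corpus and query are hypothetical, whereas the paper trains on a real-world disaster dataset and adds relevance feedback on top.

```python
from gensim.models import Word2Vec

# Tiny toy corpus of tokenized tweets (hypothetical).
corpus = [
    ["flood", "water", "rising", "downtown"],
    ["rescue", "boat", "needed", "flood", "zone"],
    ["water", "levels", "rising", "near", "river"],
    ["volunteers", "needed", "rescue", "operations"],
]
model = Word2Vec(corpus, vector_size=32, window=3, min_count=1, seed=0)

def expand_query(terms, topn=2):
    """Add each term's nearest embedding neighbours to the query."""
    expanded = set(terms)
    for term in terms:
        if term in model.wv:
            expanded.update(w for w, _ in model.wv.most_similar(term, topn=topn))
    return expanded

print(expand_query(["flood", "rescue"]))
```

The expanded term set is then matched against the live stream, and user judgments on the retrieved tweets can feed back into further rounds of expansion.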