Extracting Humanitarian Information from Tweets

Natural Language Processing to the Rescue? Extracting "Situational Awareness" Tweets During Mass Emergency

In times of mass emergency, vast amounts of data are generated via computer-mediated communication (CMC) that are difficult to manually cull and organize into a coherent picture. Yet valuable information is broadcast, and it can provide useful insight into time- and safety-critical situations if captured and analyzed properly and rapidly. We describe an approach for automatically identifying messages communicated via Twitter that contribute to situational awareness, and explain why it is beneficial for those seeking information during mass emergencies.

A language-agnostic approach to extract informative tweets during emergency situations

2017 IEEE International Conference on Big Data (Big Data), 2017

In this paper, we propose a machine learning approach to automatically classify non-informative and informative content shared on Twitter during disasters caused by natural hazards. In particular, we leverage previously sampled and labeled datasets of messages posted on Twitter during or in the aftermath of natural disasters. Starting from results obtained in previous studies, we propose a language-agnostic model. We define a base feature set considering only the Twitter-specific metadata of each tweet, using classification results from this set as a reference. We introduce an additional feature, called the Source Feature, which is computed from the device or platform used to post a tweet, and we evaluate its contribution to improving classifier accuracy.

Index Terms: Disaster relief; social media analysis; classification; machine learning; real-world traces.
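
The abstract does not spell out the exact metadata features, so the following is only a minimal sketch of the idea: a classifier that ignores tweet text entirely and uses Twitter metadata plus a one-hot encoded "Source Feature" derived from the posting client. Column names and the choice of a random forest are assumptions.

```python
# Hypothetical, language-agnostic informativeness classifier: metadata-only
# base features plus a one-hot encoded "source" (posting client) feature.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Assumed CSV layout: retweet_count, followers_count, friends_count,
# has_url, has_media, source, label (informative / not_informative).
df = pd.read_csv("labeled_tweets.csv")

base_features = ["retweet_count", "followers_count", "friends_count",
                 "has_url", "has_media"]

pre = ColumnTransformer(
    [("source", OneHotEncoder(handle_unknown="ignore"), ["source"])],  # Source Feature
    remainder="passthrough")  # base metadata features pass through unchanged

clf = Pipeline([("pre", pre),
                ("rf", RandomForestClassifier(n_estimators=200, random_state=0))])

X, y = df[base_features + ["source"]], df["label"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)
clf.fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```

Comparing this pipeline with and without the `source` column mirrors the paper's evaluation of the Source Feature's contribution against the metadata-only baseline.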

CrisisBench: Benchmarking Crisis-related Social Media Datasets for Humanitarian Information Processing

Proceedings of the International AAAI Conference on Web and Social Media

The time-critical analysis of social media streams is important for humanitarian organizations to plan rapid response during disasters. The crisis informatics research community has developed several techniques and systems to process and classify big crisis-related data posted on social media. However, due to the dispersed nature of the datasets used in the literature, it is not possible to compare the results and measure the progress made towards better models for crisis informatics. In this work, we attempt to bridge this gap by combining various existing crisis-related datasets. We consolidate eight annotated data sources and provide 166.1k and 141.5k tweets for informativeness and humanitarian classification tasks, respectively. The consolidation results in a larger dataset that affords the ability to train more sophisticated models. To that end, we provide binary and multiclass classification results using CNN, FastText, and transformer-based models to address the informativeness and humanitarian classification tasks.
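
The exact training setup is not given in the abstract; the sketch below shows one plausible way to fine-tune a transformer on the binary informativeness task with the Hugging Face libraries. File and column names (`tweet_text`, `label`) are assumptions.

```python
# Illustrative fine-tuning script (not the benchmark's official code).
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Assumed CSV files with columns: tweet_text, label (0 = not informative, 1 = informative).
raw = load_dataset("csv", data_files={"train": "informativeness_train.csv",
                                      "dev": "informativeness_dev.csv"})

def tokenize(batch):
    return tokenizer(batch["tweet_text"], truncation=True, max_length=128)

data = raw.map(tokenize, batched=True)

args = TrainingArguments(output_dir="out", num_train_epochs=3,
                         per_device_train_batch_size=32)

trainer = Trainer(model=model, args=args, tokenizer=tokenizer,
                  train_dataset=data["train"], eval_dataset=data["dev"])
trainer.train()
print(trainer.evaluate())
```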

User-Assisted Information Extraction from Twitter During Emergencies

2017

Disasters and emergencies bring uncertain situations. People involved in such situations look for quick answers to their urgent queries. Moreover, humanitarian organizations look for situational awareness information to launch relief operations. Existing studies show the usefulness of social media content during crisis situations. However, despite advances in information retrieval and text processing techniques, access to relevant information on Twitter is still a challenging task. In this paper, we propose a novel approach to provide timely access to relevant information on Twitter. Specifically, we employ Word2vec embeddings to expand initial user queries, and, based on a relevance feedback mechanism, we retrieve relevant messages from Twitter in real time. Initial experiments and user studies performed using a real-world disaster dataset show the significance of the proposed approach.
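
As a rough sketch of the retrieval idea (not the authors' implementation), the snippet below expands a query with Word2vec nearest neighbours, retrieves tweets containing any expanded term, and folds frequent terms from user-confirmed relevant tweets back into the query as a simple relevance-feedback step.

```python
# Word2vec query expansion with naive relevance feedback (illustrative only).
from collections import Counter
from gensim.models import Word2Vec

def build_model(tokenized_tweets):
    # tokenized_tweets: list of token lists, e.g. [["bridge", "collapsed"], ...]
    return Word2Vec(sentences=tokenized_tweets, vector_size=100,
                    window=5, min_count=2)

def expand(model, query_terms, topn=5):
    expanded = set(query_terms)
    for term in query_terms:
        if term in model.wv:
            expanded.update(w for w, _ in model.wv.most_similar(term, topn=topn))
    return expanded

def retrieve(tokenized_tweets, terms):
    return [t for t in tokenized_tweets if terms & set(t)]

def feedback(terms, relevant_tweets, k=10):
    # Add the k most frequent terms from tweets the user marked relevant.
    counts = Counter(w for t in relevant_tweets for w in t)
    return terms | {w for w, _ in counts.most_common(k)}
```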

Standardizing and Benchmarking Crisis-related Social Media Datasets for Humanitarian Information Processing

arXiv (Cornell University), 2020

Time-critical analysis of social media streams is important for humanitarian organizations to plan rapid response during disasters. The crisis informatics research community has developed several techniques and systems to process and classify big crisis-related data posted on social media. However, due to the dispersed nature of the datasets used in the literature, it is not possible to compare the results and measure the progress made towards better models for crisis informatics. In this work, we attempt to bridge this gap by standardizing various existing crisis-related datasets. We consolidate labels from eight annotated data sources and provide 166.1k and 141.5k tweets for informativeness and humanitarian classification tasks, respectively. The consolidation results in a larger dataset that affords the ability to train more sophisticated models. To that end, we provide baseline results using CNN and BERT models. We make the dataset available at https://crisisnlp.qcri.org/crisis_datasets_benchmarks.html.
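
For the CNN baseline mentioned above, a common text-CNN architecture looks like the following Keras sketch; the specific layer sizes are assumptions, not the paper's reported configuration.

```python
# Simple text-CNN informativeness classifier (illustrative architecture).
import tensorflow as tf

def build_cnn(vocab_size=20000, seq_len=64, embed_dim=128):
    vectorize = tf.keras.layers.TextVectorization(
        max_tokens=vocab_size, output_sequence_length=seq_len)
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(1,), dtype=tf.string),
        vectorize,
        tf.keras.layers.Embedding(vocab_size, embed_dim),
        tf.keras.layers.Conv1D(128, 5, activation="relu"),
        tf.keras.layers.GlobalMaxPooling1D(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # informative vs. not
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model, vectorize

# Usage: model, vectorize = build_cnn(); vectorize.adapt(train_texts)
# must run before model.fit(train_texts, train_labels, ...).
```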

Extracting Valuable Information from Twitter during Natural Disasters

Social media is a vital source of information during any major event, especially natural disasters. However, with the exponential increase in the volume of social media data comes an increase in conversational data that does not provide valuable information, especially in the context of disaster events, diminishing people's ability to find the information they need to organize relief efforts, find help, and potentially save lives. This project focuses on the development of a Bayesian approach to the classification of tweets (posts on Twitter) during Hurricane Sandy in order to distinguish "informational" from "conversational" tweets. We designed an effective set of features and used them as input to Naïve Bayes classifiers. In comparison to a "bag of words" approach, the new feature set provides similar results in the classification of tweets. However, the designed feature set contains only 9 features compared with more than 3,000 features for "bag of words." When the feature set is combined with "bag of words," accuracy reaches 85.2914%. If integrated into disaster-related systems, our approach can serve as a boon to any person or organization seeking to extract useful information in the midst of a natural disaster.
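
The abstract does not enumerate the nine designed features, so the features below are invented stand-ins; the sketch only illustrates the overall setup of combining a small hand-crafted feature set with "bag of words" as input to a Naïve Bayes classifier.

```python
# Hand-crafted features + bag of words feeding a Naive Bayes classifier (sketch).
import re
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

def handcrafted(texts):
    # Illustrative features only; not the paper's exact nine features.
    return np.array([[
        len(t.split()),                  # word count
        int("http" in t),                # contains a URL
        int("#" in t),                   # contains a hashtag
        int("@" in t),                   # contains a mention
        int(bool(re.search(r"\d", t))),  # contains a digit
        int(t.isupper()),                # written in all caps
    ] for t in texts])

model = Pipeline([
    ("features", FeatureUnion([
        ("designed", FunctionTransformer(handcrafted, validate=False)),
        ("bow", CountVectorizer(max_features=3000)),
    ])),
    ("nb", MultinomialNB()),
])

# Usage: model.fit(train_texts, train_labels); model.predict(test_texts)
```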

HumSet: Dataset of Multilingual Information Extraction and Classification for Humanitarian Crisis Response

Cornell University - arXiv, 2022

Timely and effective response to humanitarian crises requires quick and accurate analysis of large amounts of text data, a process that can greatly benefit from expert-assisted NLP systems trained on validated and annotated data in the humanitarian response domain. To enable the creation of such NLP systems, we introduce and release HUMSET, a novel and rich multilingual dataset of humanitarian response documents annotated by experts in the humanitarian response community. The dataset provides documents in three languages (English, French, Spanish) and covers a variety of humanitarian crises from 2018 to 2021 across the globe. For each document, HUMSET provides selected snippets (entries) as well as classes assigned to each entry, annotated using common humanitarian information analysis frameworks. HUMSET also provides novel and challenging entry extraction and multi-label entry classification tasks. In this paper, we take a first step towards approaching these tasks and conduct a set of experiments on pre-trained language models (PLMs) to establish strong baselines for future research in this domain. The dataset is available at https://blog.thedeep.io/humset/.
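
HUMSET's classification task is multi-label. While the paper's baselines fine-tune pre-trained language models, the lightweight sketch below uses TF-IDF with a one-vs-rest classifier simply to illustrate the multi-label entry-classification setup; the entries and label names are invented examples, not taken from the dataset.

```python
# Multi-label entry classification, illustrated with a TF-IDF baseline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer

entries = [
    "Cholera cases are rising in the displacement camps",
    "Schools in the affected region remain closed after the floods",
]
labels = [["health", "wash"], ["education"]]  # multiple classes per entry

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("ovr", OneVsRestClassifier(LogisticRegression(max_iter=1000))),
])
clf.fit(entries, Y)

pred = clf.predict(["Measles outbreak reported near the border"])
print(mlb.inverse_transform(pred))
```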

Aid is Out There: Looking for Help from Tweets during a Large Scale Disaster

The 2011 Great East Japan Earthquake caused a wide range of problems, and many aid activities were carried out as countermeasures. Many of these problems and aid activities were reported via Twitter. However, most problem reports and corresponding aid messages were not successfully exchanged between victims and local governments or humanitarian organizations, who were overwhelmed by the vast amount of information. As a result, victims could not receive necessary aid and humanitarian organizations wasted resources on redundant efforts. In this paper, we propose a method for discovering matches between problem reports and aid messages. Our system contributes to problem-solving in a large-scale disaster situation by facilitating communication between victims and humanitarian organizations.
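
A much simplified sketch of the matching idea (not the authors' system) is to pair each problem report with its most similar aid message by TF-IDF cosine similarity; the example tweets below are invented.

```python
# Match problem reports to aid messages via TF-IDF cosine similarity (sketch).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

problems = ["We are out of drinking water at the evacuation shelter in Ishinomaki"]
aid_offers = ["Bottled water available for pickup near Ishinomaki city hall",
              "Free blankets being distributed at Sendai station"]

vec = TfidfVectorizer()
matrix = vec.fit_transform(problems + aid_offers)
sims = cosine_similarity(matrix[:len(problems)], matrix[len(problems):])

for i, report in enumerate(problems):
    best = sims[i].argmax()
    print(f"{report}\n  -> {aid_offers[best]} (score={sims[i][best]:.2f})")
```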

Information Extraction from Microblog for Disaster Related Event

2017

This paper presents the participation of the Information Retrieval Lab (IRLAB) at DAIICT Gandhinagar, India, in the Data Challenge track of SMERP 2017. This year, the SMERP Data Challenge track offered a Text Extraction task on the Italy earthquake tweet dataset, with the objective of retrieving relevant tweets with high recall and high precision. We submitted three runs for this task and describe the different approaches adopted. Initially, we performed query expansion on the topics using WordNet. In the first run, we ranked tweets using cosine similarity against the topics. In the second run, the relevance score between tweets and the topic was calculated using the Okapi BM25 ranking function, and in the third run the relevance score was calculated using a language model with Jelinek-Mercer smoothing.
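
The second run's pipeline, WordNet query expansion followed by Okapi BM25 ranking, can be sketched as follows; it assumes the nltk and rank_bm25 packages (with a one-time nltk.download("wordnet")), and the sample tweets are invented.

```python
# WordNet query expansion + BM25 ranking over tokenized tweets (sketch).
from nltk.corpus import wordnet as wn
from rank_bm25 import BM25Okapi

def expand_query(terms):
    expanded = set(terms)
    for term in terms:
        for syn in wn.synsets(term):
            # Keep only single-word synonyms so they match individual tokens.
            expanded.update(l.name().lower() for l in syn.lemmas()
                            if "_" not in l.name())
    return list(expanded)

tweets = [["collapsed", "buildings", "reported", "in", "amatrice"],
          ["people", "need", "medical", "supplies", "and", "shelter"]]

bm25 = BM25Okapi(tweets)
query = expand_query(["shelter", "need"])
scores = bm25.get_scores(query)
ranked = sorted(zip(scores, tweets), key=lambda p: p[0], reverse=True)
print(ranked)
```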