Resources for Research on Crisis Informatics Topics
The following resources are available to help researchers and technologists advance humanitarian and crisis computing by developing new computational models, innovative techniques, and systems useful for humanitarian aid.
This resource consists of Twitter data collected during 19 natural and human-induced disasters. Each dataset contains tweet-ids and human-labeled tweets of the event. The resource also includes a dictionary of out-of-vocabulary (OOV) words, a word2vec model, and a tweet downloader tool. Please cite the following paper if you use any of these resources in your research.
Muhammad Imran, Prasenjit Mitra, and Carlos Castillo: Twitter as a Lifeline: Human-annotated Twitter Corpora for NLP of Crisis-related Messages. In Proceedings of the 10th Language Resources and Evaluation Conference (LREC), pp. 1638-1643. May 2016, Portorož, Slovenia. [Bibtex]
Resource details and downloading »
This resource consists of human-labeled tweets collected during Hurricane Sandy (2012) and the Joplin tornado (2011). Please cite the following paper if you use this resource in your research.
Muhammad Imran, Shady Elbassuoni, Carlos Castillo, Fernando Diaz, and Patrick Meier. Practical Extraction of Disaster-Relevant Information from Social Media. In Proceedings of the 22nd international conference on World Wide Web companion, May 2013, Rio de Janeiro, Brazil. [Bibtex]
Dataset
This resource consists of human-labeled tweets collected during the 2011 Joplin tornado and labeled into humanitarian categories. Please cite the following paper if you use this resource in your research.
Muhammad Imran, Shady Elbassuoni, Carlos Castillo, Fernando Diaz, and Patrick Meier. Extracting Information Nuggets from Disaster-Related Messages in Social Media. In Proceedings of the 10th International Conference on Information Systems for Crisis Response and Management (ISCRAM), May 2013, Baden-Baden, Germany. [Bibtex]
Dataset
This resource provides ready-to-use Python implementations of several neural network and non-neural network based classifiers for crisis-related Twitter data. Please cite the following paper if you use this resource in your research.
Dat Tien Nguyen, Kamela Ali Al-Mannai, Shafiq Joty, Hassan Sajjad, Muhammad Imran, Prasenjit Mitra. Robust Classification of Crisis-Related Data on Social Networks using Convolutional Neural Networks. In Proceedings of the 11th International AAAI Conference on Web and Social Media (ICWSM), 2017, Montreal, Canada.
Resource details and downloading »
This resource provides human-labeled multimodal datasets comprising tweets and images collected during seven major natural disasters. Please cite the following paper if you use this resource in your research.
Firoj Alam, Ferda Ofli, and Muhammad Imran, CrisisMMD: Multimodal Twitter Datasets from Natural Disasters. In Proceedings of the 12th International AAAI Conference on Web and Social Media (ICWSM), 2018, Stanford, California, USA. [Bibtex]
Resource details and downloading »
This resource comprises tweet-ids and a sample of raw tweets (50k) collected during three devastating hurricanes in 2017, namely Hurricane Harvey, Hurricane Irma, and Hurricane Maria.
Firoj Alam, Ferda Ofli, Muhammad Imran, Michael Aupetit. A Twitter Tale of Three Hurricanes: Harvey, Irma, and Maria. In Proceedings of the 15th International Conference on Information Systems for Crisis Response and Management (ISCRAM), May 2018, Rochester, NY, USA. [Bibtex]
Dataset (~64MB)
This resource comprises human-labeled tweets collected from the 2015 Nepal earthquake and the 2013 Queensland floods.
Firoj Alam, Shafiq Joty, Muhammad Imran. Domain Adaptation with Adversarial Training and Graph Embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), 2018, Melbourne, Australia. [Bibtex]
Dataset (~7MB)
This resource is a Java-based tool for downloading full tweet content from tweet-ids. The tool can make 180 API calls per 15 minutes, and each call downloads up to 100 tweets, so it can download up to 72,000 tweets per hour.
Tool
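The throughput figure above follows from simple arithmetic, which the sketch below reproduces. It is an illustration only and not part of the Java tool: the function names (tweets_per_hour, batch_ids, estimated_hours) are hypothetical, and the only values taken from the description are the 180 calls per 15 minutes and 100 tweets per call.

```python
from typing import Iterator, List

# Limits quoted in the tool description above.
CALLS_PER_WINDOW = 180   # API calls allowed per rate-limit window
WINDOW_MINUTES = 15      # length of one rate-limit window
TWEETS_PER_CALL = 100    # maximum tweets returned by one call


def tweets_per_hour() -> int:
    """Upper bound on tweets downloadable per hour under the quoted limits."""
    windows_per_hour = 60 // WINDOW_MINUTES
    return CALLS_PER_WINDOW * TWEETS_PER_CALL * windows_per_hour


def batch_ids(tweet_ids: List[str], size: int = TWEETS_PER_CALL) -> Iterator[List[str]]:
    """Split a list of tweet-ids into batches of at most `size` (one batch per call)."""
    for start in range(0, len(tweet_ids), size):
        yield tweet_ids[start:start + size]


def estimated_hours(num_ids: int) -> float:
    """Rough lower bound on download time, assuming every call is full and succeeds."""
    return num_ids / tweets_per_hour()


if __name__ == "__main__":
    print(tweets_per_hour())         # 180 * 100 * 4
    print(estimated_hours(500_000))  # hours needed for 500k tweet-ids, at best
```

In practice the real download rate is lower, since deleted or protected tweets return nothing and calls can fail; the estimate is a best case.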
This corpus comprises images collected from Twitter during four natural disasters, namely Typhoon Ruby (2014), Nepal Earthquake (2015), Ecuador Earthquake (2016), and Hurricane Matthew (2016). In addition to Twitter images, it contains images collected from Google using queries such as "damage building", "damage bridge", and "damage road" to address the scarcity of labeled data.
Dat Tien Nguyen, Ferda Ofli, Muhammad Imran, Prasenjit Mitra. Damage Assessment from Social Media Imagery Data During Disasters. In Proceedings of the IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), 2017, Sydney, Australia. [Bibtex]
Labeled images (~5GB)
This resource comprises human-labeled tweets collected from the 2015 Nepal earthquake and the 2013 Queensland floods.
Firoj Alam, Shafiq Joty, Muhammad Imran. Graph Based Semi-supervised Learning with Convolutional Neural Networks to Classify Crisis Related Tweets. In Proceedings of the International AAAI Conference on Web and Social Media (ICWSM), 2018, Stanford, California, USA.
Dataset (~7MB)
This resource comprises data related to the 2018 California wildfires (a.k.a. Camp Fire). Specifically, it contains (1) the names of missing and found people, (2) the web sources from which the names were taken, and (3) hashtags related to missing, lost, and found people. We will publish tweet-ids soon.
Humaira Waqas, Muhammad Imran. #CampFireMissing: An Analysis of Tweets About Missing and Found People during California Wildfires. In Proceedings of the 16th International Conference on Information Systems for Crisis Response and Management (ISCRAM), 2019, Valencia, Spain.
Tweet-ids Meta-data
This resource comprises ~14,000 labeled tweets collected during several natural disasters, including hurricanes, earthquakes, floods, and forest fires. The tweets were annotated following the eyewitness taxonomy defined in the paper below.
Kiran Zahra, Muhammad Imran, Frank Ostermann. Automatic Identification of Eyewitness Messages on Twitter During Disasters. In the Journal of Information Processing & Management (IP&M), 2020.
Dataset
This resource contains images and expert annotations for the detection of damaged heritage sites. The images were downloaded from Google and annotated using two annotation schemes: the first indicates whether an image shows a heritage site, and the second indicates whether a heritage image shows damage.
Pakhee Kumar, Ferda Ofli, Muhammad Imran, Carlos Castillo. Detection of Disaster-Affected Cultural Heritage Sites from Social Media Images Using Deep Learning Techniques. Accepted in the ACM Journal on Computing and Cultural Heritage (JOCCH), 2020.
Resource details and downloading »
This resource contains more than 500 million tweets related to the COVID-19 pandemic. The dataset's geographic coverage spans 218 countries and 47K cities around the globe, and it covers 62 international languages.
Umair Qazi, Muhammad Imran, Ferda Ofli. GeoCoV19: A Dataset of Hundreds of Millions of Multilingual COVID-19 Tweets with Location Information. To appear in ACM SIGSPATIAL Special, May 2020. (arXiv)
Resource details and downloading »
The crisis image benchmark dataset consists of data from several sources, such as Disasters on Social Media (DSM), CrisisMMD, and AIDR. The purpose of this work was to develop a consolidated dataset, create non-overlapping train/dev/test sets, and provide benchmark results for the community.
Firoj Alam, Ferda Ofli, Muhammad Imran, Tanvirul Alam, Umair Qazi, Deep Learning Benchmarks and Datasets for Social Media Image Classification for Disaster Response, In 2020 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), 2020. [Bibtex]
Resource details and downloading »
The crisis benchmark dataset consists of data from several different sources, such as CrisisLex (CrisisLex26, CrisisLex6), CrisisNLP, SWDM2013, ISCRAM13, Disaster Response Data (DRD), Disasters on Social Media (DSM), CrisisMMD, and AIDR. The purpose of this work was to map the class labels, remove duplicates, and provide benchmark results for the community.
Firoj Alam, Hassan Sajjad, Muhammad Imran and Ferda Ofli, CrisisBench: Benchmarking Crisis-related Social Media Datasets for Humanitarian Information Processing, In ICWSM, 2021. [Bibtex]
Resource details and downloading »
RESOURCE # 17 NEW
The HumAID Twitter dataset consists of several thousand manually annotated tweets collected during 19 major natural disaster events, including earthquakes, hurricanes, wildfires, and floods, which happened from 2016 to 2019 across different parts of the world.
Firoj Alam, Umair Qazi, Muhammad Imran and Ferda Ofli, HumAID: Human-Annotated Disaster Incidents Data from Twitter, In ICWSM, 2021. [Bibtex]
Resource details and downloading »
RESOURCE # 18 NEW
This resource provides access to the TBCOV dataset, which comprises more than two billion multilingual tweets related to the COVID-19 pandemic. TBCOV offers 2,014,792,896 tweets, collected using more than 800 multilingual keywords over a 14-month period, with sentiment, named-entity, geo, and gender labels.
Muhammad Imran, Umair Qazi, Ferda Ofli, TBCOV: Two Billion Multilingual COVID-19 Tweets with Sentiment, Entity, Geo, and Gender Labels, Preprint on ArXiv, October, 2021.
Resource details and downloading »
Subscribe to CrisisNLP to receive announcements about these and new resources. Follow us on Twitter: @NLP4Crisis. For inquiries, issues, feedback, or collaborations, contact: Admins