CrisisNLP

Resources for Research on Crisis Informatics Topics

The CrisisNLP repository, maintained by the Crisis Computing team at the Qatar Computing Research Institute, provides curated datasets, tools, and benchmarks that enable researchers and practitioners to build computational models and practical systems in support of humanitarian aid and crisis response.

RESOURCE # 1

This resource consists of Twitter data collected during 19 natural and human-induced disasters. Each dataset contains tweet-ids and human-labeled tweets from the event. In addition, it includes a dictionary of out-of-vocabulary (OOV) words, a word2vec model, and a tweet downloader tool. Please cite the following paper if you use any of these resources in your research.

Muhammad Imran, Prasenjit Mitra, and Carlos Castillo. Twitter as a Lifeline: Human-annotated Twitter Corpora for NLP of Crisis-related Messages. In Proceedings of the 10th Language Resources and Evaluation Conference (LREC), pp. 1638-1643. May 2016, Portorož, Slovenia. [Bibtex]

Resource details and downloading »

RESOURCE # 2

This resource consists of human-labeled tweets collected during the 2012 Hurricane Sandy and the 2011 Joplin tornado. Please cite the following paper if you use this resource in your research.

Muhammad Imran, Shady Elbassuoni, Carlos Castillo, Fernando Diaz, and Patrick Meier. Practical Extraction of Disaster-Relevant Information from Social Media. In Proceedings of the 22nd International Conference on World Wide Web Companion, May 2013, Rio de Janeiro, Brazil. [Bibtex]

Dataset

RESOURCE # 3

This resource consists of human-labeled tweets collected during the 2011 Joplin tornado and categorized into humanitarian classes. Please cite the following paper if you use this resource in your research.

Muhammad Imran, Shady Elbassuoni, Carlos Castillo, Fernando Diaz, and Patrick Meier. Extracting Information Nuggets from Disaster-Related Messages in Social Media. In Proceedings of the 10th International Conference on Information Systems for Crisis Response and Management (ISCRAM), May 2013, Baden-Baden, Germany. [Bibtex]

Dataset

RESOURCE # 4

This resource provides a ready-to-use Python implementation of several neural network and non-neural network based classifiers for the classification of crisis-related Twitter data. Please cite the following paper if you use this resource in your research.

Dat Tien Nguyen, Kamela Ali Al-Mannai, Shafiq Joty, Hassan Sajjad, Muhammad Imran, Prasenjit Mitra. Robust Classification of Crisis-Related Data on Social Networks using Convolutional Neural Networks. In Proceedings of the 11th International AAAI Conference on Web and Social Media (ICWSM), 2017, Montreal, Canada.

Resource details and downloading »

RESOURCE # 5

This resource provides human-labeled multimodal datasets comprising tweets and images collected during seven major natural disasters. Please cite the following paper if you use this resource in your research.

Firoj Alam, Ferda Ofli, and Muhammad Imran. CrisisMMD: Multimodal Twitter Datasets from Natural Disasters. In Proceedings of the 12th International AAAI Conference on Web and Social Media (ICWSM), 2018, Stanford, California, USA. [Bibtex]

Resource details and downloading »

RESOURCE # 6

This resource consists of tweet-ids and a sample of raw tweets (50k) collected during three devastating hurricanes in 2017, namely Hurricane Harvey, Hurricane Irma, and Hurricane Maria.

Firoj Alam, Ferda Ofli, Muhammad Imran, Michael Aupetit. A Twitter Tale of Three Hurricanes: Harvey, Irma, and Maria. In Proceedings of the 15th International Conference on Information Systems for Crisis Response and Management (ISCRAM), May 2018, Rochester NY, USA. [Bibtex]

Dataset (~64MB)

RESOURCE # 7

This resource consists of human-labeled tweets collected during the 2015 Nepal earthquake and the 2013 Queensland floods.

Firoj Alam, Shafiq Joty, Muhammad Imran. Domain Adaptation with Adversarial Training and Graph Embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), 2018, Melbourne, Australia. [Bibtex]

Dataset (~7MB)

RESOURCE # 8

This resource is a Java-based tool that downloads full tweet content using tweet-ids. The tool can make 180 API calls per 15 minutes, and each API call retrieves up to 100 tweets, enabling it to download up to 72,000 tweets per hour.
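The throughput figure above follows directly from the stated rate limits. The short Python sketch below reproduces that arithmetic and shows how a list of tweet-ids could be split into batches of at most 100. It is not the code of the Java tool; the function names `max_tweets_per_hour`, `batch_ids`, and `estimated_hours` are illustrative assumptions.

```python
from typing import Iterator, List

# Figures quoted in the resource description.
CALLS_PER_WINDOW = 180   # API calls allowed per rate-limit window
WINDOW_MINUTES = 15      # length of one rate-limit window, in minutes
TWEETS_PER_CALL = 100    # maximum number of tweets retrieved per call


def max_tweets_per_hour() -> int:
    """Upper bound on tweets retrievable in one hour under these limits."""
    windows_per_hour = 60 // WINDOW_MINUTES
    return CALLS_PER_WINDOW * TWEETS_PER_CALL * windows_per_hour


def batch_ids(tweet_ids: List[str], size: int = TWEETS_PER_CALL) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` tweet-ids (hypothetical helper)."""
    for start in range(0, len(tweet_ids), size):
        yield tweet_ids[start:start + size]


def estimated_hours(n_ids: int) -> float:
    """Rough lower bound on download time, ignoring network latency and errors."""
    return n_ids / max_tweets_per_hour()


if __name__ == "__main__":
    print(max_tweets_per_hour())   # 180 * 100 * 4 = 72000
    print(estimated_hours(144_000))  # 2.0
```

Under these assumptions, a collection of one million tweet-ids would need roughly 14 hours at the maximum rate, which is worth keeping in mind when planning to rehydrate the larger datasets listed on this page.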

Tool

RESOURCE # 9

This corpus comprises images collected from Twitter during four natural disasters, namely Typhoon Ruby (2014), Nepal Earthquake (2015), Ecuador Earthquake (2016), and Hurricane Matthew (2016). In addition to Twitter images, it includes images collected from Google using queries such as "damage building", "damage bridge", and "damage road" to address the scarcity of labeled data.

Dat Tien Nguyen, Ferda Ofli, Muhammad Imran, Prasenjit Mitra. Damage Assessment from Social Media Imagery Data During Disasters. In Proceedings of the IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), 2017, Sydney, Australia. [Bibtex]

Labeled images (~5GB)

RESOURCE # 10

This resource consists of human-labeled tweets collected during the 2015 Nepal earthquake and the 2013 Queensland floods.

Firoj Alam, Shafiq Joty, Muhammad Imran. Graph Based Semi-supervised Learning with Convolutional Neural Networks to Classify Crisis Related Tweets. In Proceedings of the International AAAI Conference on Web and Social Media (ICWSM), 2018, Stanford, California, USA.

Dataset (~7MB)

RESOURCE # 11

This resource consists of data related to the 2018 California wildfires (a.k.a. Camp Fire). Specifically, it contains (1) the names of missing and found people, (2) the web sources from which the names were taken, and (3) hashtags related to missing, lost, and found people. Tweet-ids will be published soon.

Humaira Waqas, Muhammad Imran. #CampFireMissing: An Analysis of Tweets About Missing and Found People during California Wildfires. In Proceedings of the 16th International Conference on Information Systems for Crisis Response and Management (ISCRAM), 2019, Valencia, Spain.

Tweet-ids
Meta-data

RESOURCE # 12

This resource consists of ~14,000 labeled tweets collected during several natural disasters, including hurricanes, earthquakes, floods, and forest fires. The tweets were annotated following the eyewitness taxonomy defined in the paper below.

Kiran Zahra, Muhammad Imran, Frank Ostermann. Automatic Identification of Eyewitness Messages on Twitter During Disasters. In the Journal of Information Processing & Management (IP&M), 2020.

Dataset

RESOURCE # 13

This resource contains images and expert annotations for detecting damaged heritage sites. The images were downloaded from Google and annotated under two schemes: (1) whether an image shows a heritage site, and (2) whether a heritage-site image shows damage.

Pakhee Kumar, Ferda Ofli, Muhammad Imran, Carlos Castillo. Detection of Disaster-Affected Cultural Heritage Sites from Social Media Images Using Deep Learning Techniques. Accepted in the ACM Journal on Computing and Cultural Heritage (JOCCH), 2020.

Resource details and downloading »

RESOURCE # 14

This resource contains more than 500 million tweets related to the COVID-19 pandemic. The geographic coverage of the dataset spans 218 countries and 47K cities around the globe, and it covers 62 international languages.

Umair Qazi, Muhammad Imran, Ferda Ofli. GeoCoV19: A Dataset of Hundreds of Millions of Multilingual COVID-19 Tweets with Location Information. To appear in ACM SIGSPATIAL Special, May 2020. (arXiv)

Resource details and downloading »

RESOURCE # 15

The crisis image benchmark dataset consists of data from several sources, such as Disasters on Social Media (DSM), CrisisMMD, and data from AIDR. The purpose of this work was to develop a consolidated dataset, create non-overlapping train/dev/test splits, and provide benchmark results for the community.

Firoj Alam, Ferda Ofli, Muhammad Imran, Tanvirul Alam, Umair Qazi. Deep Learning Benchmarks and Datasets for Social Media Image Classification for Disaster Response. In IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), 2020. [Bibtex]

Resource details and downloading »

RESOURCE # 16

The crisis benchmark dataset consists of data from several different sources, such as CrisisLex (CrisisLex26, CrisisLex6), CrisisNLP, SWDM2013, ISCRAM13, Disaster Response Data (DRD), Disasters on Social Media (DSM), CrisisMMD, and data from AIDR. The purpose of this work was to map the class labels, remove duplicates, and provide benchmark results for the community.
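The two consolidation steps named above, mapping class labels and removing duplicates, can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the label names in `LABEL_MAP`, the normalization rules, and the helper names `normalize` and `consolidate` are all hypothetical, and exact matching on normalized text is only one simple way to detect duplicates.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Hypothetical mapping from source-specific labels to one shared label set.
# These label names are illustrative only, not the benchmark's actual classes.
LABEL_MAP: Dict[str, str] = {
    "infrastructure damage": "infrastructure_and_utilities_damage",
    "damage to infrastructure": "infrastructure_and_utilities_damage",
    "donations": "donation_and_volunteering",
    "volunteering or donation": "donation_and_volunteering",
}


def normalize(text: str) -> str:
    """Lowercase, drop URLs and user mentions, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"@\w+", " ", text)
    return " ".join(text.split())


def consolidate(records: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Map labels to the shared set and keep the first copy of each tweet text."""
    seen = set()
    merged: List[Tuple[str, str]] = []
    for text, label in records:
        key = normalize(text)
        if key in seen:
            continue  # same text already seen in this or another source
        seen.add(key)
        # Labels without a mapping are passed through unchanged.
        merged.append((text, LABEL_MAP.get(label.lower(), label)))
    return merged
```

Removing duplicates before creating data splits matters because the same tweet often appears in several source collections, and a copy that lands in both training and test data would inflate benchmark scores.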

Firoj Alam, Hassan Sajjad, Muhammad Imran, and Ferda Ofli. CrisisBench: Benchmarking Crisis-related Social Media Datasets for Humanitarian Information Processing. In Proceedings of the International AAAI Conference on Web and Social Media (ICWSM), 2021. [Bibtex]

Resource details and downloading »

RESOURCE # 17

The HumAID Twitter dataset consists of several thousand manually annotated tweets collected during 19 major natural disaster events, including earthquakes, hurricanes, wildfires, and floods, which occurred from 2016 to 2019 across different parts of the world.

Firoj Alam, Umair Qazi, Muhammad Imran, and Ferda Ofli. HumAID: Human-Annotated Disaster Incidents Data from Twitter. In Proceedings of the International AAAI Conference on Web and Social Media (ICWSM), 2021. [Bibtex]

Resource details and downloading »

RESOURCE # 18

This resource provides access to the TBCOV dataset, which comprises more than two billion multilingual tweets related to the COVID-19 pandemic. TBCOV offers 2,014,792,896 tweets collected over a 14-month period using more than 800 multilingual keywords, enriched with sentiment, named entity, geo, and gender labels.

Muhammad Imran, Umair Qazi, Ferda Ofli. TBCOV: Two Billion Multilingual COVID-19 Tweets with Sentiment, Entity, Geo, and Gender Labels. Preprint on arXiv, October 2021.

Resource details and downloading »

RESOURCE # 19

This resource provides access to "MEDIC", a multi-task learning dataset for disaster-related image classification. MEDIC is an extended version of the crisis image benchmark dataset. It consists of data from several sources, including CrisisMMD, data from AIDR, and the Damage Multimodal Dataset (DMD). The dataset contains 71,198 images.

Firoj Alam, Tanvirul Alam, Md. Arid Hasan, Abul Hasnat, Muhammad Imran, Ferda Ofli. MEDIC: A Multi-Task Learning Dataset for Disaster Image Classification. Neural Computing and Applications, 35(3):2609–2632, 2023.

Resource details and downloading »

RESOURCE # 20

This resource provides access to "Incidents1M", a large-scale multi-label dataset for the recognition of natural disasters, damage, and incidents from images. The dataset contains approximately 1 million images annotated across 43 incident categories (e.g., earthquake, flood, wildfire, landslide, building collapse) and 49 place categories, with both positive and negative labels to mitigate false positives. It is designed to enable training of robust visual models for large-scale incident detection in social media and web imagery.

Ethan Weber, Dim P. Papadopoulos, Agata Lapedriza, Ferda Ofli, Muhammad Imran, Antonio Torralba. Incidents1M: A Large-Scale Dataset of Images with Natural Disasters, Damage, and Incidents. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 45(4):4768-4781, 2022.

Project page
Code & dataset

RESOURCE # 21

This resource provides access to "IDRISI", a family of large-scale datasets and benchmarks for Location Mention Recognition (LMR) on disaster-related tweets. IDRISI-RE is a generalizable English dataset spanning multiple disaster types and events, while IDRISI-RA is the first Arabic LMR dataset for disaster tweets. The datasets are designed to enable training and evaluation of models that extract fine-grained toponyms from social media to support situational awareness and humanitarian response.

Reem Suwaileh, Tamer Elsayed, Muhammad Imran. IDRISI-RE: A Generalizable Dataset with Benchmarks for Location Mention Recognition on Disaster Tweets. Information Processing & Management, 60(3):103340, 2023.

Reem Suwaileh, Muhammad Imran, Tamer Elsayed. IDRISI-RA: The First Arabic Location Mention Recognition Dataset of Disaster Tweets. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), 2023, Toronto, Canada.

Code & dataset

RESOURCE # 22 NEW

This resource provides access to "DisasterVQA", a visual question answering benchmark dataset for disaster scenes. It pairs disaster imagery with natural-language questions and answers that probe disaster type, affected entities, severity, and humanitarian relevance, enabling the evaluation of vision-language models on disaster understanding tasks beyond simple image classification.

Aisha Al-Mohannadi, Ayisha Firoz, Yin Yang, Muhammad Imran, Ferda Ofli. DisasterVQA: A Visual Question Answering Benchmark Dataset for Disaster Scenes. In Proceedings of the 20th International AAAI Conference on Web and Social Media (ICWSM), 2026.

Dataset

Please carefully read our Terms of use before using resources available on this site.

Subscribe to CrisisNLP to receive announcements about these and new resources.
Follow us on Twitter: @NLP4Crisis
For inquiries, issues, feedback, or collaborations, contact: Admins