CrisisMMD: Multimodal Crisis Dataset

Description of the dataset

The CrisisMMD multimodal Twitter dataset consists of several thousand manually annotated tweets and images collected during seven major natural disasters, including earthquakes, hurricanes, wildfires, and floods, that occurred in 2017 across different parts of the world. The provided datasets include three types of annotations.

Change log, version 2.0: In this version of the dataset, we mapped "Not relevant or can't judge" to "Not humanitarian" for the humanitarian task. The "Not informative" label from the informative task is also mapped to "Not humanitarian" for the humanitarian task. We also removed duplicate entries that appeared when combining the tweets from different events. The informative and humanitarian tasks are now aligned and can be used for multitask classification.
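The version 2.0 label alignment can be sketched as a small mapping function. This is a minimal illustration and not the authors' released code: the function name align_humanitarian_label and the exact label strings are assumptions based on the labels named in the change log.

```python
# Target label for everything that version 2.0 folds together
# (string assumed from the change log text).
NOT_HUMANITARIAN = "Not humanitarian"


def align_humanitarian_label(informative_label, humanitarian_label):
    """Return the humanitarian label after the version 2.0 alignment.

    Hypothetical helper: both arguments are plain label strings.
    """
    # An item judged "Not informative" is always "Not humanitarian".
    if informative_label == "Not informative":
        return NOT_HUMANITARIAN
    # "Not relevant or can't judge" is mapped to "Not humanitarian".
    if humanitarian_label == "Not relevant or can't judge":
        return NOT_HUMANITARIAN
    # Every other humanitarian label is kept unchanged.
    return humanitarian_label
```

With this alignment, every item that is "Not informative" in one task is "Not humanitarian" in the other, which is what allows the two tasks to share training examples.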

Explore dataset

Please check what the annotations look like:

Dataset details:

Keywords used to collect the tweets for this dataset with data collection start and end dates for each event.
Crisis name | Keywords | Start date | End date
Hurricane Irma | Hurricane Irma, Irma storm, Storm Irma, Irma Hurricane, Irma | September 6 2017 | September 21 2017
Hurricane Harvey | Hurricane Harvey, Harvey, HurricaneHarvey, Tornado | August 25 2017 | September 20 2017
Hurricane Maria | Hurricane Maria, Maria Storm, Maria Cyclone, Maria Tornado, Tropical Storm Maria, HurricaneMaria, puerto rico | September 20 2017 | November 13 2017
California wildfires | California fire, California wildfire, Wildfire California, USA Wildfire, California wildfires | October 10 2017 | October 27 2017
Mexico earthquake | mexico earthquake, mexicoearthquake | September 20 2017 | October 6 2017
Iraq-Iran earthquake | kuwait earthquake, iran earthquake, halabja earthquake, Iraq earthquake | November 13 2017 | November 19 2017
Sri Lanka floods | flood Sri Lanka, FloodSL, SriLanka flooding, SriLanka floods, SriLanka flood, typhoon mora, cyclone mora, mora, CycloneMora | May 31 2017 | July 3 2017

Event-wise data distribution

For each event, we collected tweets and their associated images, then filtered and sampled them for annotation.

Data distribution of CrisisMMD version 1.0 [2]
Crisis name | # tweets | # images | # filtered tweets | # sampled tweets | # sampled images
Hurricane Irma | 3,517,280 | 176,972 | 5,739 | 4,041 | 4,525
Hurricane Harvey | 6,664,349 | 321,435 | 19,967 | 4,000 | 4,443
Hurricane Maria | 2,953,322 | 52,231 | 6,597 | 4,000 | 4,562
California wildfires | 455,311 | 10,130 | 1,488 | 1,486 | 1,589
Mexico earthquake | 383,341 | 7,111 | 1,241 | 1,239 | 1,382
Iraq-Iran earthquake | 207,729 | 6,307 | 501 | 499 | 600
Sri Lanka floods | 41,809 | 2,108 | 870 | 832 | 1,025
Total | 14,223,141 | 576,294 | 36,403 | 16,097 | 18,126

Data preparation for multimodal baseline

For the multimodal baseline experiments, we first combined the tweet texts and images from all events. This produced 24 duplicate entries (tweet IDs with their text and associated images). We manually checked these duplicates and kept the ones that were annotated properly. We changed the label “Not relevant or can’t judge” to “Not humanitarian”. In addition, because the annotations include the label “Don’t know or can’t judge”, we removed the entries with that label for the classification experiments. In total, this preprocessing filtered out 39 tweets and their 44 associated images. The resulting dataset consists of 16,058 tweet texts and 18,082 images, as shown in the following table. This version of the dataset is released as version 2.0 and is available for download.
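The preprocessing steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: the record keys tweet_id, image_id, and label are hypothetical, the label spellings are assumed, and duplicates are resolved here by keeping the first occurrence, whereas the authors resolved them by manual inspection.

```python
# Label spellings are assumptions; adjust them to match the annotation files.
DONT_KNOW = "Don't know or can't judge"
NOT_RELEVANT = "Not relevant or can't judge"


def preprocess(records):
    """Deduplicate, relabel, and filter a list of annotation records.

    Each record is a dict with hypothetical keys
    'tweet_id', 'image_id', and 'label'.
    """
    seen = set()
    kept = []
    for rec in records:
        key = (rec["tweet_id"], rec["image_id"])
        # Skip repeated (tweet, image) pairs; the first occurrence wins here.
        if key in seen:
            continue
        seen.add(key)
        label = rec["label"]
        # Entries the annotators could not judge are removed.
        if label == DONT_KNOW:
            continue
        # "Not relevant or can't judge" becomes "Not humanitarian".
        if label == NOT_RELEVANT:
            label = "Not humanitarian"
        kept.append({**rec, "label": label})
    return kept


# The reported counts are consistent with the version 1.0 sample sizes:
# 16,097 sampled tweets minus 39 removed, 18,126 sampled images minus 44 removed.
remaining_tweets = 16097 - 39
remaining_images = 18126 - 44
```

Here remaining_tweets and remaining_images evaluate to 16058 and 18082, matching the totals reported for version 2.0.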

Data distribution of CrisisMMD version 2.0 [1]
Informative task | Text | Image
Informative | 11,509 | 9,374
Not informative | 4,549 | 8,708
Total | 16,058 | 18,082

Humanitarian task | Text | Image
Affected individuals | 472 | 562
Infrastructure and utility damage | 1,210 | 3,624
Injured or dead people | 486 | 110
Missing or found people | 40 | 14
Not humanitarian | 4,549 | 8,708
Other relevant information | 5,954 | 2,529
Rescue volunteering or donation effort | 3,293 | 2,231
Vehicle damage | 54 | 304
Total | 16,058 | 18,082
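As a quick consistency check, the humanitarian counts in the table can be summed and compared with the reported totals. The snippet below is only a sketch; the dictionary names are illustrative and the numbers are copied from the table above.

```python
# Per-label counts for the humanitarian task, copied from the table.
humanitarian_text = {
    "Affected individuals": 472,
    "Infrastructure and utility damage": 1210,
    "Injured or dead people": 486,
    "Missing or found people": 40,
    "Not humanitarian": 4549,
    "Other relevant information": 5954,
    "Rescue volunteering or donation effort": 3293,
    "Vehicle damage": 54,
}
humanitarian_image = {
    "Affected individuals": 562,
    "Infrastructure and utility damage": 3624,
    "Injured or dead people": 110,
    "Missing or found people": 14,
    "Not humanitarian": 8708,
    "Other relevant information": 2529,
    "Rescue volunteering or donation effort": 2231,
    "Vehicle damage": 304,
}
# Counts for the informative task, copied from the same table.
informative_text = {"Informative": 11509, "Not informative": 4549}
informative_image = {"Informative": 9374, "Not informative": 8708}

text_total = sum(humanitarian_text.values())
image_total = sum(humanitarian_image.values())

# Both tasks cover the same items, so their totals must agree.
assert text_total == sum(informative_text.values()) == 16058
assert image_total == sum(informative_image.values()) == 18082
# After the version 2.0 alignment, "Not informative" and
# "Not humanitarian" describe the same set of items.
assert informative_text["Not informative"] == humanitarian_text["Not humanitarian"]
assert informative_image["Not informative"] == humanitarian_image["Not humanitarian"]
```

The equal "Not informative" and "Not humanitarian" counts reflect the alignment of the two tasks in version 2.0.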

Social media data poses great challenges, and one of the major challenges is multimodal data in which there is no strong alignment between modalities (i.e., the text and image of the same pair can have different labels). For this study, as a preliminary experiment, we selected only the pairs with agreed labels. This resulted in a skewed label distribution for model training. To deal with this issue, we combined minority categories that are semantically similar or relevant. Specifically, we merged the “injured or dead people” and “missing or found people” categories into the “affected individuals” category. Similarly, we merged the “vehicle damage” category into the “infrastructure and utility damage” category. As a result, we are left with five categories for the humanitarian task. Please see our paper “Analysis of Social Media Data using Multimodal Deep Learning for Disaster Response” below to learn more. For future research, we release the data splits with and without agreed labels. The data split files (see link below) contain the labels and associated information. For the images, please download the CrisisMMD dataset.
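The agreed-label selection and the category merge can be sketched as below. This is a minimal illustration and not the released split-generation code: the names MERGE_MAP, merge_label, and agreed_and_merged are hypothetical, the label capitalization follows the table rather than the annotation files, and applying the agreement filter before the merge is an assumption drawn from the order of the description.

```python
# Minority categories folded into semantically related ones.
MERGE_MAP = {
    "Injured or dead people": "Affected individuals",
    "Missing or found people": "Affected individuals",
    "Vehicle damage": "Infrastructure and utility damage",
}


def merge_label(label):
    """Map a humanitarian label to its merged category (unchanged if not merged)."""
    return MERGE_MAP.get(label, label)


def agreed_and_merged(pairs):
    """Keep (text_label, image_label) pairs that agree, then merge the label.

    Returns one merged label per retained tweet-image pair.
    """
    return [merge_label(text) for text, image in pairs if text == image]


# The eight humanitarian labels collapse to five merged categories.
ALL_LABELS = [
    "Affected individuals",
    "Infrastructure and utility damage",
    "Injured or dead people",
    "Missing or found people",
    "Not humanitarian",
    "Other relevant information",
    "Rescue volunteering or donation effort",
    "Vehicle damage",
]
MERGED_CATEGORIES = sorted({merge_label(label) for label in ALL_LABELS})
```

For example, a pair labeled ("Vehicle damage", "Vehicle damage") is retained and becomes "Infrastructure and utility damage", while a pair labeled ("Vehicle damage", "Not humanitarian") is dropped from the agreed-label split.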

Downloads: Labeled data and other resources

Please cite the following papers if you use any of these resources in your research.
  1. Ferda Ofli, Firoj Alam, and Muhammad Imran, Analysis of Social Media Data using Multimodal Deep Learning for Disaster Response, In Proceedings of the 17th International Conference on Information Systems for Crisis Response and Management (ISCRAM), 2020, USA.
  2. Firoj Alam, Ferda Ofli, and Muhammad Imran, CrisisMMD: Multimodal Twitter Datasets from Natural Disasters, In Proceedings of the 12th International AAAI Conference on Web and Social Media (ICWSM), 2018, Stanford, California, USA.