CrisisNLP

CrisisMMD: Multimodal Crisis Dataset

Description of the dataset

The CrisisMMD multimodal Twitter dataset consists of several thousand manually annotated tweets and images collected during seven major natural disasters, including earthquakes, hurricanes, wildfires, and floods, that happened in 2017 across different parts of the world. The provided datasets include three types of annotations.

Change log version 2.0: In this version of the dataset, we mapped "Not relevant or can't judge" to "Not humanitarian" for the humanitarian task. The "Not informative" label from the informative task is also mapped to "Not humanitarian" for the humanitarian task. We also removed duplicate entries that appeared while combining the tweets from different events. The informative and humanitarian tasks are now aligned and can be used for multitask classification learning.

Task 1: Informative vs Not informative
- Informative
- Not informative
- Don't know or can't judge --> removed in version 2.0
Task 2: Humanitarian categories
- Affected individuals
- Infrastructure and utility damage
- Injured or dead people
- Missing or found people
- Rescue, volunteering or donation effort
- Vehicle damage
- Other relevant information
- Not relevant or can't judge --> updated to Not humanitarian in version 2.0
Task 3: Damage severity assessment
- Severe damage
- Mild damage
- Little or no damage
- Don't know or can't judge
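The version 2.0 relabeling described in the change log can be sketched as a small function. This is only an illustrative sketch, not the authors' code: the function name `humanitarian_label_v2` is hypothetical, and the label strings are copied from the lists above, so the spelling used in the released files may differ.

```python
NOT_HUMANITARIAN = "Not humanitarian"

def humanitarian_label_v2(informative: str, humanitarian: str) -> str:
    """Return the version 2.0 humanitarian label for one annotated item.

    `informative` is the Task 1 label and `humanitarian` is the version 1.0
    Task 2 label (both argument names are assumptions for this sketch).
    """
    # Items judged "Not informative" in Task 1 become "Not humanitarian",
    # which keeps the two tasks aligned.
    if informative == "Not informative":
        return NOT_HUMANITARIAN
    # The version 1.0 catch-all label is renamed in version 2.0.
    if humanitarian == "Not relevant or can't judge":
        return NOT_HUMANITARIAN
    # Every other humanitarian label is kept unchanged.
    return humanitarian
```

For example, `humanitarian_label_v2("Informative", "Vehicle damage")` leaves the label unchanged, while any item with a "Not informative" Task 1 label maps to "Not humanitarian".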

Explore dataset

To see what the annotations look like, visit: https://aidr-dev2.qcri.org/apps/crisismmd

Datasets details:

The table below lists the keywords used to collect the tweets for this dataset, together with the data collection start and end dates for each event.

Crisis name	Keywords	Start date	End date
Hurricane Irma	Hurricane Irma, Irma storm, Storm Irma, Irma Hurricane, Irma	September 6 2017	September 21 2017
Hurricane Harvey	Hurricane Harvey, Harvey, HurricaneHarvey, Tornado	August 25 2017	September 20 2017
Hurricane Maria	Hurricane Maria, Maria Storm, Maria Cyclone, Maria Tornado, Tropical Storm Maria, HurricaneMaria, puerto rico	September 20 2017	November 13 2017
California wildfires	California fire, California wildfire, Wildfire California, USA Wildfire, California wildfires	October 10 2017	October 27 2017
Mexico earthquake	mexico earthquake, mexicoearthquake	September 20 2017	October 6 2017
Iraq-Iran earthquake	kuwait earthquake, iran earthquake, halabja earthquake, Iraq earthquake	November 13 2017	November 19 2017
Sri Lanka floods	flood Sri Lanka, FloodSL, SriLanka flooding, SriLanka floods, SriLanka flood, typhoon mora, cyclone mora, mora, CycloneMora	May 31 2017	July 3 2017

Event-wise data distribution

For each event, we collected tweets and their associated images, then filtered and sampled them for annotation.

Data distribution from the CrisisMMD version v1.0 [2]
Crisis name	# tweets	# images	# filtered tweets	# sampled tweets	# sampled images
Hurricane Irma	3,517,280	176,972	5,739	4,041	4,525
Hurricane Harvey	6,664,349	321,435	19,967	4,000	4,443
Hurricane Maria	2,953,322	52,231	6,597	4,000	4,562
California wildfires	455,311	10,130	1,488	1,486	1,589
Mexico earthquake	383,341	7,111	1,241	1,239	1,382
Iraq-Iran earthquake	207,729	6,307	501	499	600
Sri Lanka floods	41,809	2,108	870	832	1,025
Total	14,223,141	576,294	36,403	16,097	18,126

Data preparation for multimodal baseline

For the multimodal baseline experiments, we first combined the tweet texts and images from all events. This resulted in 24 duplicate entries (tweet ids: text and associated images). We manually checked these duplicate entries and kept the ones that were annotated properly. We changed the label “Not relevant or can’t judge” to “Not humanitarian”. In addition, as the annotation includes the label “Don't know or can't judge”, we removed the items carrying it for the classification experiments. Overall, this preprocessing filtered out 39 tweets and 44 associated images. The resulting dataset consists of 16058 tweet texts and 18082 images, as shown in the following table. This version of the dataset is released as version 2.0 and is available for download.
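The preprocessing described above can be approximated with a short sketch. It is not the authors' pipeline: the field names `tweet_id`, `image_id`, and `label` are hypothetical, and duplicates are resolved here by keeping the first copy, whereas the authors checked each duplicate manually.

```python
from typing import Dict, Iterable, List, Set, Tuple

DROPPED_LABEL = "Don't know or can't judge"

def preprocess(rows: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Drop repeated (tweet_id, image_id) entries and unjudged items."""
    seen: Set[Tuple[str, str]] = set()
    kept: List[Dict[str, str]] = []
    for row in rows:
        key = (row["tweet_id"], row["image_id"])
        if key in seen:
            continue  # assumption: keep the first copy of a duplicate entry
        seen.add(key)
        if row["label"] == DROPPED_LABEL:
            continue  # removed for the classification experiments
        kept.append(row)
    return kept

# Counts reported on this page: version 1.0 sampled totals minus filtered items.
V2_TWEETS = 16097 - 39  # 16058 tweet texts
V2_IMAGES = 18126 - 44  # 18082 images
```

The two constants at the end only restate the arithmetic behind the version 2.0 totals; they are not computed from the data files.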

Data distribution from the CrisisMMD version v2.0 [1]
Informativeness
	Text	Image
Informative	11509	9374
Not informative	4549	8708
Total	16058	18082
Humanitarian
Affected individuals	472	562
Infrastructure and utility damage	1210	3624
Injured or dead people	486	110
Missing or found people	40	14
Not humanitarian	4549	8708
Other relevant information	5954	2529
Rescue, volunteering or donation effort	3293	2231
Vehicle damage	54	304
Total	16058	18082
Damage Severity
Little or no damage	-	475
Mild damage	-	839
Severe damage	-	2212
Total	-	3526
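As a quick sanity check, the text-side humanitarian counts in the table above can be summed and compared. This is a minimal sketch with the counts copied by hand from the table; the variable names are illustrative only.

```python
# Text-side humanitarian counts copied from the version 2.0 table above.
text_counts = {
    "Affected individuals": 472,
    "Infrastructure and utility damage": 1210,
    "Injured or dead people": 486,
    "Missing or found people": 40,
    "Not humanitarian": 4549,
    "Other relevant information": 5954,
    "Rescue, volunteering or donation effort": 3293,
    "Vehicle damage": 54,
}

total = sum(text_counts.values())

# Fraction of all tweet texts that falls into each category.
shares = {label: count / total for label, count in text_counts.items()}

largest = max(text_counts, key=text_counts.get)
smallest = min(text_counts, key=text_counts.get)

print(f"total = {total}")
print(f"largest class:  {largest} ({shares[largest]:.1%})")
print(f"smallest class: {smallest} ({shares[smallest]:.1%})")
```

The sum matches the reported total of 16058, and the gap between the largest and smallest classes shows how unevenly the humanitarian labels are distributed.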

Social media data poses great challenges, and one of the major challenges is multimodal data in which there is no strong alignment between modalities (i.e., the text and image of the same pair can have different labels). For this study, as a preliminary experiment, we selected only the pairs with agreed labels. This resulted in a skewed label distribution for model training. To deal with this issue, we combined minority categories that are semantically similar or related. Specifically, we merged the “injured or dead people” and “missing or found people” categories into the “affected individuals” category. Similarly, we merged the “vehicle damage” category into the “infrastructure and utility damage” category. As a result, we are left with five categories for the humanitarian task. Please see our paper “Analysis of Social Media Data using Multimodal Deep Learning for Disaster Response” below to learn more. For future research, we release the data splits with and without agreed labels. The data split files (see link below) contain the labels and the associated information. For the images, please download the CrisisMMD dataset.
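The agreed-label selection and category merging described above can be sketched as follows. This is an illustrative sketch rather than the authors' code: the field names `text_label` and `image_label` are hypothetical, and the label strings follow the spelling used on this page, which may differ from the released split files.

```python
from typing import Dict, Iterable, List

# Minority categories folded into semantically related ones.
MERGE_MAP = {
    "Injured or dead people": "Affected individuals",
    "Missing or found people": "Affected individuals",
    "Vehicle damage": "Infrastructure and utility damage",
}

def merge_label(label: str) -> str:
    """Map a humanitarian label to its merged category."""
    return MERGE_MAP.get(label, label)

def agreed_and_merged(pairs: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep pairs whose text and image labels agree, then merge categories."""
    result = []
    for pair in pairs:
        if pair["text_label"] != pair["image_label"]:
            continue  # text and image disagree, so the pair is skipped
        merged = merge_label(pair["text_label"])
        result.append({**pair, "text_label": merged, "image_label": merged})
    return result
```

Applying `merge_label` to the eight version 2.0 humanitarian labels leaves five distinct categories, which matches the count stated above.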

Downloads: Labeled data and other resources

CrisisMMD dataset version v2.0: Labeled images & tweets (~1.8GB)
Datasplit: Annotations
Datasplit for multimodal baseline results with agreed labels (see [1]): Annotations

CrisisMMD dataset version v1.0: Labeled images & tweets (~1.8GB) Tweet-ids (79M)

Please cite the following papers if you use any of these resources in your research.

[1] Ferda Ofli, Firoj Alam, and Muhammad Imran, Analysis of Social Media Data using Multimodal Deep Learning for Disaster Response, In Proceedings of the 17th International Conference on Information Systems for Crisis Response and Management (ISCRAM), 2020, USA.
[2] Firoj Alam, Ferda Ofli, and Muhammad Imran, CrisisMMD: Multimodal Twitter Datasets from Natural Disasters, In Proceedings of the 12th International AAAI Conference on Web and Social Media (ICWSM), 2018, Stanford, California, USA.