CrisisMMD: Multimodal Crisis Dataset
Description of the dataset
The CrisisMMD multimodal Twitter dataset consists of several thousand manually annotated tweets and images collected during seven major natural disasters, including earthquakes, hurricanes, wildfires, and floods, that happened in 2017 in different parts of the world. The provided datasets include three types of annotations.
Change log version 2.0: In this version of the dataset, we mapped "Not relevant or can't judge" to "Not humanitarian" for the humanitarian task. The "Not informative" label from the informative task is also mapped to "Not humanitarian" for the humanitarian task. We also removed duplicate entries that appeared when combining the tweets from different events. The informative and humanitarian tasks are now aligned and can be used for multitask classification learning.
- Task 1: Informative vs Not informative
  - Informative
  - Not informative
  - Don't know or can't judge (removed in version 2.0)
- Task 2: Humanitarian categories
  - Affected individuals
  - Infrastructure and utility damage
  - Injured or dead people
  - Missing or found people
  - Rescue, volunteering or donation effort
  - Vehicle damage
  - Other relevant information
  - Not relevant or can't judge (updated to Not humanitarian in version 2.0)
- Task 3: Damage severity assessment
  - Severe damage
  - Mild damage
  - Little or no damage
  - Don't know or can't judge
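The version 2.0 label changes described in the change log can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the label strings are copied from the lists above, but the record layout (a dict with `informative` and `humanitarian` keys) and the function name `to_v2` are hypothetical and do not reflect the column names of the released files.

```python
# Label strings are copied from the task lists above; the record layout
# (a dict with "informative" and "humanitarian" keys) is hypothetical.
DONT_KNOW = "Don't know or can't judge"
NOT_INFORMATIVE = "Not informative"
NOT_RELEVANT = "Not relevant or can't judge"
NOT_HUMANITARIAN = "Not humanitarian"


def to_v2(record):
    """Return the version 2.0 form of a version 1.0 record, or None if removed."""
    if record["informative"] == DONT_KNOW:
        return None  # this Task 1 label is removed in version 2.0
    updated = dict(record)
    if (record["informative"] == NOT_INFORMATIVE
            or record["humanitarian"] == NOT_RELEVANT):
        # Both cases are mapped to "Not humanitarian" so the two tasks align.
        updated["humanitarian"] = NOT_HUMANITARIAN
    return updated


# Toy version 1.0 records, made up for illustration.
v1_records = [
    {"informative": "Informative", "humanitarian": "Vehicle damage"},
    {"informative": "Not informative", "humanitarian": "Other relevant information"},
    {"informative": "Informative", "humanitarian": "Not relevant or can't judge"},
    {"informative": "Don't know or can't judge", "humanitarian": "Not relevant or can't judge"},
]
v2_records = [r for r in map(to_v2, v1_records) if r is not None]
for r in v2_records:
    print(r["informative"], "->", r["humanitarian"])
```

After this mapping, every "Not informative" record is also "Not humanitarian", which is what allows the two tasks to be trained jointly.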
Explore dataset
To see what the annotations look like, please visit: https://aidr-dev2.qcri.org/apps/crisismmd
Datasets details:
Keywords used to collect the tweets for this dataset, with data collection start and end dates for each event.
Crisis name | Keywords | Start date | End date |
---|---|---|---|
Hurricane Irma | Hurricane Irma, Irma storm, Storm Irma, Irma Hurricane, Irma | September 6 2017 | September 21 2017 |
Hurricane Harvey | Hurricane Harvey, Harvey, HurricaneHarvey, Tornado | August 25 2017 | September 20 2017 |
Hurricane Maria | Hurricane Maria, Maria Storm, Maria Cyclone, Maria Tornado, Tropical Storm Maria, HurricaneMaria, puerto rico | September 20 2017 | November 13 2017 |
California wildfires | California fire, California wildfire, Wildfire California, USA Wildfire, California wildfires | October 10 2017 | October 27 2017 |
Mexico earthquake | mexico earthquake, mexicoearthquake | September 20 2017 | October 6 2017 |
Iraq-Iran earthquake | kuwait earthquake, iran earthquake, halabja earthquake, Iraq earthquake | November 13 2017 | November 19 2017 |
Sri Lanka floods | flood Sri Lanka, FloodSL, SriLanka flooding, SriLanka floods, SriLanka flood, typhoon mora, cyclone mora, mora, CycloneMora | May 31 2017 | July 3 2017 |
Event-wise data distribution
For each event, we collected tweets and their associated images, then filtered and sampled them for annotation.
Crisis name | # tweets | # images | # filtered tweets | # sampled tweets | # sampled images |
---|---|---|---|---|---|
Hurricane Irma | 3,517,280 | 176,972 | 5,739 | 4,041 | 4,525 |
Hurricane Harvey | 6,664,349 | 321,435 | 19,967 | 4,000 | 4,443 |
Hurricane Maria | 2,953,322 | 52,231 | 6,597 | 4,000 | 4,562 |
California wildfires | 455,311 | 10,130 | 1,488 | 1,486 | 1,589 |
Mexico earthquake | 383,341 | 7,111 | 1,241 | 1,239 | 1,382 |
Iraq-Iran earthquake | 207,729 | 6,307 | 501 | 499 | 600 |
Sri Lanka floods | 41,809 | 2,108 | 870 | 832 | 1,025 |
Total | 14,223,141 | 576,294 | 36,403 | 16,097 | 18,126 |
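As a quick sanity check, the per-event counts in the table above can be summed and compared with the reported totals. The sketch below is a minimal illustration: the numbers are transcribed from the table, and the names `EVENTS`, `REPORTED_TOTAL`, and `column_totals` are hypothetical.

```python
# Columns: tweets, images, filtered tweets, sampled tweets, sampled images.
# Values are transcribed from the event-wise table above.
EVENTS = {
    "Hurricane Irma": (3_517_280, 176_972, 5_739, 4_041, 4_525),
    "Hurricane Harvey": (6_664_349, 321_435, 19_967, 4_000, 4_443),
    "Hurricane Maria": (2_953_322, 52_231, 6_597, 4_000, 4_562),
    "California wildfires": (455_311, 10_130, 1_488, 1_486, 1_589),
    "Mexico earthquake": (383_341, 7_111, 1_241, 1_239, 1_382),
    "Iraq-Iran earthquake": (207_729, 6_307, 501, 499, 600),
    "Sri Lanka floods": (41_809, 2_108, 870, 832, 1_025),
}
REPORTED_TOTAL = (14_223_141, 576_294, 36_403, 16_097, 18_126)


def column_totals(events):
    """Sum each column across all events."""
    return tuple(sum(column) for column in zip(*events.values()))


totals = column_totals(EVENTS)
print(totals == REPORTED_TOTAL)

# Average number of sampled images per sampled tweet.
images_per_tweet = totals[4] / totals[3]
print(round(images_per_tweet, 2))
```

The sampled image count exceeds the sampled tweet count because a tweet can carry more than one image.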
Data preparation for multimodal baseline
For the multimodal baseline experiments, we first combined the tweet texts and images from all events. This resulted in 24 duplicate entries (tweet ids: text and associated images). We manually checked these duplicate entries and kept the ones that were annotated properly. We changed the label “Not relevant or can’t judge” to “Not humanitarian”. In addition, because the annotations include the label “Don't know or can't judge”, we removed the entries with that label for the classification experiments. In total, this preprocessing filtered out 39 tweets and 44 associated images. The resulting dataset consists of 16058 tweet texts and 18082 images, as shown in the following table. This version of the dataset is released as version 2.0 and is available for download.
Task and label | Text | Image |
---|---|---|
Informativeness | | |
Informative | 11509 | 9374 |
Not informative | 4549 | 8708 |
Total | 16058 | 18082 |
Humanitarian | | |
Affected individuals | 472 | 562 |
Infrastructure and utility damage | 1210 | 3624 |
Injured or dead people | 486 | 110 |
Missing or found people | 40 | 14 |
Not humanitarian | 4549 | 8708 |
Other relevant information | 5954 | 2529 |
Rescue volunteering or donation effort | 3293 | 2231 |
Vehicle damage | 54 | 304 |
Total | 16058 | 18082 |
Damage Severity | | |
Little or no damage | - | 475 |
Mild damage | - | 839 |
Severe damage | - | 2212 |
Total | - | 3526 |
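The duplicate check and the resulting counts described above can be illustrated with a short Python sketch. It is not the authors' script: the duplicates in the dataset were resolved manually, so the code only flags candidate duplicates for review. The field names `tweet_id`, `image_id`, and `event`, the toy rows, and the function name `find_duplicates` are assumptions made for the example.

```python
from collections import defaultdict

# Counts reported in the text and tables above.
SAMPLED_TWEETS, SAMPLED_IMAGES = 16097, 18126
REMOVED_TWEETS, REMOVED_IMAGES = 39, 44


def find_duplicates(rows, key_fields=("tweet_id", "image_id")):
    """Group rows by key and return only the keys that occur more than once."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[field] for field in key_fields)].append(row)
    return {key: group for key, group in groups.items() if len(group) > 1}


# Toy rows with hypothetical ids and event names.
rows = [
    {"tweet_id": "t1", "image_id": "t1_0", "event": "event_a"},
    {"tweet_id": "t1", "image_id": "t1_0", "event": "event_b"},
    {"tweet_id": "t2", "image_id": "t2_0", "event": "event_a"},
]
duplicates = find_duplicates(rows)
print(sorted(duplicates))  # keys that need a manual check

# Entries left after removing duplicates and "Don't know or can't judge" labels.
remaining_tweets = SAMPLED_TWEETS - REMOVED_TWEETS
remaining_images = SAMPLED_IMAGES - REMOVED_IMAGES
print(remaining_tweets, remaining_images)
```

The remaining counts match the totals in the table, and the "Not informative" and "Not humanitarian" rows carry identical counts, which reflects the alignment of the two tasks in version 2.0.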
Social media data poses great challenges, and one of the major challenges is multimodal data in which there is no strong alignment between modalities (i.e., the text and the image of the same pair can have different labels). For this study, as a preliminary experiment, we selected only the pairs with agreed labels. This resulted in a skewed label distribution for model training. To deal with this issue, we combined the minority categories that are semantically similar or related. Specifically, we merged the “injured or dead people” and “missing or found people” categories into the “affected individuals” category. Similarly, we merged the “vehicle damage” category into the “infrastructure and utility damage” category. As a result, we are left with five categories for the humanitarian task. Please check our paper “Analysis of Social Media Data using Multimodal Deep Learning for Disaster Response” below to learn more.
For future research, we release the data splits with and without agreed labels. The data split files (see link below) contain the labels and the associated information. For the images, please download the CrisisMMD dataset.
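The agreed-label selection and category merging described above can be sketched as follows. This is a minimal illustration, not the released preprocessing code: the field names `text_label` and `image_label`, the toy pairs, and the helper names are hypothetical, while the label strings follow the humanitarian categories listed earlier.

```python
# Minority categories and the categories they are merged into.
MERGE = {
    "Injured or dead people": "Affected individuals",
    "Missing or found people": "Affected individuals",
    "Vehicle damage": "Infrastructure and utility damage",
}

# The eight humanitarian labels of version 2.0.
HUMANITARIAN_LABELS = [
    "Affected individuals",
    "Infrastructure and utility damage",
    "Injured or dead people",
    "Missing or found people",
    "Not humanitarian",
    "Other relevant information",
    "Rescue volunteering or donation effort",
    "Vehicle damage",
]


def merge_label(label):
    """Map a minority category to its merged category; keep other labels."""
    return MERGE.get(label, label)


def agreed_pairs(pairs):
    """Keep pairs whose text and image labels agree, then merge the label."""
    return [
        {**pair, "label": merge_label(pair["text_label"])}
        for pair in pairs
        if pair["text_label"] == pair["image_label"]
    ]


# Toy text-image pairs, made up for illustration.
pairs = [
    {"text_label": "Vehicle damage", "image_label": "Vehicle damage"},
    {"text_label": "Injured or dead people", "image_label": "Not humanitarian"},
    {"text_label": "Missing or found people", "image_label": "Missing or found people"},
]
selected = agreed_pairs(pairs)
print([pair["label"] for pair in selected])

merged_categories = sorted({merge_label(label) for label in HUMANITARIAN_LABELS})
print(len(merged_categories))
```

Selection is applied before merging here because the text describes the steps in that order; merging first would count some additional pairs as agreed.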
Downloads: Labeled data and other resources
- CrisisMMD dataset version v2.0: Labeled images & tweets (~1.8GB)
- Datasplit: Annotations
- Datasplit for multimodal baseline results with agreed labels (see [1]): Annotations
- CrisisMMD dataset version v1.0: Labeled images & tweets (~1.8GB), Tweet-ids (79M)
- [1] Ferda Ofli, Firoj Alam, and Muhammad Imran, Analysis of Social Media Data using Multimodal Deep Learning for Disaster Response, In Proceedings of the 17th International Conference on Information Systems for Crisis Response and Management (ISCRAM), 2020, USA. [Bibtex]
- [2] Firoj Alam, Ferda Ofli, and Muhammad Imran, CrisisMMD: Multimodal Twitter Datasets from Natural Disasters, In Proceedings of the 12th International AAAI Conference on Web and Social Media (ICWSM), 2018, Stanford, California, USA. [Bibtex]