GeoCoV19: A Dataset of Hundreds of Millions of Multilingual COVID-19 Tweets with Location Information


GeoCoV19 is a large-scale Twitter dataset containing more than 524 million multilingual tweets (as of May 1st). The dataset contains around 378K geotagged tweets and 5.4 million tweets with Place information. We extract toponyms from the user location field and tweet content and resolve them to geolocations at the country, state, or city level. With this approach, 297 million tweets are annotated with geolocation using the user location field and 452 million tweets using tweet content.

Paper info:

Please cite the following paper if you use this dataset in your research:
Umair Qazi, Muhammad Imran, Ferda Ofli. GeoCoV19: A Dataset of Hundreds of Millions of Multilingual COVID-19 Tweets with Location Information. ACM SIGSPATIAL Special, May 2020. doi: 10.1145/3404111.3404114 (ACM | arXiv | bibtex)

General statistics:

Type                                                        Number
Total number of tweets (as of May 1st, still collecting)    524,353,432
Number of unique users                                      43,463,225
Number of countries covered                                 218
Number of cities covered                                    47,328
Number of languages                                         62

Geographical information statistics:

Type                                                        Number
Number of tweets with geo-coordinates                       378,772
Number of tweets with place information                     5,495,431
Number of tweets with user location information             297,148,292
Number of tweets with location mentions in tweet-text       452,933,900

Daily tweets distribution

[Figure: Daily distribution of tweets (Feb 1 - May 1)]

Keywords and hashtags used for data collection

As of May 1st, we have used around 800 hashtags and keywords for the data collection. Below, we list some of them. The complete set of keywords can be found here: keywords

Afghanistan Coronavirus, Albania Coronavirus, Algeria Coronavirus, Andorra Coronavirus, Angola Coronavirus, Argentina Coronavirus, Armenia Coronavirus, Australia Coronavirus, Austria Coronavirus, Azerbaijan Coronavirus, Bahamas Coronavirus, Bahrain Coronavirus, Bangladesh Coronavirus, Barbados Coronavirus, Belarus Coronavirus, Belgium Coronavirus, Belize Coronavirus, Benin Coronavirus, Bhutan Coronavirus, Bolivia Coronavirus, Bosnia Herzegovina Coronavirus, Botswana Coronavirus, Brazil Coronavirus, Brunei Coronavirus, Bulgaria Coronavirus, Burkina Coronavirus, Burundi Coronavirus, Cambodia Coronavirus, Cameroon Coronavirus, Canada Coronavirus, Cape Verde Coronavirus, Central African Rep Coronavirus, Chad Coronavirus, Chile Coronavirus, China Coronavirus, Colombia Coronavirus, Comoros Coronavirus, Congo Coronavirus, Congo Coronavirus, Costa Rica Coronavirus, Croatia Coronavirus, Cuba Coronavirus, Cyprus Coronavirus, Czech Republic Coronavirus, Denmark Coronavirus, Djibouti Coronavirus, Dominica Coronavirus, Dominican Republic Coronavirus, East Timor Coronavirus, Ecuador Coronavirus, Egypt Coronavirus, El Salvador Coronavirus, Equatorial Guinea Coronavirus, Eritrea Coronavirus, Estonia Coronavirus, Ethiopia Coronavirus, Fiji Coronavirus, Finland Coronavirus, France Coronavirus,....
Qatar COVID-19, Romania COVID-19, Russian Federation COVID-19, Rwanda COVID-19, St Lucia COVID-19, Samoa COVID-19, San Marino COVID-19, Saudi Arabia COVID-19, Senegal COVID-19, Serbia COVID-19, Seychelles COVID-19, Sierra Leone COVID-19, Singapore COVID-19, Slovakia COVID-19, Slovenia COVID-19, Solomon Islands COVID-19, Somalia COVID-19, South Africa COVID-19, South Sudan COVID-19, Spain COVID-19, Sri Lanka COVID-19, Sudan COVID-19, Suriname COVID-19, Swaziland COVID-19, Sweden COVID-19, Switzerland COVID-19, Syria COVID-19, Taiwan COVID-19, Tajikistan COVID-19, Tanzania COVID-19, Thailand COVID-19, Togo COVID-19, Tonga COVID-19, Tunisia COVID-19, Turkey COVID-19, Turkmenistan COVID-19, Tuvalu COVID-19, Uganda COVID-19, Ukraine COVID-19, United Arab Emirates COVID-19, UAE COVID-19, United Kingdom COVID-19, UK COVID-19, United States COVID-19, Uruguay COVID-19, Uzbekistan COVID-19, Vanuatu COVID-19, Vatican City COVID-19, Venezuela COVID-19, Vietnam COVID-19, Yemen COVID-19, Zambia COVID-19, Zimbabwe COVID-19, ....
Covid_19, #COVID_19uk, Covid19_DE, #covid19Canada, Covid19DE, Covid19Deutschland, #covid19espana, #covid19france, #covid19Indonesia, #covid19ireland, #covid19uk, #ForcaCoronaVirus, #infocoronavirus, #kamitidaktakutviruscorona, #nCoV, nCoV, #ncov2019, nCoV2019, NeuerCoronavirus, #NeuerCoronavirus, Nouveau coronavirus, #NouveauCoronavirus, novel coronavirus, #SARSCoV2, #SARSCoV2, فيروس كورونا, #فيروس_كورونا, #كورونا, #كورونا_الجديد, #कोरोना, कोरोना, कोरोना वायरस, #कोरोना_वायरस, 코로나, #코로나, #코로나19, 코로나바이러스, #코로나바이러스, コロナ, #コロナ, #コロナウイルス, 加油武汉, #加油武汉, #新冠病毒, 新冠病毒, #新冠肺炎, 新冠肺炎, #新型コロナウイルス, #新型冠状病毒, 新型冠状病毒, 武汉加油, #武汉加油, #武汉疫情, #武汉肺炎, 武汉肺炎, #武漢肺炎, 武漢肺炎, 疫情, #疫情, #CoronaAlert, #coronavirusUP, #coronavirustelangana, #coronaviruskerala, #coronavirusmumbai, #coronavirusdelhi, #coronavirusmaharashtra, #coronavirusinindia, वूहान, #कोविड_19, #कोविड-१९, #कोरोनावायरस, breathing issues, breathing difficulties, fever, cough, pandemic, Coronavirus usa, Coronavirus US, #coronaviruscalifornia, #CoronaVirusCA, #CoronaVirusSeattle, #Coronavirusnyc , #coronavirusnewyork, #Coronavirustexas, Coronavirus, #Coronavirus, #CoronavirusOutbreak, #2019nCov, coronavirus outbreak, COVID-19, #CoronaVirusitaly, CoronaVirus Iran, CoronaVirus Korean, CoronaVirus Japan, Corona, Coronavirus, Corona virus كورونا, #كورونا, #فيروس_كورونا, #كورونا_الجديد, #فيروس_كورونا_المستجد, #كورونا_مصر, #كورونا_قطر, #كورونا_السعودية, #كورونا_العراق, #كورونا_إيران, #كورونا_البحرين, #كورونا_الأردن, #كورونا_لبنان, #كورونا_الكويت #socialdistancing, social distancing, #NovelCorona, novelcoronavirus, #ohiocoronavirus, #PánicoPorCoranovirus, social distance, coronavirus, #N95, #Swiss, #coronavirus, Italia, lombardia, #covid19italia, covid19, corona virus, coronavirus, #covid-19, #corona, #coronavirus


Tweet ids:

The format of these files is TSV, where each line contains tweet id and user id.
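
A minimal sketch of reading such a file, assuming each line holds the tweet id first and the user id second, separated by a tab, with no header row. The function name `read_id_pairs` and the sample values are hypothetical and used only for illustration.

```python
import csv
import io
from typing import Iterator, TextIO, Tuple


def read_id_pairs(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (tweet_id, user_id) pairs from a tab-separated stream.

    Assumption: two columns per line and no header row. Ids are kept
    as strings so they can be written back out unchanged.
    """
    reader = csv.reader(handle, delimiter="\t")
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        yield row[0].strip(), row[1].strip()


# Made-up sample standing in for one of the tweet id files.
sample = io.StringIO("1234567890123456789\t111\n1234567890123456790\t222\n")
pairs = list(read_id_pairs(sample))

# Keep only the tweet id column, e.g. to write one id per line.
tweet_ids = [tweet_id for tweet_id, _ in pairs]
```

For a real file, the same function can be applied to a handle opened with `open(path, newline="", encoding="utf-8")`. A plain list of tweet ids, one per line, is the kind of input that hydrators such as Twarc typically accept.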

Geo data:

The format of these files is JSON, where each line represents one tweet in this format: {tweet_id: xx, user_id: xx, created_at: xx, geo_source: xx, user_location: {one json}, geo: {one json}, tweet_locations: [array of jsons], place: {one json}}. Check the readme file for more details about these fields: Readme
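
A minimal sketch of reading such a file line by line, assuming every line is a valid JSON object with quoted keys named as in the format above. The function name `read_geo_records`, the sample records, and the `geo_source` values `source_a` and `source_b` are made-up placeholders; the actual values and the structure of the nested objects are described in the readme.

```python
import io
import json
from collections import Counter
from typing import Iterator, TextIO


def read_geo_records(handle: TextIO) -> Iterator[dict]:
    """Yield one dict per non-empty line of a line-delimited JSON stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


def make_line(tweet_id: str, geo_source: str) -> str:
    """Build one made-up record with the top-level keys listed above."""
    record = {
        "tweet_id": tweet_id,
        "user_id": "0",
        "created_at": "xx",
        "geo_source": geo_source,  # placeholder value, not a real label
        "user_location": {},
        "geo": {},
        "tweet_locations": [],
        "place": {},
    }
    return json.dumps(record)


# Made-up sample standing in for one of the geo data files.
sample = io.StringIO(
    "\n".join([
        make_line("1", "source_a"),
        make_line("2", "source_b"),
        make_line("3", "source_a"),
    ])
)

# Count how many tweets were located through each geo_source value.
counts = Counter(record["geo_source"] for record in read_geo_records(sample))
```

Reading one line at a time keeps memory use flat, which matters for files of this size; a whole-file `json.load` would not work here because each line is a separate JSON document.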

Updates:

Below we provide updates about new data subsets, any corrections made, or upcoming releases.

Date: June 11, 2020
Description: This update provides correct location information for tweets with the "place" attribute. It applies only to users who downloaded the geo data files before June 11th, 2020. A total of 5,495,123 tweets have been updated.
Update material: Update file (462MB)

Date: July 05, 2020
Description: This update provides the English subset of the whole GeoCoV19 data. We provide tweet ids and geo files only for English tweets. We rely on the Twitter-provided "lang" attribute to filter English tweets.
Update material: See the download section

Tweet hydrators

CrisisNLP (Java)
Twarc (Python)
Docnow (Desktop application)