GeoCoV19 Dataset Description
============================

The GeoCoV19 Dataset comprises two sets of files:

1. Tweet ids & user ids (TSV)
2. Tweet ids with geolocation information (JSON)

Geo files format
================

{
  "tweet_id": "1223655173056356353",
  "created_at": "Sat Feb 01 17:11:42 +0000 2020",
  "user_id": "3352471150",
  "geo_source": "user_location",
  "user_location": {
    "country_code": "br"
  },
  "geo": {},
  "place": {},
  "tweet_locations": [
    {
      "country_code": "it",
      "state": "Trentino-Alto",
      "county": "Pustertal - Val Pusteria"
    },
    {
      "country_code": "us"
    },
    {
      "country_code": "ru",
      "state": "Voronezh Oblast",
      "county": "Petropavlovsky District"
    },
    {
      "country_code": "at",
      "state": "Upper Austria",
      "county": "Braunau am Inn"
    },
    {
      "country_code": "it",
      "state": "Trentino-Alto",
      "county": "Pustertal - Val Pusteria"
    },
    {
      "country_code": "cn"
    },
    {
      "country_code": "in",
      "state": "Himachal Pradesh",
      "county": "Jubbal"
    }
  ]
}


Description of all the fields in the above JSON
===============================================

Each JSON object in a Geo file has the following eight keys:

1. tweet_id: the Twitter-provided id of the tweet.
2. created_at: the Twitter-provided "created_at" date and time, in UTC.
3. user_id: the Twitter-provided id of the user.
4. geo_source: one of four values: (i) coordinates, (ii) place, (iii) user_location, or (iv) tweet_text. The value depends on which location fields are available for the tweet, and priority is given to the most accurate one. The priority order is coordinates, place, user_location, and tweet_text. For instance, when a tweet has GPS coordinates, the value is "coordinates" even if all other location fields are present. If a tweet has no GPS, place, or user_location information, the value is "tweet_text", provided the tweet text mentions a location.
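
The priority rule for geo_source can be sketched as follows. This is only an illustration of the rule as described, not the code used to build the dataset; the function name pick_geo_source and its set-valued argument are hypothetical.

```python
from typing import Optional

# Priority order stated above, from most to least accurate.
GEO_SOURCE_PRIORITY = ["coordinates", "place", "user_location", "tweet_text"]


def pick_geo_source(available: set) -> Optional[str]:
    """Return the highest-priority source present in `available`, or None."""
    for source in GEO_SOURCE_PRIORITY:
        if source in available:
            return source
    return None


# All sources present: GPS coordinates take precedence.
print(pick_geo_source({"coordinates", "place", "user_location", "tweet_text"}))
# Only a location mention in the text is available.
print(pick_geo_source({"tweet_text"}))
```

The two calls print "coordinates" and "tweet_text" respectively.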

The remaining four keys hold values in the location_json format shown below.
Sample location_json: {"country_code":"us","state":"California","county":"San Francisco","city":"San Francisco"}.
Depending on the available granularity, any of the country_code, state, county, or city keys can be missing from a location_json.
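
Because keys can be missing, code that reads a location_json should not assume any particular key is present. A minimal sketch for finding the finest level available is given here; the helper name finest_level is hypothetical, and the coarse-to-fine ordering of the keys is an assumption based on the order in which they are listed above.

```python
from typing import Optional

# Keys of a location_json, assumed to be ordered from coarsest to finest.
LOCATION_LEVELS = ["country_code", "state", "county", "city"]


def finest_level(location_json: dict) -> Optional[str]:
    """Return the finest-grained key present in a location_json, or None if it is empty."""
    finest = None
    for level in LOCATION_LEVELS:
        if location_json.get(level):
            finest = level
    return finest


sample = {
    "country_code": "us",
    "state": "California",
    "county": "San Francisco",
    "city": "San Francisco",
}
print(finest_level(sample))                  # city
print(finest_level({"country_code": "br"}))  # country_code
print(finest_level({}))                      # None
```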

5. user_location: a "location_json" as described above, or an empty JSON {}. This field is derived from the "location" profile metadata of a Twitter user, i.e., the user-declared location given as free text, which we resolve to a location.
6. geo: a "location_json" as described above, or an empty JSON {}. It represents the "geo" field provided by Twitter; we resolve the provided latitude and longitude values to a location.
7. tweet_locations: an array of "location_json" as described above, e.g., [location_json1, location_json2], or an empty array []. This field is derived from the tweet content (i.e., the actual tweet message), in which we look for toponyms. A tweet message can mention several different locations, so the array holds one entry per toponym found in the tweet. For instance, the tweet "The UK has over 65,000 #COVID19 deaths. More than Qatar, Pakistan, and Norway." contains four location mentions, and its tweet_locations array should represent these four separately.
8. place: a "location_json" as described above, or an empty JSON {}. It represents the Twitter-provided "place" field.
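
As an illustration of how these fields fit together, the sketch below picks the location(s) from the field named by geo_source and counts the countries mentioned in tweet_locations. The mapping from geo_source values to keys (in particular "coordinates" to "geo" and "tweet_text" to "tweet_locations") is an assumption that is not stated explicitly above, and the helper names are hypothetical. The record is an abbreviated copy of the sample shown earlier.

```python
import json
from collections import Counter

# Assumed mapping from geo_source values to record keys (not stated in this document).
SOURCE_TO_KEY = {
    "coordinates": "geo",
    "place": "place",
    "user_location": "user_location",
    "tweet_text": "tweet_locations",
}


def source_locations(record: dict) -> list:
    """Return the non-empty location_json objects from the field named by geo_source."""
    key = SOURCE_TO_KEY.get(record.get("geo_source"), "")
    value = record.get(key, {})
    # tweet_locations is an array; the other fields hold a single object.
    candidates = value if isinstance(value, list) else [value]
    return [loc for loc in candidates if loc]


def mentioned_countries(record: dict) -> Counter:
    """Count country codes across the toponyms in tweet_locations."""
    return Counter(
        loc["country_code"]
        for loc in record.get("tweet_locations", [])
        if "country_code" in loc
    )


# Abbreviated version of the sample record shown earlier.
record = json.loads("""
{
  "tweet_id": "1223655173056356353",
  "geo_source": "user_location",
  "user_location": {"country_code": "br"},
  "geo": {},
  "place": {},
  "tweet_locations": [
    {"country_code": "it", "state": "Trentino-Alto", "county": "Pustertal - Val Pusteria"},
    {"country_code": "us"},
    {"country_code": "it", "state": "Trentino-Alto", "county": "Pustertal - Val Pusteria"}
  ]
}
""")

print(source_locations(record))     # [{'country_code': 'br'}]
print(mentioned_countries(record))  # Counter({'it': 2, 'us': 1})
```

Note that the same toponym can appear more than once in tweet_locations, as in the sample record, so counts reflect mentions rather than distinct places.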

If you have doubts or questions, feel free to contact us at: 
uqazi@hbku.edu.qa and mimran@hbku.edu.qa