COVID-19 Questions

This notebook explores the COVID-19 Questions dataset, which contains categorized questions that the public frequently asked during the COVID-19 pandemic.

The dataset is freely available for download from the IBM Developer Data Asset eXchange: COVID-19 Questions. This notebook is also available on Watson Studio: COVID-19 Questions Notebook

In [1]:
# Import packages
import tarfile
from os import path

import nltk
import numpy as np
import pandas as pd
import plotly.express as px
import requests
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize

# Download the tokenizer models used by word_tokenize
nltk.download('punkt')
[nltk_data] Downloading package punkt to /home/wsuser/nltk_data...
[nltk_data]   Package punkt is already up-to-date!

Download and Extract the Dataset

In [2]:
fname = 'covid-19-questions.tar.gz'
url = 'https://dax-cdn.cdn.appdomain.cloud/dax-covid-19-questions/1.0.0/' + fname
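r = requests.get(url)
r.raise_for_status()  # stop early if the download failed
# Write the archive to disk; write() returns the number of bytes written
open(fname, 'wb').write(r.content)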
Out[2]:
14900
In [3]:
# Extracting the dataset
tar = tarfile.open(fname)
tar.extractall()
tar.close()
In [4]:
# Verifying the file was extracted properly
data_path = "covid-19-questions/"
path.exists(data_path)
Out[4]:
True
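
The extraction and verification steps above can also be written with a context manager, so the archive is closed even if extraction fails. The following is a minimal sketch, not the notebook's own code: the helper name `extract_archive` and the tiny sample archive are assumptions made for illustration, so that the example runs without downloading anything.

```python
import tarfile
import tempfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a tar archive into dest_dir and return its member names.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    # The with-statement closes the archive automatically
    with tarfile.open(archive_path) as tar:
        names = tar.getnames()
        tar.extractall(path=dest_dir)
    return names


# Build a tiny stand-in archive (assumption: a real run would use fname instead)
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    sample = tmp / 'sample.tsv'
    sample.write_text('label\tutterance\n')

    archive = tmp / 'sample.tar.gz'
    with tarfile.open(archive, 'w:gz') as tar:
        tar.add(sample, arcname='sample/sample.tsv')

    out_dir = tmp / 'out'
    out_dir.mkdir()
    members = extract_archive(archive, out_dir)

    # Verify that the expected file exists after extraction
    extracted_ok = (out_dir / 'sample' / 'sample.tsv').exists()

print(members, extracted_ok)
```

Returning the member names makes it possible to check for a specific file rather than only the top-level directory.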

Explore Data

In [5]:
data = pd.read_csv('covid-19-questions/covid_19_questions.tsv', sep='\t')
In [6]:
# shape of the data
data.shape
Out[6]:
(844, 2)
In [7]:
# Display first 5 rows
data.head()
Out[7]:
   label                  utterance
0  5.2.Quarantine_visits  Can my friends visit me?
1  5.2.Quarantine_visits  Can someone come see me during quarantine?
2  5.2.Quarantine_visits  Is anyone allowed to visit during quarantine?
3  5.2.Quarantine_visits  What is a safe distance when someone brings me...
4  5.2.Quarantine_visits  Are quarantined visits allowed?
In [8]:
# Get total number of unique labels
data['label'].nunique()
Out[8]:
57
In [9]:
# Get unique labels with counts
data['label'].value_counts()
Out[9]:
Symptoms                               76
School_shutdown                        58
Testing_Locations                      54
Protecting_Against_Infection           42
Case_Count                             32
COVID_Description                      29
Testing_information                    28
Paid_Sick_Leave                        26
COVID_Comparing_Illness                25
4.2.Treatment_info                     23
1.2aboutVirus_Transmission             23
Go_Hospital_See_Doctor                 22
5.1.Quarantine_what_to_do              21
Unemployment_Application               19
Cleaning_Disinfecting                  17
Shopping_Groceries                     16
9.1masks_availability                  15
5.8.Quarantine_eligibility             15
COVID_Pets_Animals                     14
6.1govt_emergency                      14
Outside_Activities                     14
Travel_Restrictions                    14
4.1.Treatment_vaccine                  13
9.3masks_protection                    13
COVID_Illness_Length                   13
Event_Shutdowns                        12
COVID_Mortality_Rate_Likelihood        12
1.4aboutVirus_HighRiskGroups           11
13.4.About_Anezka_hate                 10
1.7aboutVirus_Incubation_Period         9
10.2contacts_handover                   9
Testing_Results                         9
13.1.About_Anezka_WhoAreYou             8
13.3.About_Anezka_Greeting              7
11.1volunteering                        7
7.4travel_risk_countries                7
5.4.Quarantine_living_with_others       7
5.5.Quarantine_social_aid               7
5.3.Quarantine_living_alone             7
5.2.Quarantine_visits                   6
7.2travel_backtoUS                      6
12.5schools_erasmus                     6
general_Help                            6
7.5travel_embassies                     5
8.3globalcovid_WHO                      5
1.6.aboutVirus_PostalServices           5
4.8.Treatment_reached_capacity          5
4.5.Treatment_hospital_capacity         5
2.3.Prevention_howto_wash_hands         5
1.5aboutVirus_USready                   5
4.10.Treatment_discovering_medicine     5
12.3schools_employees                   5
4.6.Treatment_shortnessOfBreath         4
13.2.About_Anezka_HowAreYou             4
8.2globalcovid_pandemic                 4
yes                                     3
general_ThankYou                        2
Name: label, dtype: int64
In [10]:
# Get unique labels with percentage
df_label = data['label'].value_counts(normalize=True).mul(100).round(1).astype(str) + '%'
In [11]:
df_label = df_label.reset_index(name='counts')
In [12]:
df_label.head()
Out[12]:
   index                         counts
0  Symptoms                      9.0%
1  School_shutdown               6.9%
2  Testing_Locations             6.4%
3  Protecting_Against_Infection  5.0%
4  Case_Count                    3.8%
In [13]:
df_label.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57 entries, 0 to 56
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   index   57 non-null     object
 1   counts  57 non-null     object
dtypes: object(2)
memory usage: 1.0+ KB
In [14]:
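# Convert the percentage strings back to fractions, e.g. '9.0%' -> 0.09 (requires numpy as np for np.nan)
df_label['counts'] = df_label['counts'].apply(lambda x: np.nan if x in ['-'] else x[:-1]).astype(float)/100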
In [15]:
df_label.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57 entries, 0 to 56
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   index   57 non-null     object 
 1   counts  57 non-null     float64
dtypes: float64(1), object(1)
memory usage: 1.0+ KB
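
The cells above turn the percentages into strings and then parse them back into floats. A minimal sketch of an alternative that keeps the values numeric throughout and formats them only for display is shown here. The sample labels and the column names `share` and `display` are assumptions for illustration; in the notebook, `data['label']` would take the place of `labels`.

```python
import pandas as pd

# Hypothetical sample standing in for data['label']
labels = pd.Series(
    ['Symptoms', 'Symptoms', 'Symptoms', 'Case_Count'], name='label'
)

# Fraction of rows per label, kept as floats
label_share = (
    labels.value_counts(normalize=True)
    .rename_axis('label')          # name the index so reset_index yields a 'label' column
    .reset_index(name='share')
)

# Format as percentage strings only when presenting the result
label_share['display'] = label_share['share'].map('{:.1%}'.format)

print(label_share)
```

Keeping `share` numeric means it can still be sorted, summed, or plotted, while `display` carries the human-readable form.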
In [16]:
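# Display the fractions as percentages with two decimals and hide the row index
# (Styler.hide_index() was removed in pandas 2.0; use .hide(axis='index') there)
format_dict = {'counts': '{:.2%}'}
df_label.style.format(format_dict).hide_index()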
Out[16]:
index counts
Symptoms 9.00%
School_shutdown 6.90%
Testing_Locations 6.40%
Protecting_Against_Infection 5.00%
Case_Count 3.80%
COVID_Description 3.40%
Testing_information 3.30%
Paid_Sick_Leave 3.10%
COVID_Comparing_Illness 3.00%
4.2.Treatment_info 2.70%
1.2aboutVirus_Transmission 2.70%
Go_Hospital_See_Doctor 2.60%
5.1.Quarantine_what_to_do 2.50%
Unemployment_Application 2.30%
Cleaning_Disinfecting 2.00%
Shopping_Groceries 1.90%
9.1masks_availability 1.80%
5.8.Quarantine_eligibility 1.80%
COVID_Pets_Animals 1.70%
6.1govt_emergency 1.70%
Outside_Activities 1.70%
Travel_Restrictions 1.70%
4.1.Treatment_vaccine 1.50%
9.3masks_protection 1.50%
COVID_Illness_Length 1.50%
Event_Shutdowns 1.40%
COVID_Mortality_Rate_Likelihood 1.40%
1.4aboutVirus_HighRiskGroups 1.30%
13.4.About_Anezka_hate 1.20%
1.7aboutVirus_Incubation_Period 1.10%
10.2contacts_handover 1.10%
Testing_Results 1.10%
13.1.About_Anezka_WhoAreYou 0.90%
13.3.About_Anezka_Greeting 0.80%
11.1volunteering 0.80%
7.4travel_risk_countries 0.80%
5.4.Quarantine_living_with_others 0.80%
5.5.Quarantine_social_aid 0.80%
5.3.Quarantine_living_alone 0.80%
5.2.Quarantine_visits 0.70%
7.2travel_backtoUS 0.70%
12.5schools_erasmus 0.70%
general_Help 0.70%
7.5travel_embassies 0.60%
8.3globalcovid_WHO 0.60%
1.6.aboutVirus_PostalServices 0.60%
4.8.Treatment_reached_capacity 0.60%
4.5.Treatment_hospital_capacity 0.60%
2.3.Prevention_howto_wash_hands 0.60%
1.5aboutVirus_USready 0.60%
4.10.Treatment_discovering_medicine 0.60%
12.3schools_employees 0.60%
4.6.Treatment_shortnessOfBreath 0.50%
13.2.About_Anezka_HowAreYou 0.50%
8.2globalcovid_pandemic 0.50%
yes 0.40%
general_ThankYou 0.20%
In [17]:
# Tokenize and stem every question
stemmer = PorterStemmer()
# Split each question into word and punctuation tokens
data['tokenized'] = data['utterance'].apply(word_tokenize)
# Reduce each token to its Porter stem
data['stemmed'] = data['tokenized'].apply(lambda tokens: [stemmer.stem(token) for token in tokens])
In [18]:
data.head()
Out[18]:
   label                  utterance                                          tokenized                                          stemmed
0  5.2.Quarantine_visits  Can my friends visit me?                           [Can, my, friends, visit, me, ?]                   [can, my, friend, visit, me, ?]
1  5.2.Quarantine_visits  Can someone come see me during quarantine?         [Can, someone, come, see, me, during, quaranti...  [can, someon, come, see, me, dure, quarantin, ?]
2  5.2.Quarantine_visits  Is anyone allowed to visit during quarantine?      [Is, anyone, allowed, to, visit, during, quara...  [Is, anyon, allow, to, visit, dure, quarantin, ?]
3  5.2.Quarantine_visits  What is a safe distance when someone brings me...  [What, is, a, safe, distance, when, someone, b...  [what, is, a, safe, distanc, when, someon, bri...
4  5.2.Quarantine_visits  Are quarantined visits allowed?                    [Are, quarantined, visits, allowed, ?]             [are, quarantin, visit, allow, ?]
In [19]:
case_count_df = data.loc[data['label'] == 'Case_Count']
In [20]:
case_count_df.shape
Out[20]:
(32, 4)
In [21]:
# Collect the unique tokens (case-sensitive) used in the Case_Count questions
results = set().union(*case_count_df['tokenized'])
print(results)
{'what', 'detected', 'case', 'near', 'there', 'find', 'countries', 'are', 'world', 'covid-19', 'around', 'cases', 'been', 'folks', 'city', 'area', 'US', 'MA', 'Corona', 'coronavirus', 'state', 'COVID-19', 'pennysylvania', 'the', 'Which', 'virus', 'number', 'texas', 'information', 'covid', 'throughout', 'Are', '19', 'corona', 'count', 'of', 'currently', 'whole', 'current', '?', 'Does', 'contracted', 'ar', 'COVID', 'people', 'with', 'me', 'have', 'infected', 'about', 'us', 'spread', 'for', 'my', 'has', 'sick', 'how', 'Where', 'Texas', 'cornona', 'in', 'confirmed', 'patients', 'a', 'ppl', 'county', 'any', 'died', 'Illinois', 'I', 'is', 'anyone', 'How', 'our', 'collin', 'can', 'many', 'total'}
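
The set above is case-sensitive and includes punctuation, so it holds both 'Texas' and 'texas', both 'US' and 'us', and the token '?'. As a rough sketch of one way to normalize the vocabulary, the tokens can be lowercased, punctuation-only tokens dropped, and occurrences counted. The function name `normalized_vocabulary` and the sample token lists are assumptions for illustration; in the notebook, `case_count_df['tokenized']` could be passed in instead, since iterating over that column yields one token list per question.

```python
import string
from collections import Counter


def normalized_vocabulary(token_lists):
    """Count lowercased tokens, skipping tokens made up only of punctuation.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    counts = Counter()
    for tokens in token_lists:
        for token in tokens:
            lowered = token.lower()
            # Skip tokens such as '?' that contain no letters or digits
            if all(ch in string.punctuation for ch in lowered):
                continue
            counts[lowered] += 1
    return counts


# Hypothetical sample standing in for case_count_df['tokenized']
token_lists = [
    ['How', 'many', 'cases', 'in', 'Texas', '?'],
    ['how', 'many', 'covid-19', 'cases', 'in', 'texas', '?'],
]

vocab = normalized_vocabulary(token_lists)
print(sorted(vocab))
```

A `Counter` is used instead of a plain set so that the most frequent words in a category can be read off with `vocab.most_common()`. Tokens that mix letters and punctuation, such as 'covid-19', are kept intact.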