This notebook explores the COVID-19 Questions dataset. The dataset contains categorized questions that were frequently asked by the public during the COVID-19 pandemic.
The dataset is freely available for download from the IBM Developer Data Asset eXchange: COVID-19 Questions. This notebook can also be found on Watson Studio: COVID-19 Questions Notebook.
# Import Packages
import pandas as pd
import nltk
from nltk.stem.porter import *
from nltk.tokenize import word_tokenize
nltk.download('punkt')
import plotly.express as px
import requests
import tarfile
from os import path
# Download the dataset archive from the IBM Data Asset eXchange CDN
fname = 'covid-19-questions.tar.gz'
url = 'https://dax-cdn.cdn.appdomain.cloud/dax-covid-19-questions/1.0.0/' + fname
r = requests.get(url)
# Fail fast on a bad HTTP response instead of silently writing an error page to disk
r.raise_for_status()
# Use a context manager so the file handle is always closed (the original
# open(...).write(...) left the handle open until garbage collection)
with open(fname, 'wb') as archive:
    archive.write(r.content)
# Extracting the dataset. The context manager guarantees the archive is
# closed even if extractall() raises (the original tar.close() would be
# skipped on an exception).
with tarfile.open(fname) as tar:
    tar.extractall()
# Verifying the file was extracted properly — evaluates to True when the
# extracted directory exists (displayed as notebook cell output)
data_path = "covid-19-questions/"
path.exists(data_path)
# Load the tab-separated questions file into a DataFrame.
# Columns seen below: 'utterance' (the question text) and 'label' (its category).
data = pd.read_csv('covid-19-questions/covid_19_questions.tsv', sep='\t')
# shape of the data — (rows, columns), shown as cell output
data.shape
# Display first 5 rows
data.head()
# Get total number of unique labels
data['label'].nunique()
# Get unique labels with counts
data['label'].value_counts()
# Get unique labels as percentage strings, e.g. '12.3%'
df_label = data['label'].value_counts(normalize=True).mul(100).round(1).astype(str) + '%'
df_label = df_label.reset_index(name='counts')
df_label.head()
df_label.info()
# Convert the percentage strings back to fractions: strip the trailing '%'
# and divide by 100, mapping a bare '-' to NaN.
# BUG FIX: the original used np.nan, but numpy is never imported anywhere in
# this notebook, so this line raised NameError; float('nan') is the stdlib
# equivalent and needs no extra import.
df_label['counts'] = df_label['counts'].apply(lambda x: float('nan') if x in ['-'] else x[:-1]).astype(float)/100
df_label.info()
# Render the counts column as a two-decimal percentage with the index hidden.
# NOTE(review): Styler.hide_index() was deprecated in pandas 1.4 and removed
# in 2.0 — on newer pandas use .hide(axis='index'); left as-is to keep the
# notebook working on the pandas version it was written for.
format_dict = {'counts': '{:.2%}'}
df_label.style.format(format_dict).hide_index()
# Prepare for the `Case_Count` vocabulary analysis: tokenize every question
# and reduce each token to its Porter stem.
stemmer = PorterStemmer()
# Split each question into word tokens (apply can take word_tokenize directly;
# no lambda wrapper is needed)
data['tokenized'] = data['utterance'].apply(word_tokenize)
# Stem every token of every tokenized question
data['stemmed'] = data['tokenized'].apply(lambda tokens: [stemmer.stem(token) for token in tokens])
data.head()
# Keep only the rows labelled Case_Count
case_count_df = data.loc[data['label'] == 'Case_Count']
case_count_df.shape
# Accumulate the distinct tokens across all Case_Count questions.
# An explicit loop replaces the original apply(results.update), which built
# and discarded a Series of None purely for its side effect.
results = set()
for tokens in case_count_df['tokenized']:
    results.update(tokens)
print(results)