Data Source: https://developer.ibm.com/exchanges/data/all/claim-sentences-search/

Claims are short phrases that an argument aims to prove. The goal of the Claim Sentence Search task is to detect sentences containing claims in a large corpus, given a debatable topic or motion. The dataset contains the results of the q_mc query described in the paper (sentences that mention the topic's main concept), totaling 1.49M sentences. In addition, it contains a claim sentence test set of 2.5k top-ranked sentences predicted by the dataset authors' model, along with their labels. The sentences were retrieved from a 2017 Wikipedia dump.

The dataset includes:

  • readme_mc_queries.txt – Readme of the claim sentence search results
  • readme_test_set.txt – Readme of the test set
  • q_mc_train.csv – Sentences retrieved by the q_mc query on 70 train topics
  • q_mc_heldout.csv – Sentences retrieved by the q_mc query on 30 heldout topics
  • q_mc_test.csv – Sentences retrieved by the q_mc query on 50 test topics
  • test_set.csv – Top predictions of our system along with their labels
In [1]:
from IPython.display import clear_output
In [2]:
!pip install wordcloud
clear_output()
In [3]:
!pip install gensim
clear_output()
In [4]:
!pip install --user -U nltk
clear_output()
In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud

import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *

stemmer = SnowballStemmer('english')

import nltk
nltk.download('wordnet')
[nltk_data] Downloading package wordnet to /home/dsxuser/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
Out[5]:
True
In [6]:
!wget https://dax-cdn.cdn.appdomain.cloud/dax-claim-sentences-search/1.0.2/claim-sentences-search.tar.gz -O claim-sentences-search.tar.gz && tar -zxf claim-sentences-search.tar.gz; rm claim-sentences-search.tar.gz
--2020-03-26 23:20:35--  https://dax-cdn.cdn.appdomain.cloud/dax-claim-sentences-search/1.0.2/claim-sentences-search.tar.gz
Resolving dax-cdn.cdn.appdomain.cloud (dax-cdn.cdn.appdomain.cloud)... 23.194.112.82, 23.194.112.88, 2600:1404:1400:1::687c:3930, ...
Connecting to dax-cdn.cdn.appdomain.cloud (dax-cdn.cdn.appdomain.cloud)|23.194.112.82|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 121149322 (116M) [application/x-tar]
Saving to: ‘claim-sentences-search.tar.gz’

100%[======================================>] 121,149,322 8.82MB/s   in 14s    

2020-03-26 23:20:50 (8.46 MB/s) - ‘claim-sentences-search.tar.gz’ saved [121149322/121149322]

In [7]:
ls
ADJECTIVES.xlsx  LICENSE           readme_mc_queries.txt  test_set.csv
attribution/     LICENSE.txt       readme_test_set.txt    topics.csv
data/            q_mc_heldout.csv  README.txt
LEXICON_BG.txt   q_mc_test.csv     ReleaseNotes.txt
LEXICON_UG.txt   q_mc_train.csv    SEMANTIC_CLASSES.xlsx
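Besides the training split loaded next, the archive contains the heldout and test query results and the labeled test set. They are not used further in this notebook, but can be read the same way; a minimal sketch, assuming the files were extracted to the working directory as listed above (the variable names are arbitrary):

heldout_data = pd.read_csv('q_mc_heldout.csv')    # q_mc results on the 30 heldout topics
test_topics_data = pd.read_csv('q_mc_test.csv')   # q_mc results on the 50 test topics
labeled_test_set = pd.read_csv('test_set.csv')    # top predictions with labels (see readme_test_set.txt)
print(heldout_data.shape, test_topics_data.shape, labeled_test_set.shape)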
In [8]:
# Load the training data
train_data = pd.read_csv('q_mc_train.csv')

Data Exploration and Data Visualization

In [9]:
# Shape of the training data
train_data.shape
Out[9]:
(844302, 7)
In [10]:
# Training data column names
train_data.columns
Out[10]:
Index(['id', 'topic', 'mc', 'sentence', 'suffix', 'prefix', 'url'], dtype='object')

Column name description

  1. id - the topic id, as specified in the appendix of the paper "Towards an Argumentative Content Search Engine Using Weak Supervision" (Ran Levy, Ben Bogin, Shai Gretz, Ranit Aharonov and Noam Slonim, COLING 2018)
  2. topic - the motion topic
  3. mc - the main Wikipedia concept of the topic
  4. sentence - the retrieved sentence
  5. suffix - the part of the sentence following the mention of the mc
  6. prefix - the part of the sentence preceding the mention of the mc (see the sketch after this list)
  7. url - link to the source Wikipedia article
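To make the prefix, mc and suffix columns concrete, the short sketch below prints the pieces of a single row. It assumes only the train_data frame loaded above; note that the mention in the sentence may use a different surface form than the canonical mc string.

# Inspect how prefix, mc and suffix relate to the full sentence for one row.
row = train_data.iloc[0]
print('mc      :', row['mc'])
print('prefix  :', row['prefix'])
print('suffix  :', row['suffix'])
print('sentence:', row['sentence'])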
In [11]:
# Get unique topic id
print(sorted(pd.unique(train_data['id'])))
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70]
In [12]:
# Number of unique ids
len(pd.unique(train_data['id']))
Out[12]:
70
In [13]:
# Get unique topics
sorted(pd.unique(train_data['topic']))
Out[13]:
['Academic freedom is not absolute',
 'Assisted suicide should be a criminal offence',
 'Casinos bring more harm than good',
 'Chain stores bring more harm than good',
 'Child labor should be legalized',
 'Coaching brings more harm than good',
 'Ecotourism brings more harm than good',
 'Generic drugs bring more harm than good',
 'Global warming is a major threat',
 'Homeschooling brings more good than harm',
 'IKEA brings more harm than good',
 'Illegal immigration brings more harm than good',
 'Internet censorship brings more good than harm',
 'Internet cookies bring more harm than good',
 'Magnet schools bring more harm than good',
 'Mixed-use development is beneficial',
 'Nationalism does more harm than good',
 'Newspapers are outdated',
 'Operation Cast Lead was justified',
 'PayPal brings more good than harm',
 'Private universities bring more good than harm',
 'Quebecan independence is justified',
 'Reality television does more harm than good',
 'Religion does more harm than good',
 'Same sex marriage brings more good than harm',
 'Science is a major threat',
 'Security hackers do more harm than good',
 'Socialism brings more harm than good',
 'Suicide should be a criminal offence',
 'The 2003 invasion of Iraq was justified',
 'The Internet archive brings more harm than good',
 'The Israeli disengagement from Gaza brought more harm than good',
 'The alternative vote is advantageous',
 'The atomic bombings of Hiroshima and Nagasaki were justified',
 'The free market brings more good than harm',
 'The internet brings more harm than good',
 'The paralympic games bring more good than harm',
 'The right to strike brings more harm than good',
 'Tower blocks are advantageous',
 'Virtual reality brings more harm than good',
 'We should abandon Youtube',
 'We should abandon disposable diapers',
 'We should abolish intellectual property rights',
 'We should abolish standardized tests',
 'We should abolish the US Electoral College',
 'We should adopt libertarianism',
 'We should adopt vegetarianism',
 'We should ban Greyhound racing',
 'We should ban beauty contests',
 'We should ban boxing',
 'We should ban child actors',
 'We should ban cosmetic surgery',
 'We should ban extreme sports',
 'We should ban hate sites',
 'We should ban online advertising',
 'We should ban private military companies',
 'We should ban the sale of violent video games to minors',
 'We should disband ASEAN',
 'We should disband the United Nations',
 'We should end daylight saving times',
 'We should further exploit geothermal energy',
 'We should legalize doping in sport',
 'We should lower the age of consent',
 'We should not subsidize single parents',
 'We should privatize future energy production',
 'We should protect endangered species',
 'We should subsidize Habitat for Humanity International',
 'We should subsidize public art',
 'We should subsidize renewable energy',
 'We should subsidize student loans']
In [14]:
# Number of unique topics
len(pd.unique(train_data['topic']))
Out[14]:
70
In [15]:
# Number of data points for each (id, topic) pair (first five rows shown)
train_data.groupby(['id','topic']).size().reset_index().rename(columns={0:'count'}).head()
Out[15]:
   id                                               topic  count
0   1  We should ban the sale of violent video games ...    341
1   2                  We should legalize doping in sport    703
2   3                                We should ban boxing  24555
3   4      We should abolish intellectual property rights   6808
4   5                We should protect endangered species   6636
In [16]:
# Total number of words across all training sentences
train_data['sentence'].apply(lambda x: len(x.split(' '))).sum()
Out[16]:
20222150
In [17]:
# Get the unique main concepts (mc)
main_content = pd.unique(train_data['mc'])
In [18]:
main_content
Out[18]:
array(['YouTube', 'Video game controversies', 'Libertarianism',
       'Global warming', 'Assisted suicide', 'Gaza War (2008–09)',
       'Tower block', 'Homeschooling', 'Beauty pageant', 'Ecotourism',
       'Nationalism', 'Atomic bombings of Hiroshima and Nagasaki',
       'Academic freedom', 'Casino', 'Doping in sport', 'United Nations',
       'Age of consent', 'Hate speech', 'Energy development',
       'Independence', 'Child labour', 'Science', 'Standardized test',
       'Boxing', 'Vegetarianism', 'Paralympic Games', 'Chain store',
       'Habitat for Humanity', 'Plastic surgery', 'Public art', 'IKEA',
       'Newspaper', 'Online advertising', 'Internet',
       'Mixed-use development', 'Greyhound racing', 'Illegal immigration',
       'Private university', 'Renewable energy',
       'Israeli disengagement from Gaza',
       'Association of Southeast Asian Nations', 'Daylight saving time',
       'Single parent', 'Intellectual property', 'Geothermal energy',
       'Private military company', 'Extreme sport', 'Generic drug',
       'Coaching', 'Electoral College (United States)', 'Diaper',
       'Endangered species', 'Suicide', 'PayPal', 'Internet Archive',
       '2003 invasion of Iraq', 'Virtual reality', 'Free market',
       'Same-sex marriage', 'Security hacker', 'HTTP cookie',
       'Reality television', 'Internet censorship', 'Magnet school',
       'Socialism', 'Strike action', 'Child actor', 'Religion',
       'Student loan', 'Instant-runoff voting'], dtype=object)
In [19]:
# Number of sentences per main concept
plt.figure(figsize=(20,10))
train_data.mc.value_counts().plot(kind='bar');
In [20]:
# Get all data points whose main concept (mc) is 'HTTP cookie'
http_cookies_data = train_data[train_data['mc'] == 'HTTP cookie']
In [21]:
# Total data points under the topic
http_cookies_data.shape
Out[21]:
(220, 7)
In [22]:
http_cookies_data.head()
Out[22]:
       | id | topic | mc | sentence | suffix | prefix | url
662388 | 58 | Internet cookies bring more harm than good | HTTP cookie | The ''Common Domain Cookie'' is a secure brows... | scoped to the common domain. | The ''Common Domain Cookie'' is a secure | https://en.wikipedia.org/wiki/SAML_2.0
662389 | 58 | Internet cookies bring more harm than good | HTTP cookie | This was possible even if a customer enabled p... | [REF]. | This was possible even if a customer enabled p... | https://en.wikipedia.org/wiki/Dell
662390 | 58 | Internet cookies bring more harm than good | HTTP cookie | Zedo uses HTTP cookies to track users' browsin... | to track users' browsing and advertisement vi... | Zedo uses | https://en.wikipedia.org/wiki/Zedo
662391 | 58 | Internet cookies bring more harm than good | HTTP cookie | Zedo uses an HTTP cookie to track users' brows... | to track users' browsing history resulting in... | Zedo uses | https://en.wikipedia.org/wiki/Zedo
662392 | 58 | Internet cookies bring more harm than good | HTTP cookie | The browser supports most common web technolog... | , forms, CSS, as well as basic JavaScript capa... | The browser supports most common web technolog... | https://en.wikipedia.org/wiki/PlayStation_Port...
In [23]:
# Total number of words 
print(http_cookies_data['sentence'].apply(lambda x: len(x.split(' '))).sum())
5258
In [24]:
def input_preprocess(inp_sentence):
    # Tokenize and lowercase the sentence, then keep only tokens that are
    # not gensim stop words and are longer than three characters.
    processed_result = []
    for token in gensim.utils.simple_preprocess(inp_sentence):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            processed_result.append(token)
    return processed_result
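As a quick sanity check, input_preprocess can be applied to a single made-up sentence; the tokens it returns are lowercased, with gensim stop words and tokens of three characters or fewer removed.

# Example on a made-up sentence (not taken from the dataset).
print(input_preprocess('Internet cookies are widely used to track the browsing history of users.'))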
In [25]:
# Tokenize the sentences and remove stop words and short tokens
processed_http_cookies = http_cookies_data['sentence'].map(input_preprocess)
In [26]:
# Flatten the per-sentence token lists into a single list of tokens
token_http_cookies = []
for i in processed_http_cookies:
    for j in i:
        token_http_cookies.append(j)
In [27]:
# Number of unique words after preprocessing
len(set(token_http_cookies))
Out[27]:
1093
In [28]:
unique_tokens_http = set(token_http_cookies)

WordCloud creation

In [29]:
unique_string = " ".join(unique_tokens_http)
wordcloud = WordCloud(width = 1000, height = 500, background_color="white").generate(unique_string)
plt.figure(figsize=(15,8))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
plt.close()
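Joining the set of unique tokens gives every word equal weight in the cloud. An alternative sketch, using the token_http_cookies list built above and WordCloud's generate_from_frequencies, weights words by how often they occur:

from collections import Counter

# Build the cloud from token frequencies instead of a space-joined set of unique tokens.
token_counts = Counter(token_http_cookies)
wordcloud_freq = WordCloud(width=1000, height=500, background_color="white").generate_from_frequencies(token_counts)
plt.figure(figsize=(15, 8))
plt.imshow(wordcloud_freq)
plt.axis("off")
plt.show()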

Topic Modelling

In [30]:
# Pre-process all training sentences (tokenize, remove stop words and short tokens)
processed_docs = train_data['sentence'].map(input_preprocess)
In [31]:
processed_docs[:5]
Out[31]:
0    [second, annual, sidemen, charity, match, held...
1    [videos, posted, sites, youtube, generate, con...
2    [december, albarn, released, second, teaser, t...
3    [videos, later, freely, offered, tankian, webs...
4    [november, vidster, user, youtube, created, co...
Name: sentence, dtype: object
In [32]:
# Create a dictionary containing the mapping between words and integer ids
dictionary = gensim.corpora.Dictionary(processed_docs)
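The dictionary can be queried directly in both directions, id to token and token to id; a small sketch using only entries that are already in the dictionary:

# Round-trip lookup: id -> token via dictionary[id], token -> id via dictionary.token2id.
some_id = next(iter(dictionary.keys()))
some_token = dictionary[some_id]
print(some_id, '->', some_token, '->', dictionary.token2id[some_token])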
In [33]:
# Preview the first few (id, word) pairs in the dictionary
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break
0 annual
1 charity
2 held
3 match
4 second
5 sidemen
6 stars
7 taking
8 youtube
9 armenian
10 azerbaijani
In [34]:
# Total number of unique words in the dictionary
count = 0
for k, v in dictionary.iteritems():
    count += 1
    
print('Total number of words in the dictionary: ', count)
Total number of words in the dictionary:  278588
In [35]:
# Filter the dictionary: drop tokens that appear in fewer than 15 documents or in more than
# 50% of the documents, then keep only the 10,000 most frequent remaining tokens
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=10000)
In [36]:
count = 0
for k, v in dictionary.iteritems():
    count += 1
    
print('Total number of words in the dictionary after filtering process: ', count)
Total number of words in the dictionary after filtering process:  10000
In [37]:
# Convert each document into the bag-of-words format: a list of (token_id, token_count) tuples.
# Each tuple indicates that the token with the given id appears the given number of times in the document.
# For example, (7, 1) indicates that the token with id 7 appears once (the ids are decoded below).
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
bow_corpus[100]
Out[37]:
[(7, 1), (45, 1), (75, 1), (103, 1), (475, 1), (476, 1)]
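To read the bag-of-words tuples more easily, the ids can be mapped back to words through the dictionary:

# Decode the bag-of-words representation of document 100 into (word, count) pairs.
print([(dictionary[token_id], count) for token_id, count in bow_corpus[100]])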
In [ ]:
# LDA model
# The parameters are the bag-of-words corpus, the number of requested latent topics,
# the dictionary mapping word ids to words, and the number of passes through the corpus.
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=5, id2word=dictionary, passes=1)
In [ ]:
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))
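Once trained, the model can also infer a topic distribution for an unseen sentence; a minimal sketch (the example sentence below is made up, and the output depends on the fitted model):

# Infer the topic mixture of a new sentence with the trained LDA model.
new_sentence = 'Online advertising relies on cookies to track the browsing habits of users.'
new_bow = dictionary.doc2bow(input_preprocess(new_sentence))
for topic_id, prob in lda_model.get_document_topics(new_bow):
    print('Topic: {} \tProbability: {:.3f}'.format(topic_id, prob))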