Data Source: https://developer.ibm.com/exchanges/data/all/claim-sentences-search/

Claims are short phrases that an argument aims to prove. The goal of the Claim Sentence Search task is to detect sentences containing claims in a large corpus, given a debatable topic or motion. The dataset contains the results of the q_mc query described in the paper (sentences that mention the topic's main concept), totaling 1.49M sentences. In addition, it contains a claim sentence test set of 2.5k top-ranked sentences predicted by the dataset authors' model, along with their labels. The sentences were retrieved from a 2017 Wikipedia dump.

The dataset includes:

  • readme_mc_queries.txt – Readme of the claim sentence search results
  • readme_test_set.txt – Readme of the test set
  • q_mc_train.csv – Sentences retrieved by the q_mc query on 70 train topics
  • q_mc_heldout.csv – Sentences retrieved by the q_mc query on 30 heldout topics
  • q_mc_test.csv – Sentences retrieved by the q_mc query on 50 test topics
  • test_set.csv – Top predictions of our system along with their labels
In [1]:
from IPython.display import clear_output
In [2]:
!pip install wordcloud
clear_output()
In [3]:
!pip install gensim
clear_output()
In [4]:
!pip install --user -U nltk
clear_output()
In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud

import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *

stemmer = SnowballStemmer('english')

import nltk
nltk.download('wordnet')
[nltk_data] Downloading package wordnet to /home/dsxuser/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
Out[5]:
True
In [6]:
!wget https://dax-cdn.cdn.appdomain.cloud/dax-claim-sentences-search/1.0.2/claim-sentences-search.tar.gz -O claim-sentences-search.tar.gz && tar -zxf claim-sentences-search.tar.gz; rm claim-sentences-search.tar.gz
--2020-03-26 23:20:35--  https://dax-cdn.cdn.appdomain.cloud/dax-claim-sentences-search/1.0.2/claim-sentences-search.tar.gz
Resolving dax-cdn.cdn.appdomain.cloud (dax-cdn.cdn.appdomain.cloud)... 23.194.112.82, 23.194.112.88, 2600:1404:1400:1::687c:3930, ...
Connecting to dax-cdn.cdn.appdomain.cloud (dax-cdn.cdn.appdomain.cloud)|23.194.112.82|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 121149322 (116M) [application/x-tar]
Saving to: ‘claim-sentences-search.tar.gz’

100%[======================================>] 121,149,322 8.82MB/s   in 14s    

2020-03-26 23:20:50 (8.46 MB/s) - ‘claim-sentences-search.tar.gz’ saved [121149322/121149322]

In [7]:
ls
ADJECTIVES.xlsx  LICENSE           readme_mc_queries.txt  test_set.csv
attribution/     LICENSE.txt       readme_test_set.txt    topics.csv
data/            q_mc_heldout.csv  README.txt
LEXICON_BG.txt   q_mc_test.csv     ReleaseNotes.txt
LEXICON_UG.txt   q_mc_train.csv    SEMANTIC_CLASSES.xlsx
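Besides the training split loaded next, the archive contains the heldout and test query results and the labeled test set. They are not used further in this notebook, but can be read the same way; a minimal sketch, assuming the files were extracted to the working directory as listed above (the variable names are arbitrary):

heldout_data = pd.read_csv('q_mc_heldout.csv')    # q_mc results on the 30 heldout topics
test_topics_data = pd.read_csv('q_mc_test.csv')   # q_mc results on the 50 test topics
labeled_test_set = pd.read_csv('test_set.csv')    # top predictions with labels (see readme_test_set.txt)
print(heldout_data.shape, test_topics_data.shape, labeled_test_set.shape)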
In [8]:
# Load the training data
train_data = pd.read_csv('q_mc_train.csv')

Data Exploration and Data Visualization

In [9]:
# Shape of the training data
train_data.shape
Out[9]:
(844302, 7)
In [10]:
# Training data column names
train_data.columns
Out[10]:
Index(['id', 'topic', 'mc', 'sentence', 'suffix', 'prefix', 'url'], dtype='object')

Column name description

  1. id - the topic id, as specified in the appendix of the paper "Towards an Argumentative Content Search Engine Using Weak Supervision" (Ran Levy, Ben Bogin, Shai Gretz, Ranit Aharonov and Noam Slonim, COLING 2018)
  2. topic - the motion topic
  3. mc - the main Wikipedia concept of the topic
  4. sentence - the retrieved sentence
  5. suffix - the part of the sentence following the mention of the mc
  6. prefix - the part of the sentence preceding the mention of the mc (see the sketch after this list)
  7. url - link to the source Wikipedia article
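To make the prefix, mc and suffix columns concrete, the short sketch below prints the pieces of a single row. It assumes only the train_data frame loaded above; note that the mention in the sentence may use a different surface form than the canonical mc string.

# Inspect how prefix, mc and suffix relate to the full sentence for one row.
row = train_data.iloc[0]
print('mc      :', row['mc'])
print('prefix  :', row['prefix'])
print('suffix  :', row['suffix'])
print('sentence:', row['sentence'])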
In [11]:
# Get unique topic id
print(sorted(pd.unique(train_data['id'])))
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70]
In [12]:
# Number of unique ids
len(pd.unique(train_data['id']))
Out[12]:
70
In [13]:
# Get unique topics
sorted(pd.unique(train_data['topic']))
Out[13]:
['Academic freedom is not absolute',
 'Assisted suicide should be a criminal offence',
 'Casinos bring more harm than good',
 'Chain stores bring more harm than good',
 'Child labor should be legalized',
 'Coaching brings more harm than good',
 'Ecotourism brings more harm than good',
 'Generic drugs bring more harm than good',
 'Global warming is a major threat',
 'Homeschooling brings more good than harm',
 'IKEA brings more harm than good',
 'Illegal immigration brings more harm than good',
 'Internet censorship brings more good than harm',
 'Internet cookies bring more harm than good',
 'Magnet schools bring more harm than good',
 'Mixed-use development is beneficial',
 'Nationalism does more harm than good',
 'Newspapers are outdated',
 'Operation Cast Lead was justified',
 'PayPal brings more good than harm',
 'Private universities bring more good than harm',
 'Quebecan independence is justified',
 'Reality television does more harm than good',
 'Religion does more harm than good',
 'Same sex marriage brings more good than harm',
 'Science is a major threat',
 'Security hackers do more harm than good',
 'Socialism brings more harm than good',
 'Suicide should be a criminal offence',
 'The 2003 invasion of Iraq was justified',
 'The Internet archive brings more harm than good',
 'The Israeli disengagement from Gaza brought more harm than good',
 'The alternative vote is advantageous',
 'The atomic bombings of Hiroshima and Nagasaki were justified',
 'The free market brings more good than harm',
 'The internet brings more harm than good',
 'The paralympic games bring more good than harm',
 'The right to strike brings more harm than good',
 'Tower blocks are advantageous',
 'Virtual reality brings more harm than good',
 'We should abandon Youtube',
 'We should abandon disposable diapers',
 'We should abolish intellectual property rights',
 'We should abolish standardized tests',
 'We should abolish the US Electoral College',
 'We should adopt libertarianism',
 'We should adopt vegetarianism',
 'We should ban Greyhound racing',
 'We should ban beauty contests',
 'We should ban boxing',
 'We should ban child actors',
 'We should ban cosmetic surgery',
 'We should ban extreme sports',
 'We should ban hate sites',
 'We should ban online advertising',
 'We should ban private military companies',
 'We should ban the sale of violent video games to minors',
 'We should disband ASEAN',
 'We should disband the United Nations',
 'We should end daylight saving times',
 'We should further exploit geothermal energy',
 'We should legalize doping in sport',
 'We should lower the age of consent',
 'We should not subsidize single parents',
 'We should privatize future energy production',
 'We should protect endangered species',
 'We should subsidize Habitat for Humanity International',
 'We should subsidize public art',
 'We should subsidize renewable energy',
 'We should subsidize student loans']
In [14]:
# Number of unique topics
len(pd.unique(train_data['topic']))
Out[14]:
70
In [15]:
# Number of data points for each (id, topic) pair (first five rows shown)
train_data.groupby(['id','topic']).size().reset_index().rename(columns={0:'count'}).head()
Out[15]:
   id                                               topic  count
0   1  We should ban the sale of violent video games ...    341
1   2                  We should legalize doping in sport    703
2   3                                We should ban boxing  24555
3   4      We should abolish intellectual property rights   6808
4   5                We should protect endangered species   6636
In [16]:
# Total number of words across all training sentences
train_data['sentence'].apply(lambda x: len(x.split(' '))).sum()
Out[16]:
20222150
In [17]:
# Get the unique main concepts (mc)
main_content = pd.unique(train_data['mc'])
In [18]:
main_content
Out[18]:
array(['YouTube', 'Video game controversies', 'Libertarianism',
       'Global warming', 'Assisted suicide', 'Gaza War (2008–09)',
       'Tower block', 'Homeschooling', 'Beauty pageant', 'Ecotourism',
       'Nationalism', 'Atomic bombings of Hiroshima and Nagasaki',
       'Academic freedom', 'Casino', 'Doping in sport', 'United Nations',
       'Age of consent', 'Hate speech', 'Energy development',
       'Independence', 'Child labour', 'Science', 'Standardized test',
       'Boxing', 'Vegetarianism', 'Paralympic Games', 'Chain store',
       'Habitat for Humanity', 'Plastic surgery', 'Public art', 'IKEA',
       'Newspaper', 'Online advertising', 'Internet',
       'Mixed-use development', 'Greyhound racing', 'Illegal immigration',
       'Private university', 'Renewable energy',
       'Israeli disengagement from Gaza',
       'Association of Southeast Asian Nations', 'Daylight saving time',
       'Single parent', 'Intellectual property', 'Geothermal energy',
       'Private military company', 'Extreme sport', 'Generic drug',
       'Coaching', 'Electoral College (United States)', 'Diaper',
       'Endangered species', 'Suicide', 'PayPal', 'Internet Archive',
       '2003 invasion of Iraq', 'Virtual reality', 'Free market',
       'Same-sex marriage', 'Security hacker', 'HTTP cookie',
       'Reality television', 'Internet censorship', 'Magnet school',
       'Socialism', 'Strike action', 'Child actor', 'Religion',
       'Student loan', 'Instant-runoff voting'], dtype=object)
In [19]:
# Number of sentences per main concept
plt.figure(figsize=(20,10))
train_data.mc.value_counts().plot(kind='bar');
In [20]:
# Get all data points whose main concept (mc) is 'HTTP cookie'
http_cookies_data = train_data[train_data['mc'] == 'HTTP cookie']
In [21]:
# Total data points under the topic
http_cookies_data.shape
Out[21]:
(220, 7)
In [22]:
http_cookies_data.head()
Out[22]:
       | id | topic | mc | sentence | suffix | prefix | url
662388 | 58 | Internet cookies bring more harm than good | HTTP cookie | The ''Common Domain Cookie'' is a secure brows... | scoped to the common domain. | The ''Common Domain Cookie'' is a secure | https://en.wikipedia.org/wiki/SAML_2.0
662389 | 58 | Internet cookies bring more harm than good | HTTP cookie | This was possible even if a customer enabled p... | [REF]. | This was possible even if a customer enabled p... | https://en.wikipedia.org/wiki/Dell
662390 | 58 | Internet cookies bring more harm than good | HTTP cookie | Zedo uses HTTP cookies to track users' browsin... | to track users' browsing and advertisement vi... | Zedo uses | https://en.wikipedia.org/wiki/Zedo
662391 | 58 | Internet cookies bring more harm than good | HTTP cookie | Zedo uses an HTTP cookie to track users' brows... | to track users' browsing history resulting in... | Zedo uses | https://en.wikipedia.org/wiki/Zedo
662392 | 58 | Internet cookies bring more harm than good | HTTP cookie | The browser supports most common web technolog... | , forms, CSS, as well as basic JavaScript capa... | The browser supports most common web technolog... | https://en.wikipedia.org/wiki/PlayStation_Port...
In [23]:
# Total number of words 
print(http_cookies_data['sentence'].apply(lambda x: len(x.split(' '))).sum())
5258
In [24]:
def input_preprocess(inp_sentence):
    # Tokenize and lowercase the sentence, then keep only tokens that are
    # not gensim stop words and are longer than three characters.
    processed_result = []
    for token in gensim.utils.simple_preprocess(inp_sentence):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            processed_result.append(token)
    return processed_result
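As a quick sanity check, input_preprocess can be applied to a single made-up sentence; the tokens it returns are lowercased, with gensim stop words and tokens of three characters or fewer removed.

# Example on a made-up sentence (not taken from the dataset).
print(input_preprocess('Internet cookies are widely used to track the browsing history of users.'))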
In [25]:
# Tokenize the sentences and remove stop words and short tokens
processed_http_cookies = http_cookies_data['sentence'].map(input_preprocess)
In [26]:
# Flatten the per-sentence token lists into a single list of tokens
token_http_cookies = []
for i in processed_http_cookies:
    for j in i:
        token_http_cookies.append(j)
In [27]:
# Number of unique words after preprocessing
len(set(token_http_cookies))
Out[27]:
1093
In [28]:
unique_tokens_http = set(token_http_cookies)

WordCloud creation

In [29]:
unique_string = " ".join(unique_tokens_http)
wordcloud = WordCloud(width = 1000, height = 500, background_color="white").generate(unique_string)
plt.figure(figsize=(15,8))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
plt.close()
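Joining the set of unique tokens gives every word equal weight in the cloud. An alternative sketch, using the token_http_cookies list built above and WordCloud's generate_from_frequencies, weights words by how often they occur:

from collections import Counter

# Build the cloud from token frequencies instead of a space-joined set of unique tokens.
token_counts = Counter(token_http_cookies)
wordcloud_freq = WordCloud(width=1000, height=500, background_color="white").generate_from_frequencies(token_counts)
plt.figure(figsize=(15, 8))
plt.imshow(wordcloud_freq)
plt.axis("off")
plt.show()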

Topic Modelling

In [30]:
# Pre-process all training sentences (tokenize, remove stop words and short tokens)
processed_docs = train_data['sentence'].map(input_preprocess)
In [31]:
processed_docs[:5]
Out[31]:
0    [second, annual, sidemen, charity, match, held...
1    [videos, posted, sites, youtube, generate, con...
2    [december, albarn, released, second, teaser, t...
3    [videos, later, freely, offered, tankian, webs...
4    [november, vidster, user, youtube, created, co...
Name: sentence, dtype: object
In [32]:
# Create a dictionary containing the mapping between words and integer ids
dictionary = gensim.corpora.Dictionary(processed_docs)
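The dictionary can be queried directly in both directions, id to token and token to id; a small sketch using only entries that are already in the dictionary:

# Round-trip lookup: id -> token via dictionary[id], token -> id via dictionary.token2id.
some_id = next(iter(dictionary.keys()))
some_token = dictionary[some_id]
print(some_id, '->', some_token, '->', dictionary.token2id[some_token])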
In [33]:
# Preview the first few (id, word) pairs in the dictionary
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break
0 annual
1 charity
2 held
3 match
4 second
5 sidemen
6 stars
7 taking
8 youtube
9 armenian
10 azerbaijani
In [34]:
# Total number of unique words in the dictionary
count = 0
for k, v in dictionary.iteritems():
    count += 1
    
print('Total number of words in the dictionary: ', count)
Total number of words in the dictionary:  278588
In [35]:
# Filter the dictionary: drop tokens that appear in fewer than 15 documents or in more than
# 50% of the documents, then keep only the 10,000 most frequent remaining tokens
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=10000)
In [36]:
count = 0
for k, v in dictionary.iteritems():
    count += 1
    
print('Total number of words in the dictionary after filtering process: ', count)
Total number of words in the dictionary after filtering process:  10000
In [37]:
# Convert each document into the bag-of-words format: a list of (token_id, token_count) tuples.
# Each tuple indicates that the token with the given id appears the given number of times in the document.
# For example, (7, 1) indicates that the token with id 7 appears once (the ids are decoded below).
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
bow_corpus[100]
Out[37]:
[(7, 1), (45, 1), (75, 1), (103, 1), (475, 1), (476, 1)]
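To read the bag-of-words tuples more easily, the ids can be mapped back to words through the dictionary:

# Decode the bag-of-words representation of document 100 into (word, count) pairs.
print([(dictionary[token_id], count) for token_id, count in bow_corpus[100]])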
In [ ]:
# LDA model
# The parameters are the bag-of-words corpus, the number of requested latent topics,
# the dictionary mapping word ids to words, and the number of passes through the corpus.
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=5, id2word=dictionary, passes=1)
In [ ]:
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))
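Once trained, the model can also infer a topic distribution for an unseen sentence; a minimal sketch (the example sentence below is made up, and the output depends on the fitted model):

# Infer the topic mixture of a new sentence with the trained LDA model.
new_sentence = 'Online advertising relies on cookies to track the browsing habits of users.'
new_bow = dictionary.doc2bow(input_preprocess(new_sentence))
for topic_id, prob in lda_model.get_document_topics(new_bow):
    print('Topic: {} \tProbability: {:.3f}'.format(topic_id, prob))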