Concept Abstractness

In recent decades, the influence of psycholinguistic properties of words on cognitive processes has become a major topic of scientific inquiry. Among the most studied psycholinguistic attributes are concreteness, familiarity, imagery, and average age of acquisition. Abstractness quantifies the degree to which an expression denotes an entity that cannot be directly perceived by the human senses. For example, the word "feminism" is usually perceived as abstract, whereas the word "screwdriver" is associated with a concrete meaning.

We introduce a weakly supervised approach for inferring the property of abstractness of words and expressions in the complete absence of labeled data. Exploiting only minimal linguistic clues and the contextual usage of a concept as manifested in textual data, we train sufficiently powerful classifiers, obtaining high correlation with human labels. The released dataset contains 300K Wikipedia concepts automatically rated for their degree of abstractness.

In this notebook we explore the Concept Abstractness dataset. The dataset can be obtained for free from the IBM Developer Data Asset Exchange.

Table of Contents:

Loading the Data
Data Visualization
Data Cleaning and Manipulation
Prediction
Explanation and interpretation of the data results

Loading the Data

In [1]:
import requests
import tarfile
from os import path
# Downloading the dataset
# specifying the archive file name (a gzipped tarball)
fname = 'concept-abstractness.tar.gz'
url = 'https://dax-cdn.cdn.appdomain.cloud/dax-concept-abstractness/1.0.2/' + fname
r = requests.get(url)
open(fname, 'wb').write(r.content)
Out[1]:
3544847
In [2]:
# Extracting the dataset
tar = tarfile.open(fname)
tar.extractall()
tar.close()
In [3]:
# Verifying the file was extracted properly
data_path = "prediction_unigrams.csv"
path.exists(data_path)
Out[3]:
True
In [4]:
# Load the dataset with pandas (aliased as 'pd')
import pandas as pd

# Read the extracted CSV into a DataFrame
model = pd.read_csv(data_path)
# Preview the first 5 rows of the loaded data
model.head()
Out[4]:
Concept Score
0 aabavanan 0.224343
1 aabenraa 0.131792
2 aabey 0.300266
3 aachen 0.171785
4 aachtopf 0.089984
In [5]:
# Check the size of the data
len(model)
Out[5]:
99954
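As an aside, an individual concept's rating can be looked up with boolean indexing. The miniature frame below stands in for the real `model`; the "aachen" score matches the preview above, while the other two rows are purely illustrative:

```python
import pandas as pd

# Toy frame mirroring the dataset's two columns
# ("aachen" is from the real data; the other scores are made up)
model = pd.DataFrame({
    "Concept": ["aachen", "feminism", "screwdriver"],
    "Score": [0.171785, 0.91, 0.12],
})

# Boolean indexing returns the row(s) whose Concept matches
row = model[model["Concept"] == "aachen"]
score = row["Score"].iloc[0]
```

The same pattern works against the full 99,954-row frame loaded above.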

Data Visualization

In [6]:
# Visualize the distribution of abstractness scores
import numpy as np
from matplotlib import pyplot as plt
In [7]:
# Generate a histogram of the abstractness scores
plt.rcParams['figure.dpi'] = 50
X = model['Score']
# An "interface" to the matplotlib.axes.Axes.hist() method
n, bins, patches = plt.hist(x=X, bins='auto', color='#0504aa', alpha=0.7, rwidth=0.85)
plt.grid(axis='y', alpha=0.75)
plt.xlabel('Abstractness score')
plt.ylabel('Frequency')
plt.title('Concept Abstractness')
maxfreq = n.max()
# Set a clean upper y-axis limit ('top' replaces the deprecated 'ymax' argument)
plt.ylim(top=np.ceil(maxfreq / 10) * 10 if maxfreq % 10 else maxfreq + 10)
Out[7]:
(0.0, 2990.0)
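Beyond the histogram, `Series.describe()` summarizes the distribution numerically in one call. A sketch on a toy stand-in for `model['Score']` (the five values below are illustrative, not the real data):

```python
import pandas as pd

# Toy scores standing in for model['Score'] (real values span roughly 0-1)
scores = pd.Series([0.22, 0.13, 0.30, 0.17, 0.09])

# describe() reports count, mean, std, min, quartiles and max
summary = scores.describe()
mean = scores.mean()
```

Running `model['Score'].describe()` on the real frame gives the same summary for all 99,954 concepts.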

Data Cleaning and Manipulation

In [8]:
# Load the test data
import requests
import tarfile
from os import path
# Downloading the dataset
# specifying the archive file name (a gzipped tarball)
fname = 'thematic-clustering-of-sentences.tar.gz'
url = 'http://s3.us-south.cloud-object-storage.appdomain.cloud/dax-assets-dev/dax-thematic-clustering-of-sentences/1.0.0/' + fname
r = requests.get(url)
open(fname, 'wb').write(r.content)
# Extracting the dataset
tar = tarfile.open(fname)
tar.extractall()
tar.close()
# Verifying the file was extracted properly
data_path = "dataset.csv"
path.exists(data_path)
Out[8]:
True
In [9]:
# Read the test data
text = pd.read_csv(data_path)
# Inspecting the column names
list(text.columns)
Out[9]:
['Article Title', 'Sentence', 'SectionTitle', 'Article Link']
In [10]:
# Inspecting the dtypes
text.dtypes
Out[10]:
Article Title    object
Sentence         object
SectionTitle     object
Article Link     object
dtype: object
In [11]:
# Extract the distinct article titles
text = list(text.groupby(['Article Title']).groups.keys())
len(text)
Out[11]:
692
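Note that `groupby(...).groups.keys()` is one way to deduplicate the titles; `Series.unique()` expresses the same intent more directly. A minimal sketch on a hypothetical mini-frame with the same layout as the test data:

```python
import pandas as pd

# Toy frame mirroring the test data's layout (one row per sentence,
# so article titles repeat); titles borrowed from Out[16] above
text_df = pd.DataFrame({
    "Article Title": ["Adult comics", "Adult comics", "CAN bus", "CAN bus", "CAN bus"],
    "Sentence": ["s1", "s2", "s3", "s4", "s5"],
})

# unique() drops duplicates while preserving first-appearance order
titles = list(text_df["Article Title"].unique())
```

On the real test data this yields the same 692 titles as the `groupby` approach.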
In [12]:
# Build a Concept -> [Score] lookup dictionary
dictionary = model.set_index('Concept').T.to_dict('list')
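The `set_index(...).T.to_dict('list')` idiom maps each concept to a one-element list of scores, which is why the lookup below reads `dictionary[j][0]`. A flat concept-to-float mapping built with `zip` is an equivalent, simpler alternative; a sketch on a two-row toy frame using values from the preview above:

```python
import pandas as pd

model = pd.DataFrame({
    "Concept": ["aachen", "aabey"],
    "Score": [0.171785, 0.300266],
})

# Transpose-based dict: concept -> [score] (one-element list)
list_dict = model.set_index("Concept").T.to_dict("list")

# Flat alternative: concept -> score (plain float)
flat_dict = dict(zip(model["Concept"], model["Score"]))
```

With the flat form, the prediction loop could use `dictionary[j]` directly instead of `dictionary[j][0]`.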

Prediction

In [13]:
# Load Packages
from nltk.tokenize import word_tokenize
In [14]:
# Helper to compute the average of a list
def Average(lst): 
    return sum(lst) / len(lst) 
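The standard library offers the same computation via `statistics.mean`; note that both raise on an empty list (`ZeroDivisionError` for the hand-rolled version, `StatisticsError` for the stdlib one), which is why the prediction function below only averages when at least one token matched:

```python
from statistics import mean

# Same helper as above, for comparison
def Average(lst):
    return sum(lst) / len(lst)

scores = [0.2, 0.4, 0.6]
avg_manual = Average(scores)
avg_stdlib = mean(scores)
```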
In [15]:
# Calculate the prediction score for each title in the test data
def predict_abstract(model, text):
    # Use a set of the model's concepts for O(1) membership checks
    word_tokens = set(model['Concept'])
    rows = []
    for i in text:
        title_tokens = word_tokenize(i)
        # Keep only the tokens that appear in the model
        filtered_sentence = [w for w in title_tokens if w in word_tokens]
        if len(filtered_sentence) > 0:
            scores = []
            for j in filtered_sentence:
                x = dictionary[j]
                scores.append(x[0])
            # Average the abstractness scores of the matched tokens
            score = Average(scores)
            rows.append({'Concept': i, 'prediction score': score})
    # Building the frame once from a list of rows avoids repeated DataFrame.append calls
    return pd.DataFrame(rows, columns=['Concept', 'prediction score'])
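The scoring logic can be sketched in isolation on a hypothetical mini-model. The sketch uses `str.split` in place of NLTK's `word_tokenize` so it has no external dependency, and the three concept scores are made up for illustration:

```python
import pandas as pd

# Hypothetical mini-model; real scores come from the dataset
model = pd.DataFrame({
    "Concept": ["ancient", "greek", "phonology"],
    "Score": [0.5, 0.4, 0.6],
})
concepts = dict(zip(model["Concept"], model["Score"]))

def score_title(title, concepts):
    """Average the scores of the title's tokens that appear in the model."""
    tokens = [t for t in title.split() if t in concepts]
    if not tokens:
        return None  # no matched tokens, no prediction
    return sum(concepts[t] for t in tokens) / len(tokens)

matched = score_title("ancient greek phonology", concepts)
unmatched = score_title("xyz", concepts)
```

Titles with no matched tokens are skipped, mirroring the `len(filtered_sentence) > 0` guard in `predict_abstract`.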
In [16]:
# Show the first 20 prediction scores
df = predict_abstract(model, text)
df.head(20)
Out[16]:
Concept prediction score
0 1920s in sociology 0.553010
1 1955 in the Vietnam War 0.423878
2 2009 NFL season 0.308886
3 A Trip to the Moon 0.423878
4 Account planning 0.542075
5 Adult comics 0.407683
6 Afghan presidential election, 2009 0.527030
7 Ancient Greek phonology 0.691283
8 Athabasca oil sands 0.377176
9 Automotive industry in the United States 0.423878
10 Balancing (international relations) 0.329495
11 Bentley–Ottmann algorithm 0.522952
12 Bra–ket notation 0.595207
13 Bride kidnapping 0.451548
14 British literature 0.574329
15 Brush mouse 0.336329
16 CAN bus 0.073388
17 Catholic peace traditions 0.733878
18 Censorship in the United Kingdom 0.423878
19 Child sexual abuse 0.715622
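To surface the most and least abstract titles rather than scanning the table, the predictions can be sorted by score. A sketch on a three-row toy frame using scores from the output above:

```python
import pandas as pd

# Toy prediction table; values taken from the output above
preds = pd.DataFrame({
    "Concept": ["CAN bus", "Catholic peace traditions", "2009 NFL season"],
    "prediction score": [0.073388, 0.733878, 0.308886],
})

# Sort descending so the most abstract titles appear first
ranked = preds.sort_values("prediction score", ascending=False).reset_index(drop=True)
most_abstract = ranked.loc[0, "Concept"]
least_abstract = ranked.loc[len(ranked) - 1, "Concept"]
```

Applied to the full result, `df.sort_values('prediction score', ascending=False).head(20)` would show the 20 most abstract titles.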

Explanation and interpretation of the data results

What does a high score mean, and what does a low score mean? For example, "Child sexual abuse" has a comparatively high score of 0.7156: the phrase is relatively abstract because it does not specify the location, details, or participants of an event. On the other hand, the phrase "2009 NFL season" denotes more concrete details.