SimpleQuestions Relation Detection Notebook

This notebook explores the SimpleQuestions Relation Detection dataset, which is freely available for download from the IBM Developer Data Asset Exchange: SimpleQuestions Relation Detection Dataset. The notebook is also available on Watson Studio: SimpleQuestions Relation Detection Notebook.

The SimpleQuestions Relation Detection dataset is a set of relation extraction annotations derived from the SimpleQuestions dataset. Entries follow the order of the questions in SimpleQuestions, and each entry has the format gold_relations \t negative_relation_pool \t question. The relation ids refer to lines in a separate file, relation.2M.list, with ids starting at 1. The dataset is split into train, validation, and test sets that match the splits of the original SimpleQuestions data.
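To make the record format concrete, here is a minimal sketch that splits one tab-separated line into its three fields. The names `Record` and `parse_record` are hypothetical helpers, not part of the dataset or of any library, and the sample line is the first training row shown later in this notebook. Skipping non-numeric tokens in the id fields is a defensive assumption; we have not verified whether the files contain such tokens.

```python
from typing import List, NamedTuple


class Record(NamedTuple):
    """One line of a *.replace_ne.withpool file (hypothetical container)."""
    gold_relations: List[int]
    negative_relation_pool: List[int]
    question: str


def parse_ids(field: str) -> List[int]:
    # Ids are whitespace-separated integers. Non-numeric tokens are skipped,
    # which is a defensive assumption rather than a documented format rule.
    return [int(token) for token in field.split() if token.isdigit()]


def parse_record(line: str) -> Record:
    # Fields are separated by tabs: gold ids, negative pool ids, question text.
    gold, negatives, question = line.rstrip("\n").split("\t", 2)
    return Record(parse_ids(gold), parse_ids(negatives), question)


example = parse_record("67\t103 42\twhat is the book #head_entity# about\n")
print(example)
```

An empty negative pool simply yields an empty list, so downstream code does not need a special case for it.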

The relation extraction task is about identifying semantic relationships between entities in a text. A relationship generally connects two entities via some affiliation. Entities can be, for example, people, organizations, or locations, while the relationships among them can be spatial, social, or hierarchical. For instance, the entities “Steve Jobs” and “Apple” may be connected by the relation “Founder”. Relation extraction is important in the field of machine reading and provides a necessary input to more complex tasks such as question answering, conversational agents, and text summarization.

The original SimpleQuestions dataset was developed by Facebook and consists of 108,442 simple questions written by English-speaking human annotators. Each question is paired with an answer in the form of a fact consisting of a subject, a relationship, and an object. For more information, or for access to the original SimpleQuestions dataset, visit the dataset’s repository linked in the Related Links section below.

In [1]:
# Import the required Python packages
import requests
import hashlib
from pathlib import Path
import tarfile
import os
import pandas as pd

Download and Extract the Dataset

Let's download the dataset from the Data Asset Exchange Cloud Object Storage bucket and extract the tarball.

In [2]:
# Download the dataset
fname = 'simple-questions-relation-detection.tar.gz'
url = 'https://dax-cdn.cdn.appdomain.cloud/dax-simple-questions-relation-detection/1.0.0/' + fname
r = requests.get(url)
r.raise_for_status()  # fail early on an HTTP error instead of saving an error page
Path(fname).write_bytes(r.content)
Out[2]:
2164770
In [3]:
# Extract the dataset
with tarfile.open(fname) as tar:
    tar.extractall()
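As a more defensive variant of the extraction step, the sketch below uses the `filter="data"` option of `tarfile`, which refuses archive members that would be written outside the destination directory. The option exists only in newer Python releases, so the code checks for `tarfile.data_filter` and falls back to a plain `extractall()` otherwise. The function name `extract_archive` is hypothetical.

```python
import tarfile


def extract_archive(archive_path, dest="."):
    """Extract a tarball into dest, using the 'data' filter when available."""
    with tarfile.open(archive_path) as tar:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions: reject absolute paths, '..' components, etc.
            tar.extractall(path=dest, filter="data")
        else:
            # Older Python versions: same behavior as the cell above.
            tar.extractall(path=dest)
```

In this notebook it would be called as `extract_archive(fname)`, ideally after the checksum of the downloaded file has been verified.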
In [4]:
# Verify that the archive was downloaded intact by comparing SHA-512 checksums
sha512sum = 'ee589dbcfe19494c34323ab5777353d833cc6ce42412d6350f1a9da4abcfd39b4d1b61026d1059925852ffec46811a6e2fd6147ba590aea2fb4dbea40a5b87e8'
sha512sum_computed = hashlib.sha512(Path(fname).read_bytes()).hexdigest()
sha512sum == sha512sum_computed
Out[4]:
True
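The check above reads the whole archive into memory, which is fine for a file of about 2 MB. For larger files, a chunked variant along the following lines may be preferable. This is only a sketch: `sha512_of_file` and `verify_archive` are hypothetical helper names, and the 1 MiB chunk size is an arbitrary choice.

```python
import hashlib


def sha512_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-512 hex digest of a file without loading it all at once."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected_sha512):
    """Raise ValueError if the file's checksum differs from the expected one."""
    actual = sha512_of_file(path)
    if actual != expected_sha512:
        raise ValueError(f"Checksum mismatch for {path}: got {actual}")
```

Calling `verify_archive(fname, sha512sum)` raises an error on a mismatch, so a corrupted download cannot go unnoticed the way a bare `False` in a cell output might.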

Read Data into Python Objects

In [5]:
os.listdir('simple-questions-relation-detection')
Out[5]:
['test.replace_ne.withpool',
 'LICENSE.txt',
 'README.txt',
 'valid.replace_ne.withpool',
 'relation.2M.list',
 'train.replace_ne.withpool']
In [6]:
# Relations
relations_df = pd.read_csv('simple-questions-relation-detection/relation.2M.list', names=['relation'])
relations_df.index += 1  # relation ids begin at 1
relations_df.head()
Out[6]:
relation
1 /people/person/gender
2 /music/release/track
3 /soccer/football_player/position_s
4 /organization/organization_founder/organizatio...
5 /cvg/computer_videogame/cvg_genre
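The same 1-based mapping can be sketched without pandas, which makes the id convention explicit: the relation on line n of relation.2M.list has id n. The function name `index_relations` is hypothetical, and the three sample lines are the first three relations shown in the output above.

```python
from typing import Dict, Iterable


def index_relations(lines: Iterable[str]) -> Dict[int, str]:
    """Map 1-based line numbers to relation names."""
    # enumerate(..., start=1) mirrors `relations_df.index += 1` in the cell above.
    return {i: line.strip() for i, line in enumerate(lines, start=1)}


first_lines = [
    "/people/person/gender\n",
    "/music/release/track\n",
    "/soccer/football_player/position_s\n",
]
relation_by_id = index_relations(first_lines)
print(relation_by_id[1])
```

With the real file, the same function could be applied to an open file handle, for example `index_relations(open('simple-questions-relation-detection/relation.2M.list'))`.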
In [7]:
# SimpleQuestions
valid_df = pd.read_csv('simple-questions-relation-detection/valid.replace_ne.withpool', 
                       sep='\t', 
                       names=['gold_relations', 'negative_relation_pool', 'question'])

test_df = pd.read_csv('simple-questions-relation-detection/test.replace_ne.withpool', 
                       sep='\t', 
                       names=['gold_relations', 'negative_relation_pool', 'question'])

train_df = pd.read_csv('simple-questions-relation-detection/train.replace_ne.withpool', 
                       sep='\t', 
                       names=['gold_relations', 'negative_relation_pool', 'question'])
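The three calls above differ only in the file name, so they can be folded into a small helper. The sketch below also passes `quoting=csv.QUOTE_NONE`, a defensive assumption in case a question contains a double-quote character, which pandas would otherwise treat as the start of a quoted field. We have not checked whether the files actually contain such characters. `load_split` and `COLUMNS` are hypothetical names, and the in-memory sample reuses rows from the training set.

```python
import csv
import io

import pandas as pd

COLUMNS = ['gold_relations', 'negative_relation_pool', 'question']


def load_split(source):
    """Read one split from a path or file-like object into a DataFrame."""
    return pd.read_csv(source, sep='\t', names=COLUMNS, quoting=csv.QUOTE_NONE)


# In-memory stand-in for a split file, so the sketch runs without the download.
sample = io.StringIO(
    "67\t103 42\twhat is the book #head_entity# about\n"
    "66\t9 15 21\twhat country was the film #head_entity# from\n"
    "368\t31 138\twho produced #head_entity# ?\n"
)
sample_df = load_split(sample)
print(sample_df.shape)
```

With the extracted files in place, each split could be loaded as `load_split('simple-questions-relation-detection/train.replace_ne.withpool')`, and likewise for the `valid` and `test` files.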
In [8]:
train_df.head()
Out[8]:
   gold_relations  negative_relation_pool                             question
0  67              103 42                                             what is the book #head_entity# about
1  46              22                                                 to what release does the release track #head_e...
2  66              9 15 21                                            what country was the film #head_entity# from
3  547             54 185 62 38 561 249 188 189 549 1 226 6 338 9...  what songs have #head_entity# produced ?
4  368             31 138                                             who produced #head_entity# ?
In [63]:
# Grab a random row from the training set and print its relations
gold_id, neg_ids, question = train_df.sample(1).values[0]
gold_rel = relations_df.loc[int(gold_id), 'relation']
# Keep only numeric ids so that an empty or non-numeric pool does not raise
neg_rels = [relations_df.loc[int(x), 'relation'] for x in str(neg_ids).split() if x.isdigit()]

print(f"""
Question: {question}
Gold relation: {gold_rel}
Negative relations: {neg_rels}
""")
Question: where was #head_entity# born at ?
Gold relation: /people/person/place_of_birth
Negative relations: ['/people/person/gender', '/people/person/profession', '/people/person/nationality']
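One plausible way to use a row like this for relation detection is to turn it into labeled candidates: the gold relation is the positive example and every relation in the negative pool is a negative example. The sketch below illustrates that framing only and is not a method prescribed by the dataset. `build_candidates` is a hypothetical helper, and the relation names are taken from the output above.

```python
from typing import List, Tuple


def build_candidates(gold: List[str], negatives: List[str]) -> List[Tuple[str, int]]:
    """Return (relation, label) pairs: label 1 for gold, 0 for negatives."""
    return [(rel, 1) for rel in gold] + [(rel, 0) for rel in negatives]


question = "where was #head_entity# born at ?"
candidates = build_candidates(
    ["/people/person/place_of_birth"],
    [
        "/people/person/gender",
        "/people/person/profession",
        "/people/person/nationality",
    ],
)
for relation, label in candidates:
    print(label, relation)
```

A model for this task would then score each candidate relation against the question and be evaluated on whether the gold relation ranks first.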