IBM Debater® Wikipedia Oriented Relatedness

Wikipedia is a widely used source of encyclopedic knowledge, with articles covering a broad variety of domains. This richness and popularity have strongly motivated NLP researchers to develop relatedness measures between Wikipedia concepts.

WORD (Wikipedia Oriented Relatedness Dataset) is a new type of concept-relatedness dataset, composed of 19,276 pairs of Wikipedia concepts. It is the first human-annotated dataset of Wikipedia concepts, and its purpose is twofold: it can serve as a benchmark for evaluating concept-relatedness methods, and it can be used as supervised data for developing new concept-relatedness prediction models. Compared with its term-relatedness counterparts, the dataset offers built-in disambiguation and a rich set of meaningful multiword terms. Based on this benchmark we developed a new tool, WORT (Wikipedia Oriented Relatedness Tool), for measuring the level of relatedness between pairs of concepts.

Load Dataset

In [1]:
import requests # External dependency: pip install requests
import tarfile

# Downloading the dataset
url_base = 'https://dax-cdn.cdn.appdomain.cloud/dax-wikipedia-oriented-relatedness'
version = '1.0.2'
fname = 'wikipedia-oriented-relatedness.tar.gz'
url = "{}/{}/{}".format(url_base, version, fname)
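r = requests.get(url)
if not r.ok:
    # Stop here so that a failed response is not written to disk as the archive
    raise RuntimeError("Failed to download {} (HTTP status {})".format(url, r.status_code))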
In [2]:
with open(fname, 'wb') as f:
    f.write(r.content)

# Extracting the dataset
with tarfile.open(fname) as f:
    f.extractall()
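
The extraction step above trusts the archive contents. As a minimal sketch, assuming one wants to check member paths before unpacking and to see which files were produced, the helper below refuses any member that would land outside the destination directory. The name extract_checked is hypothetical and not part of the dataset tooling, and the check covers path traversal only (it does not inspect symlinks).

```python
import os
import tarfile


def extract_checked(archive_path, dest="."):
    """Extract a tar archive, refusing members that would land outside dest.

    Hypothetical helper for illustration; returns the list of member names.
    """
    dest_root = os.path.realpath(dest)
    with tarfile.open(archive_path) as tar:
        members = tar.getmembers()
        for member in members:
            # Resolve where this member would be written on disk
            target = os.path.realpath(os.path.join(dest_root, member.name))
            if os.path.commonpath([dest_root, target]) != dest_root:
                raise ValueError("Unsafe path in archive: {}".format(member.name))
        tar.extractall(dest_root, members=members)
    return [m.name for m in members]
```

Calling extract_checked(fname) in place of the extractall() call would also return the member names, which makes it easy to confirm that WORD.csv is among the extracted files before reading it.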

Read Dataset

In [3]:
import os
import pandas as pd # External dependency: pip install pandas

data_path = "WORD.csv"
if not os.access(data_path, os.R_OK):
    print("Failed to read the target file: {}".format(data_path))
In [4]:
data = pd.read_csv(data_path, encoding="ISO-8859-1")
data.head(20)
Out[4]:
   | source article URI | concept 1 | concept 2 | score | concept1 URI | concept2 URI | Train/Test
0 | https://en.wikipedia.org/wiki/Video_game_contr... | The One (magazine) | Television program | 0.100000 | https://en.wikipedia.org/wiki/The_One_(magazine) | https://en.wikipedia.org/wiki/Television_program | Train
1 | https://en.wikipedia.org/wiki/Video_game_contr... | Japanese language | Music | 0.000000 | https://en.wikipedia.org/wiki/Japanese_language | https://en.wikipedia.org/wiki/Music | Train
2 | https://en.wikipedia.org/wiki/Video_game_contr... | Level (video gaming) | Guinness World Records | 1.000000 | https://en.wikipedia.org/wiki/Level_(video_gam... | https://en.wikipedia.org/wiki/Guinness_World_R... | Train
3 | https://en.wikipedia.org/wiki/Video_game_contr... | Entertainment | Murder | 0.000000 | https://en.wikipedia.org/wiki/Entertainment | https://en.wikipedia.org/wiki/Murder | Train
4 | https://en.wikipedia.org/wiki/Video_game_contr... | Nintendo 64 | Sony Interactive Entertainment | 0.500000 | https://en.wikipedia.org/wiki/Nintendo_64 | https://en.wikipedia.org/wiki/Sony_Interactive... | Train
5 | https://en.wikipedia.org/wiki/Video_game_contr... | Survival horror | Commodore 64 | 0.000000 | https://en.wikipedia.org/wiki/Survival_horror | https://en.wikipedia.org/wiki/Commodore_64 | Train
6 | https://en.wikipedia.org/wiki/Video_game_contr... | Golden age (metaphor) | Live action | 0.000000 | https://en.wikipedia.org/wiki/Golden_age_(meta... | https://en.wikipedia.org/wiki/Live_action | Train
7 | https://en.wikipedia.org/wiki/Video_game_contr... | Arnold Schwarzenegger | Filmmaking | 0.636364 | https://en.wikipedia.org/wiki/Arnold_Schwarzen... | https://en.wikipedia.org/wiki/Filmmaking | Train
8 | https://en.wikipedia.org/wiki/Video_game_contr... | Substance dependence | Filmmaking | 0.000000 | https://en.wikipedia.org/wiki/Substance_depend... | https://en.wikipedia.org/wiki/Filmmaking | Train
9 | https://en.wikipedia.org/wiki/Video_game_contr... | Video game development | Filmmaking | 0.100000 | https://en.wikipedia.org/wiki/Video_game_devel... | https://en.wikipedia.org/wiki/Filmmaking | Train
10 | https://en.wikipedia.org/wiki/Video_game_contr... | CONFIG.SYS | British Board of Film Classification | 0.000000 | https://en.wikipedia.org/wiki/CONFIG.SYS | https://en.wikipedia.org/wiki/British_Board_of... | Train
11 | https://en.wikipedia.org/wiki/Video_game_contr... | Final Fantasy | European Space Agency | 0.000000 | https://en.wikipedia.org/wiki/Final_Fantasy | https://en.wikipedia.org/wiki/European_Space_A... | Train
12 | https://en.wikipedia.org/wiki/Video_game_contr... | Light gun | Voice acting | 0.200000 | https://en.wikipedia.org/wiki/Light_gun | https://en.wikipedia.org/wiki/Voice_acting | Train
13 | https://en.wikipedia.org/wiki/Video_game_contr... | Personal computer | Mass media | 0.181818 | https://en.wikipedia.org/wiki/Personal_computer | https://en.wikipedia.org/wiki/Mass_media | Train
14 | https://en.wikipedia.org/wiki/Video_game_contr... | 1990s in video gaming | Racing video game | 1.000000 | https://en.wikipedia.org/wiki/1990s_in_video_g... | https://en.wikipedia.org/wiki/Racing_video_game | Train
15 | https://en.wikipedia.org/wiki/Video_game_contr... | Research | Mental disorder | 0.000000 | https://en.wikipedia.org/wiki/Research | https://en.wikipedia.org/wiki/Mental_disorder | Train
16 | https://en.wikipedia.org/wiki/Video_game_contr... | Writing style | Domestic violence | 0.000000 | https://en.wikipedia.org/wiki/Writing_style | https://en.wikipedia.org/wiki/Domestic_violence | Train
17 | https://en.wikipedia.org/wiki/Video_game_contr... | Wizardry | Computer network | 0.100000 | https://en.wikipedia.org/wiki/Wizardry | https://en.wikipedia.org/wiki/Computer_network | Train
18 | https://en.wikipedia.org/wiki/Video_game_contr... | Goro (Mortal Kombat) | Computer network | 0.090909 | https://en.wikipedia.org/wiki/Goro_(Mortal_Kom... | https://en.wikipedia.org/wiki/Computer_network | Train
19 | https://en.wikipedia.org/wiki/Video_game_contr... | School shooting | Demon | 0.000000 | https://en.wikipedia.org/wiki/School_shooting | https://en.wikipedia.org/wiki/Demon | Train
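
Since the dataset is meant to serve as a benchmark, one possible way to score a relatedness method against the human annotations is to correlate its predictions with the score column. The sketch below is an illustration under stated assumptions, not the evaluation protocol of the dataset authors: title_overlap is a deliberately naive toy baseline (it is not WORT), the helper names are hypothetical, and Spearman correlation is computed as Pearson correlation over average ranks. The four sample pairs are copied from the rows displayed above.

```python
from statistics import mean


def title_overlap(concept_a, concept_b):
    """Toy baseline (an assumption, not WORT): Jaccard overlap of title tokens."""
    a = set(concept_a.lower().split())
    b = set(concept_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def average_ranks(values):
    """Rank values starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; returns NaN if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman correlation as Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def evaluate(pairs, predict):
    """pairs: iterable of (concept 1, concept 2, human score) tuples."""
    pairs = list(pairs)
    predicted = [predict(c1, c2) for c1, c2, _ in pairs]
    human = [score for _, _, score in pairs]
    return {"pearson": pearson(predicted, human),
            "spearman": spearman(predicted, human)}


# Four pairs taken from the table above, for illustration only
sample = [
    ("Video game development", "Filmmaking", 0.1),
    ("1990s in video gaming", "Racing video game", 1.0),
    ("Japanese language", "Music", 0.0),
    ("Nintendo 64", "Sony Interactive Entertainment", 0.5),
]
result = evaluate(sample, title_overlap)
print(result)
```

On the full data, the same function could be fed rows from the DataFrame, for example data[["concept 1", "concept 2", "score"]].itertuples(index=False, name=None), restricted to the held-out rows of the Train/Test column. The label used for those rows is an assumption here, since only "Train" appears in the rows shown.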