IBM Debater® Mention Detection Benchmark

Format	ANN
License	CC-BY-SA 3.0
Domain	Natural Language Processing
Number of Records	3,000 sentences; 6,375 mentions in the Wikipedia sentences and 6,239 mentions in the spoken sentences.
Data Split	Train - 1,500 sentences and mentions \| Test - 1,500 sentences and mentions
Size	1.8MB
Author	Yosi Mass, Lili Kolterman
Data Origin	IBM Research
Dataset Version	Version 1 - January 25, 2018
Dataset Coverage	This dataset contains 3,000 sentences discussing different topics, taken from Wikipedia articles (1,000), cleansed manual transcriptions (1,000) and the output of an automated speech recognition engine (1,000). Some of the topics covered are in Debatabase.
Business Use Case	News & Entertainment: This dataset can be used to power content recommendation for news articles, shows, and so on.

Sentence	Annotation
when power is amongst the the cleaner sources of energy with the lowest environmental impact	when power\|\|\|http://dbpedia.org/resource/Wind_power\|\|\|0\|\|\|10 cleaner sources of energy\|\|\|http://dbpedia.org/resource/Renewable_energy\|\|\|30\|\|\|55 cleaner sources of energy\|\|\|http://dbpedia.org/resource/Sustainable_energy\|\|\|30\|\|\|55 environmental impact\|\|\|http://dbpedia.org/resource/Environmental_impact_assessment\|\|\|72\|\|\|92 environmental impact\|\|\|http://dbpedia.org/resource/Environmental_issue\|\|\|72\|\|\|92
Irish are the largest per capita consumers of video games [REF].	Irish\|\|\|http://dbpedia.org/resource/Irish_people\|\|\|0\|\|\|5 Irish\|\|\|http://dbpedia.org/resource/Ireland\|\|\|0\|\|\|5 per capita\|\|\|http://dbpedia.org/resource/Per_capita\|\|\|22\|\|\|32 consumers\|\|\|http://dbpedia.org/resource/Consumer\|\|\|33\|\|\|42 video games\|\|\|http://dbpedia.org/resource/Video_game\|\|\|46\|\|\|57
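
To illustrate how the annotation fields relate to the sentence text, the Python sketch below parses the mentions from the second sample row and checks that each start/end pair slices the term out of the sentence. The `Mention` type and the `parse_mention` helper are hypothetical names introduced here, not part of the dataset. Reading the offsets as 0-based character positions with an exclusive end is an inference from the sample values, not something the dataset states.

```python
from typing import NamedTuple


class Mention(NamedTuple):
    """One annotated mention (hypothetical container, not part of the dataset)."""
    term: str
    uri: str
    start: int
    end: int


def parse_mention(line: str) -> Mention:
    # Fields are separated by three pipe characters: term|||entity-uri|||start|||end
    term, uri, start, end = line.strip().split("|||")
    return Mention(term, uri, int(start), int(end))


# Sentence and annotations copied from the sample row above.
sentence = "Irish are the largest per capita consumers of video games [REF]."
ann_lines = [
    "Irish|||http://dbpedia.org/resource/Irish_people|||0|||5",
    "Irish|||http://dbpedia.org/resource/Ireland|||0|||5",
    "per capita|||http://dbpedia.org/resource/Per_capita|||22|||32",
    "consumers|||http://dbpedia.org/resource/Consumer|||33|||42",
    "video games|||http://dbpedia.org/resource/Video_game|||46|||57",
]

mentions = [parse_mention(line) for line in ann_lines]

for m in mentions:
    # Assumption: offsets are 0-based and the end offset is exclusive.
    span = sentence[m.start:m.end]
    print(f"{m.term!r} -> {m.uri} (span matches: {span == m.term})")
```

Note that the same span can appear more than once with different URIs (here "Irish" links to both Irish_people and Ireland), so a consumer of this data should probably not assume one entity per span.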

Directory	Description
Data Directory	The data directory contains 6 folders for the 3 datasets. Each dataset is split into two folders, dev and test. The directories are Wiki-dev, Wiki-test, Trans-dev, Trans-test, ASR-dev and ASR-test. Each directory has two subfolders: "paragraphs", with 500 sentences, and "labeled", with the labeled mentions on those sentences. Each file name has the format topicid_articleid_start_end, where - topicid: the id of the topic used for that sentence. The topic ids are listed separately (see below) - articleid: an internal id for the article from which the sentence was taken - start: the start offset of the sentence in the article - end: the end offset of the sentence in the article For example, 101_1169486036_0_266.txt for a sentence and 101_1169486036_0_266.ann for the annotations on that sentence. Each .ann file has a separate line for each mention, with the structure term\|\|\|entity-uri\|\|\|start\|\|\|end, where - term: the term in the corresponding sentence - entity-uri: the entity page in Wikipedia (given as a DBpedia resource URI in the examples) - start: the start offset of the term in the sentence - end: the end offset of the term in the sentence For example, history\|\|\|http://dbpedia.org/resource/History\|\|\|34\|\|\|41.
topics.csv	Contains the topics associated with the sentences. For example: 1. We should ban the sale of violent video games to minors; 21. The one-child policy brought more good than harm
attribution directory	Contains attribution files for the Wikipedia sentences, with pointers to the articles from which they were taken.
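
The directory layout described above could be walked with a short Python sketch like the following. The helper names `SentenceId`, `parse_stem` and `iter_pairs` are hypothetical. The code also makes several assumptions: that the file-name fields are underscore-separated (as in the example 101_1169486036_0_266.txt), that the .txt files sit under "paragraphs" and the .ann files under "labeled", and that the files are UTF-8 encoded.

```python
from pathlib import Path
from typing import Iterator, List, NamedTuple, Tuple


class SentenceId(NamedTuple):
    """Fields encoded in a file name (hypothetical container)."""
    topic_id: int
    article_id: str  # kept as a string, since it is only an internal id
    start: int       # start offset of the sentence in the article
    end: int         # end offset of the sentence in the article


def parse_stem(stem: str) -> SentenceId:
    # Assumption: fields are underscore-separated, e.g. "101_1169486036_0_266".
    topic_id, article_id, start, end = stem.split("_")
    return SentenceId(int(topic_id), article_id, int(start), int(end))


def iter_pairs(split_dir: Path) -> Iterator[Tuple[SentenceId, str, List[str]]]:
    """Yield (id, sentence text, raw annotation lines) for one folder such as Wiki-dev."""
    for txt_path in sorted((split_dir / "paragraphs").glob("*.txt")):
        ann_path = split_dir / "labeled" / (txt_path.stem + ".ann")
        sentence = txt_path.read_text(encoding="utf-8")  # assumed encoding
        ann_lines: List[str] = []
        if ann_path.exists():
            # One mention per line; skip blank lines.
            ann_lines = [
                line
                for line in ann_path.read_text(encoding="utf-8").splitlines()
                if line.strip()
            ]
        yield parse_stem(txt_path.stem), sentence, ann_lines
```

Under these assumptions, iterating over `iter_pairs(Path("Wiki-dev"))` would give each sentence with its annotation lines, which can then be split on the `|||` delimiter into term, entity URI, start and end.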