This dataset contains 3000 sentences discussing different topics, taken from Wikipedia articles (1000), cleansed manual transcriptions (1000), and the output of an automated speech recognition engine (1000). Some of the topics covered are in Debatabase.
Business Use Case
News & Entertainment
This dataset can be used to power content recommendations such as news articles, shows, and so on.
Sentence: when power is amongst the the cleaner sources of energy with the lowest environmental impact
Annotation:
when power|||http://dbpedia.org/resource/Wind_power|||0|||10
cleaner sources of energy|||http://dbpedia.org/resource/Renewable_energy|||30|||55
cleaner sources of energy|||http://dbpedia.org/resource/Sustainable_energy|||30|||55
environmental impact|||http://dbpedia.org/resource/Environmental_impact_assessment|||72|||92
environmental impact|||http://dbpedia.org/resource/Environmental_issue|||72|||92

Sentence: Irish are the largest per capita consumers of video games [REF].
Annotation:
Irish|||http://dbpedia.org/resource/Irish_people|||0|||5
Irish|||http://dbpedia.org/resource/Ireland|||0|||5
per capita|||http://dbpedia.org/resource/Per_capita|||22|||32
consumers|||http://dbpedia.org/resource/Consumer|||33|||42
video games|||http://dbpedia.org/resource/Video_game|||46|||57
Directory
Description
Data Directory
The data directory contains 6 folders for the 3 datasets. Each dataset is split into two folders, dev and test. The directories are
Each directory has two subfolders: "paragraphs", with 500 sentences, and "labeled", with the labeled mentions on them. Each file name has the format topicid_articleid_start_end, where
- topicid: the id of the topic used for that sentence. The topicids appear in a separate directory (see below)
- articleid: an internal id for the article from which the sentence was taken
- start: the start offset of the sentence in the article
- end: the end offset of the sentence in the article
For example, 101_1169486036_0_266.txt holds a sentence and 101_1169486036_0_266.ann holds the annotations on that sentence.
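The naming scheme above can be unpacked with a few lines of Python. This is only a minimal sketch: it assumes the four fields are joined by underscores, as in the example file name, and the names SentenceFileName and parse_file_name are illustrative helpers, not part of the dataset.

```python
from pathlib import Path
from typing import NamedTuple


class SentenceFileName(NamedTuple):
    """Fields encoded in a file name (hypothetical helper type)."""
    topic_id: str
    article_id: str
    start: int  # start offset of the sentence in the article
    end: int    # end offset of the sentence in the article


def parse_file_name(name: str) -> SentenceFileName:
    """Split a name such as '101_1169486036_0_266.txt' into its four fields.

    Assumes underscore-separated fields, as in the example file name.
    """
    stem = Path(name).stem  # drop the .txt / .ann extension
    topic_id, article_id, start, end = stem.split("_")
    return SentenceFileName(topic_id, article_id, int(start), int(end))


info = parse_file_name("101_1169486036_0_266.txt")
print(info.topic_id, info.article_id, info.start, info.end)
```

Because the .txt and .ann files share the same stem, the same function can be used to pair a sentence with its annotations.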
The format of the .ann files is a separate line for each mention, with the structure term|||entity-uri|||start|||end, where
- term: the term in the corresponding sentence
- entity-uri: the entity page in Wikipedia, given as a DBpedia resource URI
- start: the start offset of the term in the sentence
- end: the end offset of the term in the sentence
For example, history|||http://dbpedia.org/resource/History|||34|||41.
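As an illustration of the .ann structure, the sketch below parses annotation lines and checks the offsets against a sentence. It assumes the end offset is exclusive, which is what the sample annotations suggest ("history" spans 34 to 41, seven characters); Mention, parse_ann_line, and parse_ann are hypothetical names introduced only for this example.

```python
from typing import List, NamedTuple


class Mention(NamedTuple):
    """One labeled mention from a .ann file (hypothetical helper type)."""
    term: str
    entity_uri: str
    start: int
    end: int  # assumed exclusive, based on the sample annotations


def parse_ann_line(line: str) -> Mention:
    """Parse one 'term|||entity-uri|||start|||end' line."""
    term, entity_uri, start, end = line.rstrip("\n").split("|||")
    return Mention(term, entity_uri, int(start), int(end))


def parse_ann(text: str) -> List[Mention]:
    """Parse the full contents of a .ann file, skipping blank lines."""
    return [parse_ann_line(line) for line in text.splitlines() if line.strip()]


# Sample sentence and two of its annotations, taken from this README.
sentence = "Irish are the largest per capita consumers of video games [REF]."
ann_text = (
    "Irish|||http://dbpedia.org/resource/Irish_people|||0|||5\n"
    "video games|||http://dbpedia.org/resource/Video_game|||46|||57\n"
)

mentions = parse_ann(ann_text)
for m in mentions:
    # Slicing the sentence with the offsets should recover the term.
    print(m.term, "->", sentence[m.start:m.end])
```

Note that the same span can appear on several lines with different entity URIs, so a consumer should not assume one entity per term.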
topics.csv
This contains the topic files that were associated with the sentences. For example,
1. We should ban the sale of violent video games to minors
21. The one-child policy brought more good than harm
attribution directory
Contains attribution files for the Wikipedia sentences, with pointers to the articles from which they were taken.