IBM Debater® Mention Detection Benchmark

Data Source: https://developer.ibm.com/exchanges/data/all/mention-detection-benchmark/

The goal of Mention Detection is to map entities/concepts mentioned in text to the correct concept in a knowledge base. The dataset contains 3000 sentences that are annotated with Mentions.

Dataset Overview:

The 3000 sentences are divided as follows.

– 1000 sentences taken from Wikipedia articles that discuss various topics, such as those in Debatabase (http://idebate.org/debatabase)

– 1000 sentences taken from professional speakers discussing some of those topics. Each of these sentences appears in two forms (resulting in 2000 sentences): the output of an Automatic Speech Recognition (ASR) engine, and a cleaned manual transcription of the same speech.

In this notebook, we explore the following datasets: Wiki-dev, Trans-dev and ASR-dev

Function to read dev files
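The reader function could look roughly like the sketch below. The on-disk layout is an assumption, not something this notebook states: each sentence is assumed to live in its own `<id>.txt` file with a matching `<id>.ann` file next to it, and the name `read_dev_files` is hypothetical. Adjust the glob pattern and pairing logic to match the actual download.

```python
from pathlib import Path


def read_dev_files(directory):
    """Return {sentence_id: (sentence_text, annotation_lines)}.

    Assumes (hypothetically) one <id>.txt file per sentence and an
    optional <id>.ann file with one annotation per line.
    """
    data = {}
    for txt_path in sorted(Path(directory).glob("*.txt")):
        sentence = txt_path.read_text(encoding="utf-8").strip()
        ann_path = txt_path.with_suffix(".ann")
        ann_lines = []
        if ann_path.exists():
            # Keep only non-empty annotation lines.
            ann_lines = [
                line
                for line in ann_path.read_text(encoding="utf-8").splitlines()
                if line.strip()
            ]
        data[txt_path.stem] = (sentence, ann_lines)
    return data
```

The same function can then be pointed at the Wiki-dev, Trans-dev and ASR-dev directories in turn.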

DATA EXPLORATION FOR WIKI DEV DATASET

Create a list containing the following fields:

  1. Sentence id
  2. Sentence
  3. Token
  4. Entity (from entity URL mentioned in .ann file)
  5. Token start
  6. Token end
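A minimal sketch of building these six fields for one sentence follows. The annotation line format is an assumption made for illustration: each line is taken to hold four tab-separated values (start offset, end offset, token, entity URL). The real .ann files may be laid out differently, and the helper names `entity_from_url` and `build_rows` are hypothetical.

```python
from urllib.parse import unquote


def entity_from_url(url):
    """Take the last path segment of the entity URL as the entity name."""
    return unquote(url.rstrip("/").rsplit("/", 1)[-1])


def build_rows(sentence_id, sentence, ann_lines):
    """Return one [id, sentence, token, entity, start, end] row per annotation.

    Assumes each annotation line is 'start<TAB>end<TAB>token<TAB>url'.
    """
    rows = []
    for line in ann_lines:
        start, end, token, url = line.split("\t")
        rows.append(
            [sentence_id, sentence, token, entity_from_url(url), int(start), int(end)]
        )
    return rows


example = build_rows(
    "s1",
    "Barack Obama spoke.",
    ["0\t12\tBarack Obama\thttps://en.wikipedia.org/wiki/Barack_Obama"],
)
```

Each row can be passed to a table constructor with the six field names as column labels.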

DATA EXPLORATION FOR TRANS DEV DATASET

DATA EXPLORATION FOR ASR DEV DATASET

CONCATENATE ALL DATAFRAMES
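If the three tables are pandas DataFrames, `pd.concat([...], ignore_index=True)` combines them in one call. The sketch below shows the same idea on plain row lists so that it runs without dependencies. The column names, the added `source` field and the function name `concatenate_tables` are assumptions for illustration, not part of the dataset.

```python
COLUMNS = ["sentence_id", "sentence", "token", "entity", "token_start", "token_end"]


def concatenate_tables(tables):
    """Merge {source_name: rows} into one list of dicts.

    Each output record gets an extra 'source' key so that rows from
    Wiki-dev, Trans-dev and ASR-dev stay distinguishable after merging.
    """
    combined = []
    for source, rows in tables.items():
        for row in rows:
            record = dict(zip(COLUMNS, row))
            record["source"] = source
            combined.append(record)
    return combined


combined = concatenate_tables(
    {
        "wiki_dev": [["w1", "Barack Obama spoke.", "Barack Obama", "Barack_Obama", 0, 12]],
        "trans_dev": [["t1", "we support nuclear power", "nuclear power", "Nuclear_power", 11, 24]],
        "asr_dev": [],
    }
)
```

Keeping the source label makes it possible to compare mention counts per subset after the merge.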