Split and Rephrase

This dataset contains two evaluation datasets, Wiki Benchmark and Contract Benchmark, for the Split and Rephrase task. The task takes complex sentences (complex) and splits and rephrases them into a simpler version (simple). The complex sentences were collected from Wikipedia and legal documents, and each one was split and rewritten into several sentences by crowd workers. Together, the two benchmarks contain around 800 complex sentences and more than 1,300 simplified rewrites (each complex sentence can have multiple rewrites).

In this notebook, you will explore this dataset from the IBM Developer Data Asset Exchange to understand it better.

Loading the Data

Wiki Benchmark Dataset
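
A minimal loading sketch (the file name is a placeholder; point it at the downloaded benchmark file, which should contain "complex" and "simple" columns):

```python
import pandas as pd

# Placeholder path: use the location of the downloaded Wiki benchmark file.
cswiki_df = pd.read_csv("wiki-benchmark.csv")
cswiki_df.head()
```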

Contracts Benchmark Dataset
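
The Contracts benchmark can be loaded the same way (again, the file name is a placeholder):

```python
# Placeholder path: use the location of the downloaded Contracts benchmark file.
contracts_df = pd.read_csv("contract-benchmark.csv")
contracts_df.head()
```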

Number of Unique Complex Sentences

Each complex sentence can have several simple versions, so the number of rows in the dataset does not reflect how many original complex sentences there were. Let's count the number of unique complex sentences.
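
One way to get both numbers (a sketch; the notebook's actual cell may differ):

```python
# Distinct complex sentences vs. total rows in the Wiki benchmark.
print("Unique complex texts:", cswiki_df.complex.nunique())
print("Total rows:", len(cswiki_df))
```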

From the output, you can see that there are 403 unique complex texts but 720 rows in total. Next, let's see how many simple versions of each complex text there are.

Before we do that, we define a helper to graph our results, which we will use throughout the notebook. display_plot() draws a bar plot by default, but you can set kind="hist" for a histogram. In pandas, you can call .plot() on a Series or DataFrame to create simple plots (matplotlib is used by default); in our case we will be passing in a Series.
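
The helper itself is not reproduced in this text, so here is a minimal sketch of what a display_plot() along those lines could look like:

```python
import matplotlib.pyplot as plt

def display_plot(series, kind="bar", title="", xlabel=None, ylabel=None):
    """Plot a pandas Series: a bar plot by default, or a histogram with kind="hist"."""
    ax = series.plot(kind=kind, title=title)
    if xlabel:
        ax.set_xlabel(xlabel)
    if ylabel:
        ax.set_ylabel(ylabel)
    plt.show()
```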

Okay, now we can start counting. Let's break down the code below for frequency_of_number_of_rewrites:

  1. cswiki_df.complex.value_counts() - counts how many rows share each unique value in the "complex" column. Since there are 403 unique complex texts, this tells us, for each of them, how many times it occurs in the dataset (i.e. how many "simple" versions it has).
  2. cswiki_df.complex.value_counts().value_counts() - counts how many complex texts have each number of "simple" versions. That is, this tells us how often a complex text was split/rephrased in one way, two ways, and so on.
  3. cswiki_df.complex.value_counts().value_counts().sort_index() - sorts the result of part 2 by its index, i.e. by the number of "simple" versions
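
Putting the three steps together (a sketch of the cell described above):

```python
# For each unique complex text, count its rows (its number of simple versions),
# then count how many complex texts have each number of versions, sorted by that number.
frequency_of_number_of_rewrites = (
    cswiki_df.complex.value_counts()   # simple versions per complex text
    .value_counts()                    # how many complex texts have that many versions
    .sort_index()                      # sort by the number of versions
)
display_plot(frequency_of_number_of_rewrites,
             xlabel="Number of simple versions",
             ylabel="Number of complex texts")
```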

The plot above shows that, most commonly, a unique complex text was split/rewritten in two different ways, and that no "complex" text has more than three "simple" versions.

We can do the same for the Contracts dataset.
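
The same chain, applied to contracts_df (the variable name below is just illustrative):

```python
contracts_rewrites = contracts_df.complex.value_counts().value_counts().sort_index()
display_plot(contracts_rewrites,
             xlabel="Number of simple versions",
             ylabel="Number of complex texts")
```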

From the output, you can see that there are 406 unique complex texts but 659 rows in total.

For the Contracts dataset, most of the "complex" texts had just one "simple" version.

Count Number of Sentences

Next, let's focus more on the Wikipedia dataset. However, you can replace cswiki_df with contracts_df and it will still work.

Let's count the number of sentences in the complex text and in the simplified text. To do this, we apply a lambda function to the column (complex or simple). So lambda s: len(sent_tokenize(s)) means that, given s (a string of sentences), it returns the number of sentences in that string. The sentences are split using nltk's sentence tokenizer, sent_tokenize().

Complex

Simple
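
Those two cells are not reproduced here, but a minimal sketch of what they compute looks like this (nltk.download("punkt") may be needed the first time):

```python
from nltk.tokenize import sent_tokenize

# Number of sentences in each complex text and in each simple rewrite.
complex_sentence_counts = cswiki_df.complex.apply(lambda s: len(sent_tokenize(s)))
simple_sentence_counts = cswiki_df.simple.apply(lambda s: len(sent_tokenize(s)))

# Distribution of sentence counts for each column.
display_plot(complex_sentence_counts.value_counts().sort_index(),
             xlabel="Sentences per complex text", ylabel="Count")
display_plot(simple_sentence_counts.value_counts().sort_index(),
             xlabel="Sentences per simple text", ylabel="Count")
```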

From the above outputs, you will notice that the complex versions have fewer sentences (ranging from 1 to 3) while the simple versions have more (1 to 8). Furthermore, a complex text is usually a single sentence, while the simplified versions most often contain three. This is rather interesting; let's look more closely at the sentences by exploring the number of words (or tokens).

Count Average Number of Tokens Per Sentence

We saw that the simple versions tend to have more sentences than the original complex version. This probably means that each sentence in a simple version has fewer words. We can check this by finding the average number of tokens (words) per sentence.

First, we write a function, average_tokens_per_sentence(), that finds the average number of words in a string of sentences. We use nltk's sent_tokenize() (sentence tokenizer) and word_tokenize() (word tokenizer).

For example, if the input to average_tokens_per_sentence() is "This is sentence one. This is sentence two.", the output would be 4.
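
A sketch of such a helper is below. To match the worked example (4 tokens per sentence rather than 5), punctuation tokens are excluded here; the notebook's actual implementation may count tokens slightly differently:

```python
import string
from nltk.tokenize import sent_tokenize, word_tokenize

def average_tokens_per_sentence(text):
    """Average number of word tokens per sentence in a string of sentences."""
    sentences = sent_tokenize(text)
    # Drop punctuation-only tokens (an assumption made to match the example above).
    counts = [len([t for t in word_tokenize(s) if t not in string.punctuation])
              for s in sentences]
    return sum(counts) / len(counts)

average_tokens_per_sentence("This is sentence one. This is sentence two.")  # -> 4.0
```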

Let's apply it to the complex sentences first.

Next, we look at the simple sentences.
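
A sketch of those two cells, plotted as histograms with the helper from earlier:

```python
# Average tokens per sentence for the complex and simple columns.
complex_avg_tokens = cswiki_df.complex.apply(average_tokens_per_sentence)
simple_avg_tokens = cswiki_df.simple.apply(average_tokens_per_sentence)

display_plot(complex_avg_tokens, kind="hist",
             xlabel="Average tokens per sentence (complex)")
display_plot(simple_avg_tokens, kind="hist",
             xlabel="Average tokens per sentence (simple)")
```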

From the above outputs, we do indeed see that the complex version has more words per sentence (the peak is around 25) while the simple versions have fewer (the peak is around 10).

Now that you have a better understanding of the data, you can use these datasets to explore natural language processing models.