This dataset contains two evaluation datasets, Wiki Benchmark and Contract Benchmark, for the Split and Rephrase task. The task takes a complex sentence (`complex`) and splits and rephrases it into a simpler version (`simple`). The complex sentences were collected from Wikipedia and legal documents, and each one was split and rewritten into several sentences by crowd workers. The two datasets contain around 800 complex sentences and more than 1,300 simplified rewrites in total (each complex sentence can have multiple simplified rewrites).
In this notebook, you will get to explore this dataset from IBM Developer Data Asset Exchange to understand it better.
import os

import numpy as np
import pandas as pd
from nltk.tokenize import sent_tokenize, word_tokenize

# nltk's tokenizers need the punkt model; uncomment to download it once
# import nltk; nltk.download('punkt')
DIRECTORY = "split-and-rephrase-data"
FILE_NAMES = ["cswiki-perfect.csv", "contracts-perfect.csv"]
cswiki_df, contracts_df = [pd.read_csv(os.path.join(DIRECTORY, file_name)) for file_name in FILE_NAMES]
cswiki_df.head()
| | complex | simple |
|---|---|---|
| 0 | Together with James, she compiled crosswords f... | Together with James, she compiled crosswords. ... |
| 1 | Mantell was born in Bridgwater, Somerset, and ... | Mantell was born in Bridgwater, Somerset. Mant... |
| 2 | Mantell was born in Bridgwater, Somerset, and ... | Mantell was born in Bridgwater, Somerset. He ... |
| 3 | Referee Toby Irwin seemed oblivious to what wa... | Referee Toby Irwin seemed oblivious to what wa... |
| 4 | Yazan Halwani is a street artist from Beirut, ... | Yazan Halwani is a street artist. Yazan is fro... |
contracts_df.head()
| | complex | simple |
|---|---|---|
| 0 | Except for Supplier's obligations and liabilit... | The following applies, not including the Suppl... |
| 1 | Except for Supplier's obligations and liabilit... | Supplier's liability for any and all claims wi... |
| 2 | Except for Supplier's obligations and liabilit... | Except for Supplier's obligations and liabilit... |
| 3 | Regardless of the basis on which a party seeki... | This is regardless of the basis on which a par... |
| 4 | BP further represents that all removed parts w... | BP further represents that all removed parts w... |
Each complex sentence can have different simple versions. So the number of rows in the dataset does not reflect how many original complex sentences there were. Let's count the number of original complex sentences.
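As a quick reminder of what `describe()` reports on text (object) columns, here is a toy frame (hypothetical data, not the benchmark files):

```python
import pandas as pd

# Hypothetical mini-dataframe: "A" appears twice, so unique < count
df = pd.DataFrame({"complex": ["A", "A", "B"],
                   "simple": ["a1", "a2", "b1"]})

desc = df.describe()
# For object columns, describe() reports count, unique, top, and freq
print(desc.loc["unique", "complex"])  # 2
print(desc.loc["count", "simple"])    # 3
```

On the real data, a `unique` value smaller than `count` in the "complex" column is exactly what tells us some complex texts were rewritten more than once.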
cswiki_df.describe()
| | complex | simple |
|---|---|---|
| count | 720 | 720 |
| unique | 403 | 720 |
| top | A particular concern is the treatment of Musli... | The second case in Mali was that of the nurse ... |
| freq | 3 | 1 |
From the output, you can see that the number of unique original complex texts is 403 while the total number of rows is 720. Next, let's see how many simple versions of each complex text there are.
Before we do that, the code below defines `display_plot()`, which we will use to graph our results throughout the notebook. By default it draws a bar plot, but you can pass `kind="hist"` for a histogram. In pandas you can call `.plot()` on a Series or DataFrame to create simple plots (matplotlib is used by default); in our case we will pass in a Series.
def display_plot(series_obj, title, xlabel, ylabel, kind="bar"):
    ax = series_obj.plot(kind=kind, title=title)
    # add a frequency label above each bar (skipped for histograms)
    if kind == "bar":
        for patch in ax.patches:
            ax.annotate(str(patch.get_height()), (patch.get_x(), patch.get_height()))
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel);
Okay, now we can start counting. Let's break down the code below for `frequency_of_number_of_rewrites`:

- `cswiki_df.complex.value_counts()` - counts each unique value in the "complex" column. For each of the 403 unique complex texts, this gives how many times it occurs in the whole dataset (i.e. how many "simple" versions there are).
- `cswiki_df.complex.value_counts().value_counts()` - counts how many complex texts have each number of "simple" versions. That is, this tells us how many different ways the complex texts were split/rephrased.
- `cswiki_df.complex.value_counts().value_counts().sort_index()` - sorts the result by its index, i.e. by the rewrite count from the previous step.

frequency_of_number_of_rewrites = cswiki_df.complex.value_counts().value_counts().sort_index()
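The double `value_counts()` is easier to see on a tiny hypothetical column (not the benchmark data):

```python
import pandas as pd

# Hypothetical "complex" column: A has 2 rewrites, B has 1, C has 3
complex_col = pd.Series(["A", "A", "B", "C", "C", "C"])

per_text = complex_col.value_counts()             # rewrites per unique text
per_count = per_text.value_counts().sort_index()  # texts per rewrite count

print(per_count.to_dict())  # {1: 1, 2: 1, 3: 1}
```

Here one text has a single rewrite, one has two, and one has three, which is exactly the shape of the distribution we are about to plot.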
display_plot(frequency_of_number_of_rewrites,
'Wiki Benchmark',
'Number of Simple Versions of Unique Complex Text',
'Frequency'
)
The plot above shows that 2 was the most common number of "simple" versions per unique complex text, and that no "complex" text was split/rewritten more than 3 times.
We can do the same for the Contracts dataset.
contracts_df.describe()
| | complex | simple |
|---|---|---|
| count | 659 | 659 |
| unique | 406 | 659 |
| top | This questionnaire must be completed by a vend... | In the event of such termination, Trustees sha... |
| freq | 3 | 1 |
From the output, you can see that the number of unique original complex texts is 406 while the total number of rows is 659.
frequency_of_number_of_rewrites2 = contracts_df.complex.value_counts().value_counts().sort_index()
display_plot(frequency_of_number_of_rewrites2,
'Contracts Benchmark',
'Number of Simple Versions of Unique Complex Text',
'Frequency'
)
For the Contracts dataset, most of the "complex" text had just one "simple" version.
Next, let's focus more on the Wikipedia dataset. However, you can replace `cswiki_df` with `contracts_df` and the code will still work.
Let's count the number of sentences in the complex text and the simplified text.
To do this, we `apply` a lambda function to the column (complex or simple). `lambda s: len(sent_tokenize(s))` means that given `s` (a string containing one or more sentences), it returns the number of sentences. The sentences are parsed using `nltk`'s sentence tokenizer, `sent_tokenize()`.
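Here is the same `apply`-then-`value_counts` pattern on a toy frame, using a simple regex split as a crude stand-in for `sent_tokenize` (which requires nltk's punkt model) so the sketch runs on its own:

```python
import re
import pandas as pd

# Hypothetical mini-dataframe standing in for cswiki_df
df = pd.DataFrame({"complex": [
    "She wrote. He read.",
    "It rained.",
    "One. Two. Three.",
]})

# naive splitter: break on terminal punctuation followed by whitespace
sentence_counts = df.complex.apply(
    lambda s: len(re.split(r"(?<=[.!?])\s+", s.strip()))
).value_counts().sort_index()

print(sentence_counts.to_dict())  # {1: 1, 2: 1, 3: 1}
```

The resulting Series (number of sentences -> frequency) is exactly what we pass to `display_plot()` below.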
sentences_complex = cswiki_df.complex.apply(lambda s: len(sent_tokenize(s))).value_counts().sort_index()
display_plot(sentences_complex,
'Complex Sentences (Wikipedia Benchmark)',
'Number of Sentences',
'Frequency'
)
sentences_simple = cswiki_df.simple.apply(lambda s: len(sent_tokenize(s))).value_counts().sort_index()
display_plot(sentences_simple,
'Simple Sentences (Wikipedia Benchmark)',
'Number of Sentences',
'Frequency'
)
From the above outputs, you will notice that the complex versions have fewer sentences (ranging from 1 to 3) while the simple versions have more (1 to 8). Furthermore, the complex text is usually a single sentence while the simplified versions most often contain three. This is rather interesting; let's look further into the sentences by exploring the number of words (or tokens).
We saw that the number of sentences in the simple versions tended to be greater than in the original complex version. This probably means that each sentence in the simple version has fewer words. We can check this by finding the average number of tokens (words) per sentence.
First, we write a function (`average_tokens_per_sentence()`) that finds the average number of words in a string of sentences. We use functions from `nltk` such as `sent_tokenize()` (sentence tokenizer) and `word_tokenize()` (word tokenizer). E.g.

- `sent_tokenize("This is sentence one. This is sentence two.")` => `["This is sentence one.", "This is sentence two."]`
- `word_tokenize("This is sentence one.".strip('.'))` => `['This', 'is', 'sentence', 'one']`

An example input to `average_tokens_per_sentence()` is `"This is sentence one. This is sentence two."`; the output would be 4.0, since each sentence contains four words.
def average_tokens_per_sentence(sentences_string):
    """
    Given a string of sentences, return the mean number of words per sentence.
    """
    # split the string into a list of sentences
    sentences = sent_tokenize(sentences_string)
    tokens_per_sentence = []
    # loop through the list of sentences
    for sentence in sentences:
        # tokenize each sentence to get each word, excluding '.'
        tokens = word_tokenize(sentence.strip('.'))
        # save the number of words in the sentence
        tokens_per_sentence.append(len(tokens))
    # average the number of words
    return np.mean(tokens_per_sentence)
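As a sanity check on the averaging logic, a stripped-down version with a regex sentence split and whitespace tokenization in place of nltk (so it runs without the punkt download) reproduces the example above:

```python
import re

def avg_tokens_naive(text):
    # split into sentences on terminal punctuation followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # whitespace-tokenize each sentence after stripping the trailing period
    counts = [len(s.strip('.').split()) for s in sentences]
    return sum(counts) / len(counts)

print(avg_tokens_naive("This is sentence one. This is sentence two."))  # 4.0
```

For text with abbreviations, contractions, or other punctuation, nltk's tokenizers will of course give different (better) counts than this naive sketch.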
First, we look at the complex sentences.
avg_words_complex = cswiki_df.complex.apply(average_tokens_per_sentence)
display_plot(avg_words_complex,
'Complex Sentences (Wikipedia Benchmark)',
'Average Number of Words Per Sentence',
'Frequency',
kind="hist"
)
Next, we look at the simple sentences.
avg_words_simple = cswiki_df.simple.apply(average_tokens_per_sentence)
display_plot(avg_words_simple,
 'Simple Sentences (Wikipedia Benchmark)',
 'Average Number of Words Per Sentence',
 'Frequency',
 kind="hist"
)
From the above outputs, we do indeed see that the complex versions have more words per sentence (the peak is around 25) while the simple versions have fewer (the peak is around 10).
Now that you have a better understanding of the data, you can use these datasets to explore natural language processing models.