Contracts Proposition Bank Dataset

This notebook explores the Contracts Proposition Bank (ConProp) dataset. ConProp consists of proposition-bank-style annotations of legal domain sentences extracted from former IBM annual financial reports. Each of the ~1,000 sentences is annotated with a layer of “universal” Semantic Role Labels covering part-of-speech tags, argument labeling, and predicate labeling.

Semantic Role Labeling (SRL) is a natural language processing task that structurally represents the meaning of a sentence. SRL assigns labels to the words in a sentence to capture the semantic relationship between a predicate and its arguments. By producing this relational representation, SRL helps identify the who, whom, what, where, and when in any particular sentence. SRL is particularly useful in areas such as question answering, machine translation, document summarization, and information extraction.
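To make this concrete, here is a hand-labeled sketch of the PropBank-style roles SRL assigns. The sentence and labels below are made up for illustration and are not taken from the dataset:

# Illustrative only: a hypothetical sentence with hand-assigned SRL roles
example_srl = {
    "sentence": "IBM provides a warranty for each Machine.",
    "predicate": "provides (provide.01)",
    "ARG0 (who)": "IBM",
    "ARG1 (what)": "a warranty",
    "ARG2 (for whom)": "for each Machine",
}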

The ConProp dataset provides over 1,000 legal domain sentences processed via SRL, making it well suited as training data for a deep neural network that performs SRL on unlabeled legal domain language. The dataset is open sourced by IBM Research and is freely available on the IBM Developer Data Asset Exchange: Contracts Proposition Bank Dataset. This notebook can be found on Watson Studio: Contracts Proposition Bank Notebook.

In [1]:
from IPython.display import clear_output
In [2]:
# Download & load required python packages
!pip install conllu
!pip install wordcloud
from IPython.display import Image
import requests
import tarfile
from os import path
import matplotlib.pyplot as plt
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
import nltk
nltk.download('all')
from wordcloud import WordCloud
clear_output() 

Download and Extract the Dataset

Let's download the dataset from the Data Asset Exchange Cloud Object Storage bucket and extract the tarball.

In [3]:
# Download the dataset
fname = 'contracts_proposition_bank.tar.gz'
url = 'https://dax-cdn.cdn.appdomain.cloud/dax-contracts-proposition-bank/1.0.2/' + fname
r = requests.get(url)
open(fname, 'wb').write(r.content)
Out[3]:
532453
In [4]:
# Extract the dataset
tar = tarfile.open(fname)
tar.extractall()
tar.close()
In [5]:
# Verify the file was extracted properly
data_path = "contracts_proposition_bank.conllx"
path.exists(data_path)
Out[5]:
True

Read Data into Python Object

Let's read the data into a Python object that we can use to learn more about SRL. Note that the conllu package allows custom parsing of the data to accommodate variations of the CoNLL format. You can find more information on the project's PyPI page: https://pypi.org/project/conllu/

In [6]:
# We will use the conllu package to import the CoNLL formatted data file
from conllu import parse
In [7]:
# Read CoNLL text data into the data variable
with open('contracts_proposition_bank.conllx', 'r') as file:
    data = file.read()
In [8]:
# Use conllu's parse method to tokenize the raw data. Because this is CoNLL-X format data, we customize the field names as explained below.
sentences = parse(data, fields=['id', 'form', 'lemma', 'cpostag', 'postag', 'feats', 'head', 'deprel', 'phead', 'pdeprel'])

More information about the CoNLL-X format can be found in this article: https://www.aclweb.org/anthology/W06-2920.pdf. Taken directly from the article, each of the fields represents the following (a short example of accessing these fields follows the citation below):

  1. ID: Token counter, starting at 1 for each new sentence.

  2. FORM: Word form or punctuation symbol. For the Arabic data only, FORM is a concatenation of the word in Arabic script and its transliteration in Latin script, separated by an underscore. This representation is meant to suit both those that do and those that do not read Arabic.

  3. LEMMA: Lemma or stem (depending on the particular treebank) of word form, or an underscore if not available. Like for the FORM, the values for Arabic are concatenations of two scripts.

  4. CPOSTAG: Coarse-grained part-of-speech tag, where the tagset depends on the treebank.

  5. POSTAG: Fine-grained part-of-speech tag, where the tagset depends on the treebank. It is identical to the CPOSTAG value if no POSTAG is available from the original treebank.

  6. FEATS: Unordered set of syntactic and/or morphological features (depending on the particular treebank), or an underscore if not available. Set members are separated by a vertical bar (|).

  7. HEAD: Head of the current token, which is either a value of ID, or zero (’0’) if the token links to the virtual root node of the sentence. Note that depending on the original treebank annotation, there may be multiple tokens with a HEAD value of zero.

  8. DEPREL: Dependency relation to the HEAD. The set of dependency relations depends on the particular treebank. The dependency relation of a token with HEAD=0 may be meaningful or simply ’ROOT’ (also depending on the treebank).

  9. PHEAD: Projective head of current token, which is either a value of ID or zero (’0’), or an underscore if not available. The dependency structure resulting from the PHEAD column is guaranteed to be projective (but is not available for all data sets), whereas the structure resulting from the HEAD column will be non-projective for some sentences of some languages (but is always available).

  10. PDEPREL: Dependency relation to the PHEAD, or an underscore if not available.

Citation: Buchholz, Sabine & Marsi, Erwin. CoNLL-X Shared Task on Multilingual Dependency Parsing. International Journal of Web Engineering and Technology - IJWET. 10.3115/1596276.1596305. (2006).
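As a minimal sketch of putting these fields to use (assuming, as in CoNLL-X, that token IDs are 1-based and line up with positions in the parsed sentence), we can walk one sentence's dependency edges via the HEAD and DEPREL columns:

# Minimal sketch: print each token's dependency edge using the HEAD and DEPREL
# fields described above (a HEAD of 0 marks the sentence's virtual root)
example = sentences[15]
for tok in example:
    head = tok['head']
    head_form = 'ROOT' if head == 0 else example[head - 1]['form']
    print(f"{tok['form']:>12} --{tok['deprel']}--> {head_form}")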

Visualize the Data

Let's visualize the parsed data to discover how it is structured and what we can do with it.

In [9]:
# sentences is a list of TokenList objects
type(sentences) 
Out[9]:
list
In [10]:
# Number of sentences we've imported
len(sentences)
Out[10]:
1536
In [11]:
# Example TokenList sentence
sentences[15]
Out[11]:
TokenList<Warranties, for, specific, Machines, are, specified, in, the, TD, .>
In [12]:
# Example TokenList sentence's OrderedDict fields
for token in sentences[15]:
    print("\n" + str(token))
OrderedDict([('id', 1), ('form', 'Warranties'), ('lemma', 'warranty'), ('cpostag', '_'), ('postag', 'NNS'), ('feats', None), ('head', 0), ('deprel', 'root'), ('phead', '_'), ('pdeprel', '_')])

OrderedDict([('id', 2), ('form', 'for'), ('lemma', 'for'), ('cpostag', '_'), ('postag', 'IN'), ('feats', None), ('head', 4), ('deprel', 'case'), ('phead', '_'), ('pdeprel', '_')])

OrderedDict([('id', 3), ('form', 'specific'), ('lemma', 'specific'), ('cpostag', '_'), ('postag', 'JJ'), ('feats', None), ('head', 4), ('deprel', 'amod'), ('phead', '_'), ('pdeprel', '_')])

OrderedDict([('id', 4), ('form', 'Machines'), ('lemma', 'machine'), ('cpostag', '_'), ('postag', 'NNS'), ('feats', None), ('head', 1), ('deprel', 'nmod'), ('phead', '_'), ('pdeprel', '_')])

OrderedDict([('id', 5), ('form', 'are'), ('lemma', 'be'), ('cpostag', '_'), ('postag', 'VBP'), ('feats', None), ('head', 6), ('deprel', 'auxpass'), ('phead', '_'), ('pdeprel', '_')])

OrderedDict([('id', 6), ('form', 'specified'), ('lemma', 'specify'), ('cpostag', '_'), ('postag', 'VBN'), ('feats', None), ('head', 1), ('deprel', 'acl'), ('phead', 'Y'), ('pdeprel', 'specify.01')])

OrderedDict([('id', 7), ('form', 'in'), ('lemma', 'in'), ('cpostag', '_'), ('postag', 'IN'), ('feats', None), ('head', 9), ('deprel', 'case'), ('phead', '_'), ('pdeprel', '_')])

OrderedDict([('id', 8), ('form', 'the'), ('lemma', 'the'), ('cpostag', '_'), ('postag', 'DT'), ('feats', None), ('head', 9), ('deprel', 'det'), ('phead', '_'), ('pdeprel', '_')])

OrderedDict([('id', 9), ('form', 'TD'), ('lemma', 'td'), ('cpostag', '_'), ('postag', 'NNP'), ('feats', None), ('head', 6), ('deprel', 'nmod'), ('phead', '_'), ('pdeprel', '_')])

OrderedDict([('id', 10), ('form', '.'), ('lemma', '.'), ('cpostag', '_'), ('postag', '.'), ('feats', None), ('head', 1), ('deprel', 'punct'), ('phead', '_'), ('pdeprel', '_')])
In [13]:
# The code was removed by Watson Studio for sharing.
In [14]:
# The code was removed by Watson Studio for sharing.
Dowloaded: ConPropBank - CoNLL Sentence Visualized.png from IBM COS to local: ConPropBank - CoNLL Sentence Visualized.png
Out[14]:
'ConPropBank - CoNLL Sentence Visualized.png'
In [15]:
# Using https://universaldependencies.org/conllu_viewer.html we can visualize CoNLL formatted annotations in a tree-like format
Image(filename='ConPropBank - CoNLL Sentence Visualized.png') 
Out[15]:
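The screenshot above was produced by pasting a sentence's CoNLL lines into that viewer. To reproduce it for another sentence, the parsed TokenList objects can be turned back into CoNLL text, assuming the installed conllu version exposes TokenList.serialize() (recent releases do):

# Print sentence 15 back out in CoNLL format, ready to paste into the viewer
print(sentences[15].serialize())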
In [16]:
# Let's count all the part-of-speech tags (POSTAG field) in our dataset

postag_count = {}

for snt in sentences:
    for wrd in snt:
        postag = wrd['postag']
        if postag in postag_count:
            postag_count[postag] += 1
        else:
            postag_count[postag] = 1

print(postag_count)
{'RB': 1235, 'VBP': 747, 'NNP': 1721, 'IN': 6425, 'NNS': 3340, 'MD': 865, 'VB': 1864, 'PRP': 753, 'CC': 2402, 'DT': 3977, 'NN': 9302, '(': 434, ',': 2475, 'VBN': 1446, ')': 599, 'WRB': 101, 'JJ': 2804, ':': 529, "''": 455, 'VBZ': 1166, 'TO': 606, 'QT': 820, 'WDT': 306, '.': 1373, 'PRP$': 482, 'JJR': 41, 'VBG': 599, 'VBD': 164, 'POS': 342, 'CD': 334, 'EX': 19, 'WP': 23, 'JJS': 24, '$': 16, 'UH': 4, 'RBR': 13, 'NNPS': 8, 'UNKNOWN': 8, 'RBS': 2, 'SYM': 1}
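As an aside, the same tally can be built more concisely with collections.Counter from the standard library; this is just an equivalent alternative to the loop above, not a change in the result:

# Equivalent one-liner; Counter is a dict subclass with the same key/value pairs
from collections import Counter
postag_count_alt = Counter(tok['postag'] for snt in sentences for tok in snt)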
In [17]:
# Now we can plot the most common part-of-speech tag occurrences

plt.figure(figsize=(15,10))
plt.bar(range(len(postag_count)), postag_count.values(), align='center')
plt.xticks(range(len(postag_count)), list(postag_count.keys()), rotation='vertical')

plt.show()

It looks like NN = singular noun (e.g. llama), IN = preposition (e.g. of, in, by), and DT = determiner (e.g. a, the) are the three most common part-of-speech tags in our dataset. For more info on part-of-speech tagging, check out this presentation by the Stanford NLP group.
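For a quick, illustrative look at these tags, we can run NLTK's off-the-shelf tagger on a made-up sentence. Note that it uses the Penn Treebank tagset, so its output will not necessarily match this dataset's annotations:

# Tag a hypothetical sentence with NLTK's default POS tagger (downloaded above via nltk.download)
nltk.pos_tag(word_tokenize("The warranty period for each Machine is specified in the TD."))
# e.g. [('The', 'DT'), ('warranty', 'NN'), ('period', 'NN'), ('for', 'IN'), ...]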

In [18]:
# Let's create a word cloud of all of the words included in this dataset. First we'll store all the available text in one string variable

text = ""

for snt in sentences:
    # str(TokenList) renders as "TokenList<w1, w2, ...>", so strip the
    # "TokenList<" prefix and trailing ">" to keep only the token text
    text += " " + str(snt)[10:-1]
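An equivalent and slightly cleaner way to build this string is to join each token's 'form' field directly; the result differs only in the separators, which the punctuation stripping in the next cell removes anyway:

# Alternative sketch: concatenate the surface form of every token
text = " ".join(tok['form'] for snt in sentences for tok in snt)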
In [19]:
# Next we remove casing, punctuation, special characters, and stop words, and lemmatize the words
my_new_text = re.sub('[^ a-zA-Z0-9]', '', text)
stop_words = set(stopwords.words('english'))
lemma = WordNetLemmatizer()
word_tokens = word_tokenize(my_new_text.lower())
filtered_sentence = [w for w in word_tokens if w not in stop_words]
normalized = " ".join(lemma.lemmatize(word) for word in filtered_sentence)
In [20]:
# Now we can create the word cloud

wordcloud = WordCloud(max_font_size=60).generate(normalized)
plt.figure(figsize=(16,12))

# plot the word cloud with matplotlib

plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
In [21]:
# Let's count the most frequent words

count = {}
for w in normalized.split():
    if w in count:
        count[w] += 1
    else:
        count[w] = 1

count100 = {}
for word, times in count.items():
    if times > 100:
        #print("%s was found %d times" % (word, times))
        count100[word] = times
In [22]:
# And finally we plot the most frequent words

plt.figure(figsize=(15,10))
plt.bar(range(len(count100)), count100.values(), align='center')
plt.xticks(range(len(count100)), list(count100.keys()), rotation='vertical')

plt.show()