This notebook explores the Expert in the Loop AI – Polymer Discovery dataset. The dataset comprises cyclic lactones that have been adjudicated on their suitability as candidates for experimental investigation, which includes synthesis and characterization of the polymer materials. The adjudication reflects expert-level judgement about the practicality of synthesizing the lactone (expected low cost, short synthetic route, and high yield) and of using it as a monomer (no side reactions, reasonable reactivity under polymerization conditions).
The dataset is freely available for download from the IBM Developer Data Asset Exchange: Expert in the Loop AI – Polymer Discovery. This notebook is also available on Watson Studio: Expert in the Loop AI – Polymer Discovery-Notebook.
# Import the required Python packages
import matplotlib.pyplot as plt
import numpy as np
import requests
import re
import sys
import tarfile
import pandas as pd
from pandas import DataFrame as df
from pathlib import Path
Let's download the dataset from the Data Asset Exchange content delivery network and extract the tarball.
# Download the dataset
fname = 'dataset.tar.gz'
url = 'https://dax-cdn.cdn.appdomain.cloud/dax-expert-in-the-loop-ai-polymer-discovery/1.0.0/expert-in-the-loop-ai-polymer-discovery.tar.gz'
r = requests.get(url)
r.raise_for_status()  # Stop here if the download failed
Path(fname).write_bytes(r.content)
# Extract the dataset
with tarfile.open(fname) as tar:
    tar.extractall()
Let's read the data into a pandas DataFrame and look at some summary statistics.
data = pd.read_csv('expert-in-the-loop-ai-polymer-discovery/EITLAI_CyclicLactones_ROP_v1.csv', header=None)
# Preview the first few records
data.head()
The dataset comprises two columns. The first column contains the candidate cyclic lactone sequences. The second column contains the adjudication for each sequence, that is, whether it is a suitable candidate for further experimental investigation. As mentioned earlier, the adjudication reflects expert-level judgement about the practicality of synthesizing the lactone (expected low cost, short synthetic route, and high yield) and of using it as a monomer (no side reactions, reasonable reactivity under polymerization conditions). A sequence with an adjudication (label) of "accept" is better suited for further experimentation in the polymer synthesis process. Likewise, a sequence with an adjudication of "reject" is expected to perform poorly in this process.
data.info()
data[data[1]=='accept'].info()
data[data[1]=='reject'].info()
The info tables above show 118 "accepted" sequences and 125025 "rejected" sequences, so the two classes are heavily imbalanced: fewer than 0.1% of the sequences are accepted.
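The class counts can also be read off directly instead of from the `info()` tables. The following is a minimal sketch, assuming the labels sit in column `1` as in the cells above; the helper name `label_counts` and the toy rows are hypothetical and are not taken from the dataset.

```python
import pandas as pd


def label_counts(frame, label_col=1):
    """Return per-label row counts and the fraction of rows with each label."""
    counts = frame[label_col].value_counts()
    fractions = counts / counts.sum()
    return counts, fractions


# Toy stand-in for the real data (hypothetical rows, not from the dataset).
toy = pd.DataFrame({
    0: ["C1CCOC(=O)C1", "O=C1CCCO1", "O=C1CCCCCO1", "O=C1CCO1"],
    1: ["accept", "reject", "reject", "reject"],
})

counts, fractions = label_counts(toy)
print(counts)
print(fractions)
```

On the real DataFrame, `label_counts(data)` should give the same split as the two filtered `info()` calls, along with the accepted fraction that quantifies the imbalance.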
Let's visualize a subset of our data to learn a little more about its contents.
Let us look at the sequences and their lengths. While the sequences are chemical compounds and the length is not as important as the structural composition, we look at sequence lengths here to inform us of potential hyper-parameter choices in later downstream sequence modelling tasks (such as LSTM size/length). We will then look at the most common elements in sequences.
sequence_lengths = [len(x) for x in data[0]]
# Plot the raw sequence lengths in file order; they form a homogeneous mixture
plt.plot(sequence_lengths)
# Plot a histogram to see the distribution of sequence lengths
plt.hist(sequence_lengths)
# Mean and standard deviation of the above distribution.
# The average sequence is about 103 characters long
np.mean(sequence_lengths), np.std(sequence_lengths)
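Since the sequence lengths are meant to inform later modelling choices, a high percentile and the maximum are often more useful than the mean alone when picking a padding or truncation length. The sketch below is one way to collect them; the helper `length_summary`, the 95th-percentile default, and the toy strings are all assumptions made for illustration.

```python
import numpy as np


def length_summary(sequences, percentile=95):
    """Summarize string lengths: mean, std, max, and a chosen percentile."""
    lengths = np.array([len(s) for s in sequences])
    return {
        "mean": float(lengths.mean()),
        "std": float(lengths.std()),
        "max": int(lengths.max()),
        "percentile": float(np.percentile(lengths, percentile)),
    }


# Toy strings of length 2, 4 and 6 (hypothetical, not from the dataset).
toy_sequences = ["AB", "ABCD", "ABCDEF"]
summary = length_summary(toy_sequences)
print(summary)
```

On the real data this would be called as `length_summary(data[0])`. Padding to the chosen percentile rather than the maximum trades a small amount of truncation for shorter inputs.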
# Finally plotting a histogram of all the distinct elements in the sequences
elements = []
for x in data[0]:
    elements += list(x)
plt.hist(elements)
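To get exact counts rather than a histogram, the same per-character tally can be sketched with `collections.Counter`. Like the loop above, this counts individual characters, so digits, brackets, and bond symbols are tallied alongside letters, and any multi-character symbol would be split. The helper name `character_counts` and the toy strings are hypothetical.

```python
from collections import Counter


def character_counts(sequences):
    """Count how often each individual character occurs across all sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq)  # Updating with a string adds one count per character
    return counts


# Toy strings (hypothetical, not from the dataset).
toy_sequences = ["CCO", "C=O"]
counts = character_counts(toy_sequences)
print(counts.most_common())
```

On the real data, `character_counts(data[0])` avoids building one large list of characters, and the result can be drawn with `plt.bar(list(counts.keys()), list(counts.values()))`.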