Expert in the Loop AI – Polymer Discovery

This notebook explores the Expert in the Loop AI – Polymer Discovery dataset. The dataset comprises cyclic lactones adjudicated as suitable candidates for experimental investigation, where an experiment includes the synthesis and characterization of the polymer materials. The adjudication reflects expert-level judgement about the practicality of the lactone synthesis (an expectation of low cost, a short synthetic route, and high yield) and its utilization as a monomer (no side reactions, reasonable reactivity under polymerization conditions).

Expert in the Loop AI – Polymer Discovery is freely available for download on the IBM Developer Data Asset Exchange: Expert in the Loop AI – Polymer Discovery. This notebook can be found on Watson Studio: Expert in the Loop AI – Polymer Discovery-Notebook.

In [1]:
# Import required Python packages
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
import tarfile
from pathlib import Path

Download and Extract the Data

Let's download the dataset from the Data Asset Exchange content delivery network and extract the tarball.

In [2]:
# Download the dataset
fname = 'dataset.tar.gz'
url = 'https://dax-cdn.cdn.appdomain.cloud/dax-expert-in-the-loop-ai-polymer-discovery/1.0.0/expert-in-the-loop-ai-polymer-discovery.tar.gz'
r = requests.get(url)
Path(fname).write_bytes(r.content)
Out[2]:
1476205
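The one-shot `requests.get` above buffers the whole payload in memory and never checks the HTTP status code. For larger assets, a streamed, status-checked variant can be useful; the sketch below is our own addition (the helper name `download` is not part of the notebook):

```python
import requests

def download(url: str, fname: str, timeout: int = 60) -> int:
    """Stream a URL to disk in 1 MiB chunks; returns bytes written.
    Raises requests.HTTPError on a non-2xx response."""
    with requests.get(url, stream=True, timeout=timeout) as r:
        r.raise_for_status()
        written = 0
        with open(fname, "wb") as f:
            for chunk in r.iter_content(chunk_size=1 << 20):
                f.write(chunk)
                written += len(chunk)
    return written
```

`raise_for_status()` turns a 404 or 500 into an exception instead of silently writing an error page to disk.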
In [3]:
# Extract the dataset
with tarfile.open(fname) as tar:
    tar.extractall()
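One caveat worth noting: a plain `tar.extractall()` will happily follow `../` members in a malicious archive and write outside the target directory. A minimal guarded-extraction sketch (the helper name `safe_extract` is our own; newer Python versions also offer `extractall(..., filter="data")` for the same purpose):

```python
import tarfile
from pathlib import Path

def safe_extract(tar_path: str, dest: str = ".") -> None:
    """Extract a tarball, skipping any member whose resolved path
    would escape the destination directory (path-traversal guard)."""
    dest_dir = Path(dest).resolve()
    with tarfile.open(tar_path) as tar:
        safe = []
        for member in tar.getmembers():
            target = (dest_dir / member.name).resolve()
            if target == dest_dir or dest_dir in target.parents:
                safe.append(member)
        # Only members that stay inside dest_dir are extracted
        tar.extractall(dest_dir, members=safe)
```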

Read the Data

Let's read the data into a pandas DataFrame and look at some summary statistics.

In [4]:
data = pd.read_csv('expert-in-the-loop-ai-polymer-discovery/EITLAI_CyclicLactones_ROP_v1.csv', header=None)
In [5]:
# Preview a few records
data.head()
Out[5]:
0 1
0 NC(N)=NC(=O)CC1COC(=O)OC1 accept
1 O=COCC1COC(=O)OC1 accept
2 O=C1OCC(CC2CCCOC2=O)CO1 accept
3 C=C(C)C(=O)CC1COC(=O)OC1 accept
4 O=COCC(=O)CC1COC(=O)OC1 accept

The dataset comprises two columns. The first column contains the candidate cyclic lactone sequences (SMILES strings); the second contains the adjudication for each sequence. The sequences are adjudicated as suitable candidates for further experimental investigation. As mentioned earlier, the adjudication reflects expert-level judgement about the practicality of the lactone synthesis (an expectation of low cost, a short synthetic route, and high yield) and its utilization as a monomer (no side reactions, reasonable reactivity under polymerization conditions). A sequence labelled "accept" is better suited for further experimentation in the polymer synthesis process; conversely, a sequence labelled "reject" is expected to perform poorly.
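Per-label counts can also be obtained in a single call with `value_counts`. A small sketch on an illustrative frame with the same two-column schema (the three sample rows are our own, the first two copied from the preview above):

```python
import pandas as pd

# Tiny illustrative frame mimicking the schema: column 0 holds the
# sequence, column 1 the adjudication label
sample = pd.DataFrame({
    0: ["NC(N)=NC(=O)CC1COC(=O)OC1", "O=COCC1COC(=O)OC1", "CCCC"],
    1: ["accept", "accept", "reject"],
})

# value_counts tallies how many sequences carry each label
label_counts = sample[1].value_counts()
```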

In [6]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 125143 entries, 0 to 125142
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   0       125143 non-null  object
 1   1       125143 non-null  object
dtypes: object(2)
memory usage: 1.9+ MB
In [7]:
data[data[1]=='accept'].info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 118 entries, 0 to 117
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       118 non-null    object
 1   1       118 non-null    object
dtypes: object(2)
memory usage: 2.8+ KB
In [8]:
data[data[1]=='reject'].info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 125025 entries, 118 to 125142
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   0       125025 non-null  object
 1   1       125025 non-null  object
dtypes: object(2)
memory usage: 2.9+ MB

We can see from the above info tables that there are 118 "accept" sequences and 125,025 "reject" sequences, so the dataset is heavily imbalanced.
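With roughly a 1000:1 skew between the classes, any classifier trained on this data will need to account for the imbalance. One common remedy, sketched here as an assumption rather than anything the dataset prescribes, is inverse-frequency class weighting:

```python
# Inverse-frequency class weights for the imbalanced labels.
# Counts are taken from the info() output above: 118 accept, 125,025 reject.
counts = {"accept": 118, "reject": 125025}
total = sum(counts.values())
n_classes = len(counts)

# weight(c) = total / (n_classes * count(c)): the rare class is up-weighted
class_weights = {c: total / (n_classes * n) for c, n in counts.items()}
```

These weights can be passed to most training APIs (e.g. a `class_weight` argument) so that errors on the rare "accept" class cost proportionally more.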

Visualize the Data

Let's visualize a subset of our data to learn a little more about its contents.

Let us look at the sequences and their lengths. The sequences are chemical compounds, so structural composition matters more than raw length; still, sequence lengths can inform hyper-parameter choices in later downstream sequence-modelling tasks (such as LSTM size/length). We will then look at the most common elements in the sequences.

In [9]:
sequence_lengths = [len(x) for x in data[0]]
In [10]:
# Plot raw sequence lengths in dataset order; the mixture looks homogeneous
plt.plot(sequence_lengths)
Out[10]:
[<matplotlib.lines.Line2D at 0x7ff82ee0af98>]
In [11]:
# Plot a histogram to see the distribution of sequence lengths
plt.hist(sequence_lengths)
Out[11]:
(array([1.3750e+03, 4.5882e+04, 3.7014e+04, 2.7752e+04, 7.5550e+03,
        3.9490e+03, 1.3090e+03, 2.1500e+02, 3.4000e+01, 5.8000e+01]),
 array([ 14. ,  48.3,  82.6, 116.9, 151.2, 185.5, 219.8, 254.1, 288.4,
        322.7, 357. ]),
 <a list of 10 Patch objects>)
In [12]:
# Mean and standard deviation of the above distribution.
# The average sequence length is about 103 characters
np.mean(sequence_lengths), np.std(sequence_lengths)
Out[12]:
(103.17281030501107, 39.81549915560055)
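These statistics can feed directly into a padding-length choice for sequence models. One common heuristic, our assumption rather than anything the notebook prescribes, is mean plus two standard deviations (the lengths below are synthetic, drawn to match the observed statistics):

```python
import numpy as np

# Synthetic lengths matching the observed statistics (mean ~103, std ~40,
# range 14-357); in practice, use the real sequence_lengths list
rng = np.random.default_rng(0)
lengths = rng.normal(103, 40, size=10_000).clip(14, 357).astype(int)

# Heuristic: pad/truncate to mean + 2*std, which covers most sequences
pad_len = int(lengths.mean() + 2 * lengths.std())
coverage = float((lengths <= pad_len).mean())
```

For a roughly normal distribution, this covers about 95% of sequences while keeping the padded tensors much shorter than the 357-character maximum.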
In [13]:
# Finally, collect every individual character across the sequences
elements = []
for x in data[0]:
    elements+=list(x)
In [14]:
plt.hist(elements)
Out[14]:
(array([6.493176e+06, 2.542684e+06, 1.310071e+06, 1.706544e+06,
        4.068000e+05, 3.238000e+03, 3.580020e+05, 6.084400e+04,
        1.794200e+04, 1.205400e+04]),
 array([ 0. ,  3.3,  6.6,  9.9, 13.2, 16.5, 19.8, 23.1, 26.4, 29.7, 33. ]),
 <a list of 10 Patch objects>)
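`plt.hist` on raw characters bins them by code point, which makes the x-axis hard to interpret. A `collections.Counter` gives exact per-character frequencies instead; a small sketch using two rows from the preview above (note that single characters are only a rough proxy for elements, since two-letter symbols such as `Cl` would split into `C` and `l`):

```python
from collections import Counter

# Two sequences copied from the data preview earlier in the notebook
smiles = ["NC(N)=NC(=O)CC1COC(=O)OC1", "O=COCC1COC(=O)OC1"]

# Count every character across all sequences
char_counts = Counter("".join(smiles))
top_chars = char_counts.most_common(3)
```

On the full dataset, the same two lines reveal which atoms and bond symbols dominate the corpus, which is useful when building a tokenizer vocabulary.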