This notebook relates to the Oil Reservoir Simulations Dataset (ORSD). This dataset contains tens of thousands of simulations generated by a physics-based simulator on a basic oil-reservoir field model. Each simulation comprises an input action sequence and an output prediction sequence. The goal of this dataset is to aid the development of machine-learning models that can accurately predict the output sequence, given an input sequence, thus offering a large data corpus for evaluating sequence-to-sequence models. This dataset can be obtained for free from the IBM Developer Data Asset Exchange.
In this notebook, we explore Aspect 3: Features and Post-processing of the ORSD. We share the input and output sequences obtained after the pre-processing described in the previous notebook. This includes the input features, i.e., drilling actions and drilling locations, which are categorical variables (properly encoded), as well as geological features and control features (included as real-valued vectors). Both input and output variables are standardized.
The information provided in this notebook will save ML researchers the non-trivial effort of encoding the sequential information from the JSON files.
Before you run this notebook, complete the following steps:
When you import this project from the Watson Studio Gallery, a token should be automatically generated and inserted at the top of this notebook as a code cell such as the one below:
# @hidden_cell
# The project token is an authorization token that is used to access project resources like data sources and connections, and is used by platform APIs.
from project_lib import Project
project = Project(project_id='YOUR_PROJECT_ID', project_access_token='YOUR_PROJECT_TOKEN')
pc = project.project_context
If you do not see the cell above, follow these steps to enable the notebook to access the dataset from the project's resources:
More -> Insert project token
in the top-right menu section. This should insert a cell at the top of this notebook similar to the example given above.
If an error is displayed indicating that no project token is defined, follow these instructions.
Run the newly inserted cell before proceeding with the notebook execution below
Install and import the required packages:
# Installing packages needed for data processing and visualization
!pip install numpy pandas matplotlib h5py
# Define required imports
import io
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import h5py as h5
import pickle
In this section we take a closer look at the data used in Aspect 3 of the ORSD. Let's list all the files and assets included in this project to get a better sense of what we will be working with.
# Extract a sorted list of all assets associated with the project
sorted([d['name'] for d in project.get_assets()])
In this notebook, we will primarily work with two files, SPE9-TRIANGLE.Aspect3.compressed.h5 and SPE9-MAX.Aspect3.compressed.h5. These files contain the pre-processed input and output sequence tensors (including the geological features) derived from the SPE9 reservoir model. For more information on the data files, please see the previous notebook.
# Extract datasets in project storage
trianglefn = project.get_file("SPE9-TRIANGLE.Aspect3.compressed.h5")
maxfn = project.get_file("SPE9-MAX.Aspect3.compressed.h5")
In this section, we show a sample walk-through of the data in the SPE9-TRIANGLE.Aspect3.compressed.h5 file. We discuss the representations of the input and output variables, including the train, dev1/dev2, and test partitions. We also investigate the additional objects, such as inp_stddims, inp_stdscalerobj, and hash_*, which are needed for future deep-learning analyses.
# list contents and their shapes
with h5.File(trianglefn, 'r') as f:
    for i in list(f):
        print("%s %s" % (i, str(f[i].shape)))
In the above, tensors prefixed with X_ and Y_ contain the input and output variables, respectively. These are further partitioned as follows:
train - Training batch.
dev1 / dev2 - Development partitions used for various cross-validation and tuning tasks.
test - Test set, used for the final evaluation.
Suffixes -match and -mismatch refer to the matched (non-drift) and mismatched (drift) conditions with respect to the training condition. Each X/Y tensor has dimensions [simulation, time within sequence, variable dimension]. For instance, the shape of X_dev-mismatch is (2000, 100, 248), which means this tensor contains a batch of 2000 simulations, each of which has a length of 100 timesteps, each of which is a 248-dimensional feature vector (input). Aligned with this batch, Y_dev-mismatch of shape (2000, 100, 4) contains the corresponding 4-dimensional output variables (output).
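To make the layout concrete, here is a minimal sketch of loading one X/Y pair and checking that the partitions are aligned, using the X_dev-mismatch / Y_dev-mismatch datasets described above (in h5py 3.x, dataset[()] reads the full dataset into memory):
# Load one partition and verify the X/Y alignment (minimal sketch)
with h5.File(trianglefn, 'r') as f:
    X = f['X_dev-mismatch'][()]   # (simulations, timesteps, input features)
    Y = f['Y_dev-mismatch'][()]   # (simulations, timesteps, output variables)

assert X.shape[:2] == Y.shape[:2]   # same number of simulations and timesteps
print(X.shape, Y.shape)             # expected: (2000, 100, 248) (2000, 100, 4)

# A single input/output sequence pair, e.g. from simulation 0
x_seq, y_seq = X[0], Y[0]           # shapes (100, 248) and (100, 4)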
Besides raw data, the above h5 file also contains several objects that may be needed for further work, namely:
inp_stddims / outp_stddims are lists of 0-based indexes of the input/output variables to which standardization has been applied.
inp_stdscalerobj / outp_stdscalerobj are pickled StandardScaler objects that were used to standardize the designated dimensions in the input/output. An example of loading one of these:
from sklearn.preprocessing import StandardScaler
import pickle

# Assumes the file is open as `f` (e.g. via h5.File(trianglefn, 'r')).
# Note: dataset.value was removed in h5py 3.x; dataset[()] is the equivalent read.
scy = pickle.loads(f['outp_stdscalerobj'][()])
This object can then be used to undo the standardization or to apply the same standardization to additional data if/when needed.
hash_* objects are lists of hash IDs identifying the individual simulations in each partition. The IDs are consistent with those in the *.json files as well as the simulation folders.
fea_descr is a specification of the type and dimensionality of the input feature variables. It is a list:
[["discrete", 3], ["discrete", 25], ["discrete", 26], ["vector", 45], ["vector", 200, "std"]]
indicating that the first dimension in X (["discrete", 3]) is a discrete (categorical) variable that can assume 3 values (0, 1, 2). The second and third dimensions of X are again categorical, with cardinalities 25 and 26, respectively. These are followed by a 45-dimensional real-valued vector (["vector", 45]) and another 200-dimensional vector, which is standardized (["vector", 200, "std"]). A sketch of splitting the input features according to this specification is given below. In the final version of the release we will show an example of reading the X/Y tensors and connecting them to a learnable embedding within a Deep Neural Network.
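For illustration, here is a minimal sketch (not part of the release code) of splitting the 248-dimensional input vectors into their components according to fea_descr. Since 1 + 1 + 1 + 45 + 200 = 248, each discrete variable occupies a single column holding the category index; the variable names in the usage comment are purely illustrative:
# Split X[..., 248] into per-feature blocks following fea_descr (minimal sketch).
# Each "discrete" entry is stored as a single column holding the category index;
# each "vector" entry spans as many columns as its stated dimensionality.
fea_descr = [["discrete", 3], ["discrete", 25], ["discrete", 26], ["vector", 45], ["vector", 200, "std"]]

def split_features(X, fea_descr):
    blocks, start = [], 0
    for descr in fea_descr:
        kind, dim = descr[0], descr[1]
        width = 1 if kind == "discrete" else dim
        blocks.append(X[..., start:start + width])
        start += width
    assert start == X.shape[-1]   # all 248 input dimensions should be consumed
    return blocks

# Illustrative usage (X loaded as in the cells above); these names are assumed, not official:
# action_type, loc_x, loc_y, control_vec, geo_vec = split_features(X, fea_descr)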
Similarly, here is a breakdown of the SPE9-MAX dataset. Due to its simpler nature (no match/mismatch differentiation), there are fewer partitions in this file. However, a repartitioning can be performed as needed; a sketch of one possible repartitioning follows the listing below.
# list contents and their shapes
with h5.File(maxfn, 'r') as f:
    for i in list(f):
        print("%s %s" % (i, str(f[i].shape)))
To summarize, the input for a single time step of a simulation is a 248-dimensional vector encoding the actions and the geology, and the corresponding output is 4 yield rates (oil, gas, water, and water injection).
In this section, we visualize the action sequence and the production-rate sequence of a sample from the dataset. We want to investigate whether location is an important factor in explaining the differences between output sequences. The visualization maps the locations of producer and injector wells, so we can inspect the patterns of well locations in the match, mismatch, and entire regions.
def visualize_SPE9Actions(X, which_sim):
    # Legend labels; cleared after first use to avoid duplicate legend entries
    prodl = "producer well"
    injl = "injector well"
    # First three input dimensions: action type (w) and surface coordinates (x, y)
    s = X[which_sim, :, 0:3]
    plt.figure(figsize=(5, 5))
    for i in range(len(s)):
        w, x, y = s[i]
        if w == 0:
            style = "go"  # green dot, no legend entry
            l = None
        elif w == 1:
            style = "r+"  # red cross: producer well
            l = prodl
            prodl = None  # only label the first producer
        else:
            style = "b*"  # blue star: injector well
            l = injl
            injl = None   # only label the first injector
        plt.plot(x, y, style, label=l)
    plt.legend(loc=0)
    plt.title('Reservoir surface view (simulation %d)' % which_sim)
    plt.xlabel('Surface coordinate X')
    plt.ylabel('Surface coordinate Y')
    plt.xticks([5, 10, 15, 20, 23])
    plt.yticks([5, 10, 15, 20, 24])
# Plot action sequence from the first simulation of the test partition (matched region)
with h5.File(trianglefn, 'r') as f:
    X = f['X_test-match'][()]

which_sim = 0
visualize_SPE9Actions(X, which_sim=which_sim)
The above shows 100 drilling actions on the surface of the SPE9 reservoir. Some of these drill producer wells, others injector wells. Each type fulfils a different function in the reservoir-engineering context, and so their type as well as their location are important factors in explaining the output sequence. As can be seen in the above sample, only one of the triangular areas is covered; in this particular case it is the one matching the training region (we loaded X_test-match).
Let's see what a drilling sample from the mismatched region looks like:
# Plot action sequence from the first simulation of the test partition (mismatched region)
with h5.File(trianglefn, 'r') as f:
    X = f['X_test-mismatch'][()]

which_sim = 0
visualize_SPE9Actions(X, which_sim=which_sim)
Clearly, an ML model trained in the matched condition may be challenged by a sample like this, where all the actions are located in a previously unseen region. This is precisely the goal of the SPE9-TRIANGLE dataset! (Here we loaded X_test-mismatch.)
In contrast, the SPE9-MAX dataset distributes drilling actions over the entire reservoir surface:
# Plot action sequence from the first simulation of the SPE9-MAX test partition
with h5.File(maxfn, 'r') as f:
    X = f['X_test'][()]

which_sim = 0
visualize_SPE9Actions(X, which_sim=which_sim)
Please note that while all action sequences are uniformly distributed over their particular region, the distribution of injectors and producers obeys further (geological) constraints, namely the water saturation at a given location. Hence, distributional patterns of these well types are visible in the above samples.
Now, let's look at the corresponding output sequences in the match and mismatch conditions.
# Define visualize_SPE9PR function
def visualize_SPE9PR(Y, which_simulation):
    # One subplot per output variable (4 field rates)
    fig, ax = plt.subplots(Y.shape[-1], 1, figsize=(8, 15))
    compdescr = ['Oil Prod. Rate', 'Water Prod. Rate', 'Gas Prod. Rate', 'Water Inj. Rate']
    Dplt = Y[which_simulation, ...]   # (timesteps, 4)
    t = range(Dplt.shape[0])
    for c in range(Dplt.shape[-1]):
        ax[c].plot(t, Dplt[:, c], label='Simulator output', linestyle='-', linewidth=2)
        ax[c].set_ylabel(compdescr[c])
        ax[c].grid()
        ax[c].legend()
        if c == 0:
            ax[c].set_title('Simulation %d' % which_simulation)
        if c == 3:
            ax[c].set_xlabel('Time (90 days period)')
Let us plot the first simulation from the SPE9-TRIANGLE test set. To display the actual outputs, we also demonstrate how to undo the standardization. We want to compare the output oil rate over the same 90-day time frame in both conditions.
# Open dataset, load the matched test outputs and the output scaler
with h5.File(trianglefn, 'r') as f:
    Y = f['Y_test-match'][()]
    scy = pickle.loads(f['outp_stdscalerobj'][()])

# Undo standardization
Y = Y * scy.scale_ + scy.mean_

# Apply function visualize_SPE9PR
visualize_SPE9PR(Y, 0)
The above are the four output variables of interest, namely the Oil Production Rate, Water Production Rate, Gas Production Rate, and Water Injection Rate, and their progressions over time. Each time step represents 90 days of real time. These rates are also referred to as field rates, as they aggregate over all individual wells drilled. In Aspect 3 we only work with field rates; however, the Aspect 2 data described in the previous notebook also include the individual well outputs, allowing for more refined modeling.
# Open dataset, load the mismatched test outputs and the output scaler
with h5.File(trianglefn, 'r') as f:
    Y = f['Y_test-mismatch'][()]
    scy = pickle.loads(f['outp_stdscalerobj'][()])

# Undo standardization
Y = Y * scy.scale_ + scy.mean_

# Apply function visualize_SPE9PR
visualize_SPE9PR(Y, 0)
Similarly, in the mismatched region, we compare the same outputs over the same 90-day periods. We can see that the mismatch condition follows a similar trend to the match one, but the peaks of the four variables appear at different moments, depending on the individual wells drilled. These differences in the distribution of the simulator output confirm that location is an important factor in the oil production rate.
This notebook was created by the Center for Open-Source Data & AI Technologies.
Copyright © 2020 IBM. This notebook and its source code are released under the terms of the MIT License.