This notebook relates to the Oil Reservoir Simulations Dataset (ORSD). This dataset contains tens of thousands of simulations generated by a physics-based simulator on a basic oil-reservoir field model. Each simulation comprises an input action sequence and an output prediction sequence. The goal of this dataset is to aid the development of machine-learning models that can accurately predict the output sequence, given an input sequence, thus offering a large data corpus for evaluating sequence-to-sequence models. This dataset can be obtained for free from the IBM Developer Data Asset Exchange.
In this notebook, we explore Aspect 3: Features and Post-processing of the ORSD. We share the input and output sequences obtained after the pre-processing described in the previous notebook. This includes the input features, i.e., drilling actions and drilling locations, which are categorical variables (properly encoded), as well as geological features and control features (included as real-valued vectors). Both input and output variables are standardized.
The information provided in this notebook will save ML researchers the non-trivial effort of encoding the sequential information from the JSON files.
Before you run this notebook, complete the following steps:
When you import this project from the Watson Studio Gallery, a token should be automatically generated and inserted at the top of this notebook as a code cell such as the one below:
# @hidden_cell
# The project token is an authorization token that is used to access project resources like data sources and connections, and is used by platform APIs.
from project_lib import Project
project = Project(project_id='YOUR_PROJECT_ID', project_access_token='YOUR_PROJECT_TOKEN')
pc = project.project_context
If you do not see the cell above, follow these steps to enable the notebook to access the dataset from the project's resources:
More -> Insert project token
in the top-right menu section. This should insert a cell at the top of this notebook similar to the example given above.
If an error is displayed indicating that no project token is defined, follow these instructions.
Run the newly inserted cell before proceeding with the notebook execution below
Install and import the required packages:
# Installing packages needed for data processing and visualization
!pip install numpy pandas matplotlib h5py
# Define required imports
import io
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import h5py as h5
import pickle
In this section we take a closer look at the data used in Aspect 3 of the ORSD. Let's list all the files and assets included in this project to get a better sense of what we will be working with.
# Extract a sorted list of all assets associated with the project
sorted([d['name'] for d in project.get_assets()])
In this notebook, we will primarily work with two files, SPE9-TRIANGLE.Aspect3.compressed.h5 and SPE9-MAX.Aspect3.compressed.h5. These files contain the pre-processed input and output sequence tensors (including the geological features) derived from the SPE9 reservoir model. For more information on the data files, please see the previous notebook.
# Extract datasets in project storage
trianglefn = project.get_file("SPE9-TRIANGLE.Aspect3.compressed.h5")
maxfn = project.get_file("SPE9-MAX.Aspect3.compressed.h5")
In this section, we show a sample walk-through of the data in the SPE9-TRIANGLE.Aspect3.compressed.h5 file. We discuss the representations of the input and output variables, including the train, dev1/dev2, and test partitions. We also investigate the additional objects, such as inp_stddims, inp_stdscalerobj, and hash_*, which are needed for future deep-learning analyses.
# list contents and their shapes
with h5.File(trianglefn, 'r') as f:
    for i in list(f):
        print("%s %s" % (i, str(f[i].shape)))
In the above, tensors prefixed with X_ and Y_ contain the input and output variables, respectively. These are further partitioned as follows:
train - Training batch.
dev1 / dev2 - Development partitions used for various cross-validation and tuning tasks.
test - Test set, used for the final evaluation.
Suffixes -match and -mismatch refer to the matched (non-drift) and mismatched (drift) conditions with respect to the training condition. Each X/Y tensor has dimensions [simulation, time within sequence, variable dimension]. For instance, the shape of X_dev-mismatch is (2000, 100, 248), which means this tensor contains a batch of 2000 simulations, each of which has a length of 100 timesteps, each of which is a 248-dimensional feature vector (input). Aligned with this batch, Y_dev-mismatch of shape (2000, 100, 4) contains the corresponding 4-dimensional output variables (output).
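To make the layout concrete, here is a minimal sketch of loading one X/Y pair and checking that the partitions are aligned, using the X_dev-mismatch / Y_dev-mismatch datasets described above (in h5py 3.x, dataset[()] reads the full dataset into memory):
# Load one partition and verify the X/Y alignment (minimal sketch)
with h5.File(trianglefn, 'r') as f:
    X = f['X_dev-mismatch'][()]   # (simulations, timesteps, input features)
    Y = f['Y_dev-mismatch'][()]   # (simulations, timesteps, output variables)

assert X.shape[:2] == Y.shape[:2]   # same number of simulations and timesteps
print(X.shape, Y.shape)             # expected: (2000, 100, 248) (2000, 100, 4)

# A single input/output sequence pair, e.g. from simulation 0
x_seq, y_seq = X[0], Y[0]           # shapes (100, 248) and (100, 4)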
Besides raw data, the above h5 file also contains several objects that may be needed for further work, namely:
inp_stddims / outp_stddims are lists of 0-based indexes of the input/output variables to which standardization has been applied.
inp_stdscalerobj / outp_stdscalerobj are pickled StandardScaler objects that were used to standardize the designated dimensions in the input/output. An example of loading one of these:
from sklearn.preprocessing import StandardScaler
import pickle

# Assumes the file is open as `f` (e.g. via h5.File(trianglefn, 'r')).
# Note: dataset.value was removed in h5py 3.x; dataset[()] is the equivalent read.
scy = pickle.loads(f['outp_stdscalerobj'][()])
This object can then be used to undo the standardization or to apply the same standardization to additional data if/when needed.
hash_* objects are lists of hash IDs identifying the individual simulations in each partition. The IDs are consistent with those in the *.json files as well as the simulation folders.
fea_descr is a specification of the type and dimensionality of the input feature variables. It is a list:
[["discrete", 3], ["discrete", 25], ["discrete", 26], ["vector", 45], ["vector", 200, "std"]]
indicating that the first dimension in X (["discrete", 3]) is a discrete (categorical) variable that can assume 3 values (0, 1, 2). The second and third dimensions of X are again categorical, with cardinalities 25 and 26, respectively. These are followed by a 45-dimensional real-valued vector (["vector", 45]) and another 200-dimensional vector, which is standardized (["vector", 200, "std"]). A sketch of splitting the input features according to this specification is given below. In the final version of the release we will show an example of reading the X/Y tensors and connecting them to a learnable embedding within a Deep Neural Network.
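For illustration, here is a minimal sketch (not part of the release code) of splitting the 248-dimensional input vectors into their components according to fea_descr. Since 1 + 1 + 1 + 45 + 200 = 248, each discrete variable occupies a single column holding the category index; the variable names in the usage comment are purely illustrative:
# Split X[..., 248] into per-feature blocks following fea_descr (minimal sketch).
# Each "discrete" entry is stored as a single column holding the category index;
# each "vector" entry spans as many columns as its stated dimensionality.
fea_descr = [["discrete", 3], ["discrete", 25], ["discrete", 26], ["vector", 45], ["vector", 200, "std"]]

def split_features(X, fea_descr):
    blocks, start = [], 0
    for descr in fea_descr:
        kind, dim = descr[0], descr[1]
        width = 1 if kind == "discrete" else dim
        blocks.append(X[..., start:start + width])
        start += width
    assert start == X.shape[-1]   # all 248 input dimensions should be consumed
    return blocks

# Illustrative usage (X loaded as in the cells above); these names are assumed, not official:
# action_type, loc_x, loc_y, control_vec, geo_vec = split_features(X, fea_descr)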
Similarly, here is a breakdown of the SPE9-MAX dataset. Due to its simpler nature (no match/mismatch differentiation), there are fewer partitions in this file. However, a repartitioning can be performed as needed; a sketch of one possible repartitioning follows the listing below.
# list contents and their shapes
with h5.File(maxfn, 'r') as f:
    for i in list(f):
        print("%s %s" % (i, str(f[i].shape)))
To summarize, the input for a single time step of a simulation is a 248-dimensional vector encoding the actions and the geology, and the corresponding output is 4 yield rates (oil, gas, water, and water injection).
In this section, we visualize the action sequence and the production-rate sequence of a sample from the dataset. We want to investigate whether location is an important factor in explaining the differences between output sequences. The visualization maps the locations of producer and injector wells, so we can inspect the patterns of well locations in the match, mismatch, and entire regions.
def visualize_SPE9Actions(X, which_sim):
    # Legend labels; cleared after first use to avoid duplicate legend entries
    prodl = "producer well"
    injl = "injector well"
    # First three input dimensions: action type (w) and surface coordinates (x, y)
    s = X[which_sim, :, 0:3]
    plt.figure(figsize=(5, 5))
    for i in range(len(s)):
        w, x, y = s[i]
        if w == 0:
            style = "go"  # green dot, no legend entry
            l = None
        elif w == 1:
            style = "r+"  # red cross: producer well
            l = prodl
            prodl = None  # only label the first producer
        else:
            style = "b*"  # blue star: injector well
            l = injl
            injl = None   # only label the first injector
        plt.plot(x, y, style, label=l)
    plt.legend(loc=0)
    plt.title('Reservoir surface view (simulation %d)' % which_sim)
    plt.xlabel('Surface coordinate X')
    plt.ylabel('Surface coordinate Y')
    plt.xticks([5, 10, 15, 20, 23])
    plt.yticks([5, 10, 15, 20, 24])
# Plot action sequence from the first simulation of the test partition (matched region)
with h5.File(trianglefn, 'r') as f:
    X = f['X_test-match'][()]

which_sim = 0
visualize_SPE9Actions(X, which_sim=which_sim)
The above shows 100 drilling actions on the surface of the SPE9 reservoir. Some of these drill producer wells, others injector wells. Each type fulfils a different function in the reservoir-engineering context, and so their type as well as their location are important factors in explaining the output sequence. As can be seen in the above sample, only one of the triangular areas is covered; in this particular case it is the one matching the training region (we loaded X_test-match).
Let's see what a drilling sample from the mismatched region looks like:
# Plot action sequence from the first simulation of the test partition (mismatched region)
with h5.File(trianglefn, 'r') as f:
    X = f['X_test-mismatch'][()]

which_sim = 0
visualize_SPE9Actions(X, which_sim=which_sim)
Clearly, an ML model trained in the matched condition may be challenged by a sample like this, where all the actions are located in a previously unseen region. This is precisely the goal of the SPE9-TRIANGLE dataset! (Here we loaded X_test-mismatch.)
In contrast, the SPE9-MAX dataset distributes drilling actions over the entire reservoir surface:
# Plot action sequence from the first simulation of the SPE9-MAX test partition
with h5.File(maxfn, 'r') as f:
    X = f['X_test'][()]

which_sim = 0
visualize_SPE9Actions(X, which_sim=which_sim)
Please note that while all action sequences are uniformly distributed over their particular region, the distribution of injectors and producers obeys further (geological) constraints, namely the water saturation at a given location. Hence, distributional patterns of these well types are visible in the above samples.
Now, let's look at the corresponding output sequences in the match and mismatch conditions.
# Define visualize_SPE9PR function
def visualize_SPE9PR(Y, which_simulation):
    # One subplot per output variable (4 field rates)
    fig, ax = plt.subplots(Y.shape[-1], 1, figsize=(8, 15))
    compdescr = ['Oil Prod. Rate', 'Water Prod. Rate', 'Gas Prod. Rate', 'Water Inj. Rate']
    Dplt = Y[which_simulation, ...]   # (timesteps, 4)
    t = range(Dplt.shape[0])
    for c in range(Dplt.shape[-1]):
        ax[c].plot(t, Dplt[:, c], label='Simulator output', linestyle='-', linewidth=2)
        ax[c].set_ylabel(compdescr[c])
        ax[c].grid()
        ax[c].legend()
        if c == 0:
            ax[c].set_title('Simulation %d' % which_simulation)
        if c == 3:
            ax[c].set_xlabel('Time (90 days period)')
Let us plot the first simulation from the SPE9-TRIANGLE test set. To display the actual outputs, we also demonstrate how to undo the standardization. We want to compare the output oil rate over the same 90-day time frame in both conditions.
# Open dataset, load the matched test outputs and the output scaler
with h5.File(trianglefn, 'r') as f:
    Y = f['Y_test-match'][()]
    scy = pickle.loads(f['outp_stdscalerobj'][()])

# Undo standardization
Y = Y * scy.scale_ + scy.mean_

# Apply function visualize_SPE9PR
visualize_SPE9PR(Y, 0)
The above are the four output variables of interest, namely the Oil Production Rate, Water Production Rate, Gas Production Rate, and Water Injection Rate, and their progressions over time. Each time step represents 90 days of real time. These rates are also referred to as field rates, as they aggregate over all individual wells drilled. In Aspect 3 we only work with field rates; however, the Aspect 2 data described in the previous notebook also include the individual well outputs, allowing for more refined modeling.
# Open dataset, load the mismatched test outputs and the output scaler
with h5.File(trianglefn, 'r') as f:
    Y = f['Y_test-mismatch'][()]
    scy = pickle.loads(f['outp_stdscalerobj'][()])

# Undo standardization
Y = Y * scy.scale_ + scy.mean_

# Apply function visualize_SPE9PR
visualize_SPE9PR(Y, 0)
Similarly, in the mismatched region, we compare the same outputs over the same 90-day periods. We can see that the mismatch condition follows a similar trend to the match one, but the peaks of the four variables appear at different moments, depending on the individual wells drilled. These differences in the distribution of the simulator output confirm that location is an important factor in the oil production rate.
This notebook was created by the Center for Open-Source Data & AI Technologies.
Copyright © 2020 IBM. This notebook and its source code are released under the terms of the MIT License.