This notebook relates to the Oil Reservoir Simulations Dataset (ORSD). This dataset contains tens of thousands of simulations generated by a physics-based simulator on a basic oil-reservoir field model. This dataset can be obtained for free from the IBM Developer Data Asset Exchange.
In this notebook, we will discuss what an oil reservoir model is and what it means to run a simulation on such a model. We will also take a look at the types of files included in this dataset and explore some of the features of the raw data.
Before you run this notebook, complete the following steps:
When you import this project from the Watson Studio Gallery, a token should be automatically generated and inserted at the top of this notebook as a code cell such as the one below:
# @hidden_cell
# The project token is an authorization token that is used to access project resources like data sources, connections, and used by platform APIs.
from project_lib import Project
project = Project(project_id='YOUR_PROJECT_ID', project_access_token='YOUR_PROJECT_TOKEN')
pc = project.project_context
If you do not see the cell above, follow these steps to enable the notebook to access the dataset from the project's resources:
Click More -> Insert project token in the top-right menu section. This should insert a cell at the top of this notebook similar to the example given above.
If an error is displayed indicating that no project token is defined, follow these instructions.
Run the newly inserted cell before proceeding with the notebook execution below.
Install and import the required packages:
Download from the Dataset download link: here. Note: It can take a while to download as the zip file is 3.7GB.
Upload the following data assets from the /data folder in the uncompressed zip folder by clicking the 0100 button at the top right of the toolbar:
ddddf3a7-a68a-4502-83fc-4e80c4d5fa37.json
ddddf3a7-a68a-4502-83fc-4e80c4d5fa37.sch
SPE9-geology.h5
SPE9-geology.normed.h5
SPE9-MAX.Aspect3.compressed.h5
SPE9-TRIANGLE.Aspect3.compressed.h5
# Installing packages needed for data processing and visualization
!pip install numpy pandas scikit-learn matplotlib h5py
# Define required imports
import io
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
from mpl_toolkits.mplot3d import Axes3D
from sklearn.utils.extmath import cartesian
import h5py as h5
In this brief section, we provide context on what it means to create an oil reservoir model and to then run a simulation on it. This information is geared towards users who may be unfamiliar with the oil reservoir simulation domain. If you already have experience with this domain, feel free to skip to the next section.
Suppose there is a sizable geological region that is of interest for exploration. A reservoir model (RM) is a space-discretized model capturing essential geological properties of that region (e.g. rock porosity and permeability). An RM is a large collection of small grid cells within which the geological properties are assumed to be constant. RMs are created using methods that combine statistical analysis, geology, and manual geological expertise.
The simulations contained in the ORSD were generated based upon a publicly available RM termed SPE9 - a model designed for research purposes in the area of reservoir simulation. This RM has grid dimensions 24x25x15. Detailed information about the SPE9 model can be found here and the RM can be downloaded from here.
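To make the idea of a space-discretized model concrete, the sketch below treats the 24x25x15 grid as a flat collection of cells and converts between a cell's (x, y, z) position and its position in that collection. The ordering (x varies fastest) and the function names are assumptions made for illustration; they are not taken from the SPE9 files or from the simulator.

```python
# Grid dimensions of the SPE9 reservoir model as stated above (X, Y, Z).
NX, NY, NZ = 24, 25, 15

def cell_index(x, y, z):
    """Map 0-based (x, y, z) cell coordinates to a flat index (x varies fastest)."""
    if not (0 <= x < NX and 0 <= y < NY and 0 <= z < NZ):
        raise IndexError("cell coordinates outside the grid")
    return x + NX * (y + NY * z)

def cell_coords(index):
    """Inverse of cell_index: recover (x, y, z) from a flat index."""
    z, rest = divmod(index, NX * NY)
    y, x = divmod(rest, NX)
    return x, y, z

n_cells = NX * NY * NZ
print(n_cells)                 # 9000 cells, each with constant properties
print(cell_index(23, 24, 14))  # 8999, the last cell
```

Each geological property (porosity, permeability) can then be thought of as one value per cell index.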
A simulator is mathematical software that operates on an RM to predict the consequences of various actions enacted on the reservoir. At its core, the simulator is a Partial Differential Equation (PDE) solver, solving nonlinear equations that describe the physics of fluid flow through a "porous medium" (rock).
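As a toy illustration of what solving a PDE on a grid means, the sketch below advances the one-dimensional pressure diffusion equation dp/dt = eta * d2p/dx2 with an explicit finite-difference scheme. This is a deliberately simplified, linear, single-phase stand-in chosen for illustration. It is not the nonlinear formulation that a reservoir simulator solves, and all parameter values are made up.

```python
def diffuse_pressure(p, eta, dx, dt, steps):
    """Explicit finite-difference update of dp/dt = eta * d2p/dx2.

    The two end cells are held at fixed pressure (boundary conditions).
    """
    r = eta * dt / dx**2
    if r > 0.5:
        raise ValueError("explicit scheme is unstable for eta*dt/dx^2 > 0.5")
    p = list(p)
    for _ in range(steps):
        new_p = p[:]
        for i in range(1, len(p) - 1):
            # Discrete second derivative drives pressure toward its neighbors.
            new_p[i] = p[i] + r * (p[i + 1] - 2 * p[i] + p[i - 1])
        p = new_p
    return p

# Hypothetical setup: high pressure on the left, low pressure on the right.
initial = [300.0] + [200.0] * 9 + [100.0]
final = diffuse_pressure(initial, eta=1.0, dx=1.0, dt=0.25, steps=2000)
print([round(v, 1) for v in final])
```

After enough steps the profile settles into a straight line between the two boundary values. Even this tiny example needs thousands of small updates, which hints at why full three-dimensional simulations are expensive.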
All simulations on the SPE9 RM in the ORSD were generated using a publicly available, open-source reservoir simulator termed Open Porous Media (OPM). The simulator can be obtained here.
The individual simulations in the ORSD collection differ in the action sequences given as input to the simulator, which results in differing output sequences. The primary goal of the ORSD is to support the development of machine learning models that learn from examples of such sequences to predict the output of the OPM simulator. Why? The physics-based simulation is computationally expensive. While one simulation on the SPE9 RM can take on the order of minutes, on larger RMs occurring in the field the simulation time can grow to hours. Accelerating this process bears the promise of enabling new techniques for reservoir exploration, such as optimization of drilling schedules and operations. A paper presenting an approach toward such an acceleration based on deep neural networks can be retrieved here.
In summary, traditional oil reservoir simulation requires using a computationally expensive simulator to generate output sequences (oil, water, and gas production/injection rates) given input sequences (injector and producer well features). This dataset can be used to train a deep learning sequence-to-sequence model to speed up the output sequence generation step by predicting rather than simulating the output sequences.
In this section, we show how the simulations provided in the ORSD can be used with sequence-to-sequence deep learning models. We then look at how the ORSD is partitioned, and discuss the intended uses of each of these partitions.
The ORSD data collection contains roughly 60,000 simulations generated by the OPM simulator on the SPE9 RM. Each simulation comprises an input action sequence and an output prediction sequence. In a machine learning context, the aim is to train and develop machine-learning models so they can accurately predict the output sequence, given an input sequence.
Here is an illustrative example of one simulation:
The intention of this dataset is to cater to researchers across various fields, in particular, to:
Each group may prefer the data in a different form in terms of its relationship to the original task (physics-based simulation). Therefore, the data is organized in three Aspects as follows:
Furthermore, the ORSD splits the simulations into two separate datasets of roughly 30,000 simulations each, namely:
In this section, we take a closer look at the data in Aspects 1 and 2 of the ORSD. First, however, let's list all of the files and assets included in this project to get a better sense of what we will be working with.
# Extract a sorted list of all assets associated with this project
sorted([d['name'] for d in project.get_assets()])
For each simulation, Aspect 1 includes the individual files that are input to the physics-based simulator (OPM). These files may be helpful to researchers who are interested in re-running the actual physics simulations themselves. They are provided in the format required by the simulator software (.sch) and may be accessed by unarchiving either SPE9-TRIANGLE.Aspect1.tgz or SPE9-MAX.Aspect1.tgz. One such file per simulation is included and labeled by a unique hash code.
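One way to inspect such an archive without unpacking all of it is Python's standard tarfile module. The sketch below lists a few of the .sch members of a .tgz archive. The helper name is hypothetical, and nothing is assumed about the folder layout inside the ORSD archives.

```python
import tarfile

def list_sch_members(archive_path, limit=5):
    """Return the names of up to `limit` .sch files inside a .tgz archive."""
    names = []
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive:  # members are read one at a time
            if member.isfile() and member.name.endswith(".sch"):
                names.append(member.name)
                if len(names) >= limit:
                    break
    return names
```

Assuming one of the Aspect 1 archives has been downloaded to the working directory, a call such as list_sch_members('SPE9-TRIANGLE.Aspect1.tgz') would return the first few schedule file names.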
As a sample, let's check out the file ddddf3a7-a68a-4502-83fc-4e80c4d5fa37.sch provided with this project. To help us access text files in this notebook, however, we first create a function that loads data assets into memory.
# Function to load data asset into notebook
def load_data_asset(data_asset_name):
    """
    Loads a data asset
    :param data_asset_name: filename of desired text data asset
    :returns: data asset as TextIOWrapper object
    """
    r = project.get_file(data_asset_name)
    if isinstance(r, list):
        bio = [handle['file_content'] for handle in r if handle['data_file_title'] == data_asset_name][0]
        bio.seek(0)
        return io.TextIOWrapper(bio, encoding='utf-8')
    else:
        r.seek(0)
        return io.TextIOWrapper(r, encoding='utf-8')
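When the notebook is run outside Watson Studio, the project object is not available. A possible local stand-in, assuming the data assets have been copied to a local folder, is sketched below. The function name and the data_dir parameter are hypothetical and are not part of project_lib.

```python
import io
from pathlib import Path

def load_local_asset(data_asset_name, data_dir="data"):
    """Open a text data asset from a local folder as a TextIOWrapper.

    Returns the same kind of object as load_data_asset, so cells that
    consume its result can stay unchanged.
    """
    path = Path(data_dir) / data_asset_name
    if not path.is_file():
        raise FileNotFoundError(f"{path} not found; check data_dir")
    # Read the bytes into memory, then wrap them as UTF-8 text.
    return io.TextIOWrapper(io.BytesIO(path.read_bytes()), encoding="utf-8")
```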
Now we can read in the sample Aspect 1 file. To get a sense of what this file contains, we print a small portion of it. As you can see from the output of the cell below, the file is structured as a set of parameters that define an oil reservoir model in a form that the OPM simulator software can interpret. Parameters are grouped to represent a series of wells to be drilled (input sequence) in the SPE9 RM. The group of parameters printed below represents the first well to be drilled in this simulation.
# Read in a sample Aspect 1 data file and preview the top of the file
a1_sample = load_data_asset('ddddf3a7-a68a-4502-83fc-4e80c4d5fa37.sch')
print(a1_sample.read()[:216])
Here is what each of these parameters represents:
For more information on how to set up a simulation that can be run and solved by the OPM simulator software, check out the latest version of their documentation here.
Let's now take a look at what data is included in Aspect 2 of the ORSD. Aspect 2 data is essentially a combination of the input and output sequences, as generated by the OPM simulator software, but converted into a JSON format. The benefit of this option (to an ML researcher) is the abstraction from the simulator software and its proprietary binary data formats. In order to use the raw data in an ML experiment, the researcher will need to pre-process and encode the input/output, perform partitioning, etc. Note that Aspect 3 (explored in subsequent notebooks) includes pre-processing.
The inputs are sequences of drilling actions, whereby an action is a tuple (drill_type, drill_location-x, drill_location-y, drill_control). The outputs are sequences of production rates (4-dimensional). The raw data in each of the two datasets consist of 30,000 simulations, each stored in a separate JSON file whose name contains the unique hash code consistent with Aspect 1 coding.
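To illustrate the kind of encoding an ML researcher would need to perform on such action tuples, here is a minimal sketch. The example values, the category set, and the scaling choices are all hypothetical. They do not reproduce the JSON keys of the dataset or the pre-processing used for Aspect 3.

```python
# Hypothetical action tuples: (drill_type, x, y, drill_control).
actions = [
    ("producer", 3, 7, 1500.0),
    ("injector", 12, 20, 4000.0),
    ("producer", 18, 2, 1200.0),
]

DRILL_TYPES = ("producer", "injector")  # assumed category set
NX, NY = 24, 25                         # SPE9 grid size in X and Y

def encode_action(action):
    """Turn one action tuple into a flat list of floats."""
    drill_type, x, y, control = action
    one_hot = [1.0 if drill_type == t else 0.0 for t in DRILL_TYPES]
    # Scale coordinates to [0, 1] so they are comparable across axes.
    return one_hot + [x / (NX - 1), y / (NY - 1), control]

encoded = [encode_action(a) for a in actions]
print(len(encoded), len(encoded[0]))  # sequence length, features per step
```

A sequence-to-sequence model would consume a list like encoded as its input and be trained to emit the corresponding sequence of 4-dimensional production rates.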
The following code cell's output contains a single example of a simulation and its raw form.
# Read in a sample Aspect 2 data file and preview its contents
a2_sample = load_data_asset('ddddf3a7-a68a-4502-83fc-4e80c4d5fa37.json')
a2_sample = json.load(a2_sample)
list(a2_sample.keys())
Here is what each of the keys represents:
action_dict_list: contains the input sequence of actions.
modification_time: if and when the data file was last modified, almost always N/A.
realization_id: an integer identifying a geological variant. In the dataset, there is only one geological variant, hence this integer is always 0.
summary_rates: contains the output sequence of production rates. Note that the outputs are labeled by each individual well and type of well.
unique_label: a hash code used to identify this particular simulation in experiments.
In the following code cell, we dig deeper into action_dict_list to peek at how a sample of wells is defined in the input sequence. Note that a "producer" well is a primary recovery well that extracts oil, along with gas and water, from a reservoir. An "injector" well is a secondary recovery well that injects a substance, typically water, into a reservoir in order to sustain the reservoir's pressure as it is depleted. Injector wells can yield an additional 10-40% oil recovery from the reservoir, greatly enhancing the productivity and economics of the development.
# Peek at a sample "producer" and "injector" well
a2_sample['action_dict_list'][:2]
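When exploring a nested JSON file like this one, it can help to print an outline of its structure before looking at individual values. The helper below is a generic sketch that makes no assumption about the keys inside the ORSD files. The function name is hypothetical, and the example dictionary is a made-up stand-in rather than real data.

```python
import json

def summarize(obj, name="root", depth=0, max_depth=2, max_items=3):
    """Return indented lines describing the types and sizes in nested JSON data."""
    pad = "  " * depth
    if isinstance(obj, dict):
        lines = [f"{pad}{name}: dict with {len(obj)} keys"]
        children = list(obj.items())[:max_items]
    elif isinstance(obj, list):
        lines = [f"{pad}{name}: list of {len(obj)} items"]
        children = [(f"[{i}]", v) for i, v in enumerate(obj[:max_items])]
    else:
        return [f"{pad}{name}: {type(obj).__name__}"]
    if depth < max_depth:
        for child_name, child in children:
            lines += summarize(child, child_name, depth + 1, max_depth, max_items)
    return lines

# Made-up stand-in for a loaded simulation file; the real values differ.
example = json.loads('{"unique_label": "abc", "action_dict_list": [{"a": 1}, {"a": 2}]}')
print("\n".join(summarize(example)))
```

In this notebook, print("\n".join(summarize(a2_sample))) would give a compact outline of the sample simulation.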
New features may be engineered upon Aspect 2 of the ORSD. In order to help with this task, SPE9-geology.h5 and SPE9-geology.normed.h5 are included with the dataset. These files represent 4-dimensional tensors containing geological features extracted from the SPE9 reservoir model. These features include rock permeability in all three spatial dimensions (PERMX, PERMY, PERMZ) and rock porosity (PORO). SPE9-geology.normed.h5 simply represents a standardized version of the geology data in contrast to SPE9-geology.h5. Features created from this data are already included in Aspect 3 and will be explored in subsequent notebooks. For example, at a given drill coordinate (X,Y), the vertical geology (column for all Z, subject to drilling depth) was appended as a feature vector.
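The vertical-column feature described above can be sketched as follows. The example uses random numbers in place of the real geology and assumes that each feature is stored with axes ordered (Z, Y, X). Both the axis order and the function name are assumptions for illustration, not a reproduction of how the Aspect 3 features were built.

```python
import numpy as np

# Synthetic stand-in for the geology tensors, assuming each feature is
# stored with axes ordered (Z, Y, X) = (15, 25, 24).
rng = np.random.default_rng(0)
features = ("PERMX", "PERMY", "PERMZ", "PORO")
geology = {name: rng.random((15, 25, 24)) for name in features}

def vertical_column(geology, x, y, depth=None):
    """Concatenate every feature's Z-column at grid location (x, y).

    `depth` optionally truncates the column to the first `depth` layers.
    """
    columns = [geology[name][:depth, y, x] for name in features]
    return np.concatenate(columns)

vec = vertical_column(geology, x=10, y=5)
print(vec.shape)
```

With four features and 15 layers, the full column yields a vector of 60 values that could be appended to the encoded drilling action at that location.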
Let's now take some time to visualize the data included in SPE9-geology.normed.h5. We begin by creating a function that takes in a slice of the tensor and visualizes the features it represents.
def show_SPE9_geology(gamma, title, which_norm=None, dotsize=20):
    """
    Visualizes 3-dimensional geological feature data extracted from the SPE9 reservoir model
    :param gamma: copy of 3-dimensional geology data to visualize
    :param title: title to print for matplotlib graph
    :param which_norm: if/how the data is normalized
    :param dotsize: size to visualize datapoints
    """
    G = gamma.shape
    z = np.arange(G[0])
    y = np.arange(G[1])
    x = np.arange(G[2])
    # every (z, y, x) index combination, one row per grid cell
    I0 = cartesian((z, y, x))
    c0 = [gamma[tuple(I0[i])] for i in range(len(I0))]
    plt.rcParams['figure.figsize'] = (10, 8)
    fig = plt.figure()
    plt.clf()
    ax1 = fig.add_subplot(111, projection='3d')
    # set a white axes background
    plt.gca().patch.set_facecolor('white')
    if which_norm == 'log':
        cum_which_norm = matplotlib.colors.LogNorm
    else:
        cum_which_norm = matplotlib.colors.Normalize
    cax = ax1.scatter(I0[:, 0], I0[:, 1], I0[:, 2], c=c0, cmap='inferno', marker='s', s=dotsize, norm=cum_which_norm())
    # light gray panes; the w_xaxis/w_yaxis/w_zaxis aliases were removed in matplotlib 3.8
    ax1.xaxis.set_pane_color((0.8, 0.8, 0.8, 1.0))
    ax1.yaxis.set_pane_color((0.8, 0.8, 0.8, 1.0))
    ax1.zaxis.set_pane_color((0.8, 0.8, 0.8, 1.0))
    cbar = fig.colorbar(cax)
    plt.title(title)
    plt.show()
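The which_norm argument selects how values are mapped to colors. The sketch below reproduces the arithmetic behind a linear and a logarithmic mapping in plain Python. It illustrates the idea only and is not matplotlib's implementation. In particular, raising an error for non-positive values is a simplification made here.

```python
import math

def linear_norm(v, vmin, vmax):
    """Map v linearly from [vmin, vmax] to [0, 1]."""
    return (v - vmin) / (vmax - vmin)

def log_norm(v, vmin, vmax):
    """Map v to [0, 1] on a logarithmic scale; all inputs must be positive."""
    if min(v, vmin, vmax) <= 0:
        raise ValueError("log scaling is undefined for non-positive values")
    return (math.log(v) - math.log(vmin)) / (math.log(vmax) - math.log(vmin))

# A value of 10 in the range [1, 1000] sits near the bottom of a linear
# color scale but one third of the way up a logarithmic one.
print(linear_norm(10, 1, 1000), log_norm(10, 1, 1000))
```

A logarithmic mapping is useful for quantities that span several orders of magnitude, since it keeps small values from collapsing into a single color.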
Now let's read in the geology tensor, stored as an HDF5 file, and extract all the datasets inside it to a Python dictionary.
# Extract datasets stored in SPE9-geology.normed.h5 to a dictionary
geology = project.get_file('SPE9-geology.normed.h5')
G = {}
with h5.File(geology, 'r') as f:
    for k in f:
        # read the full dataset into a NumPy array (Dataset.value was removed in h5py 3.0)
        G[k] = f[k][()]
Let's peek at the shape of one of the geological features extracted. Notice how the shape matches the grid dimensions of our SPE9 reservoir model.
G['PORO'].shape
Now let's use our show_SPE9_geology function to visualize rock porosity and then rock permeability along the X dimension of our model.
Note that porosity is a value from 0 to 1 that represents the fraction of the rock volume that is void space and can be filled with fluids. 0 means there is no space for fluids inside the rock, while 1 means there is no rock at all and the space is purely occupied by fluids. Similarly, X permeability represents how easy it is for fluids to flow through the rock in the x direction, and is measured in darcy units.
Notice how rock porosity only varies along the Z dimension of our model, while permeability seems to vary across all dimensions of our model.
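To give these two quantities a concrete meaning, the sketch below evaluates Darcy's law for steady single-phase flow through a rock sample and the pore volume of a grid cell. The function names and input values are illustrative and do not come from the SPE9 model.

```python
def darcy_flow_rate(k_darcy, area_cm2, viscosity_cp, dp_atm, length_cm):
    """Volumetric flow rate in cm^3/s from Darcy's law, q = k * A * dP / (mu * L).

    With these units, k = 1 darcy, A = 1 cm^2, mu = 1 cP and a pressure drop
    of 1 atm over 1 cm give exactly 1 cm^3/s, which is how the darcy is defined.
    """
    return k_darcy * area_cm2 * dp_atm / (viscosity_cp * length_cm)

def pore_volume(porosity, cell_volume):
    """Volume available to fluids in a cell: porosity times bulk volume."""
    if not 0.0 <= porosity <= 1.0:
        raise ValueError("porosity must lie between 0 and 1")
    return porosity * cell_volume

print(darcy_flow_rate(1.0, 1.0, 1.0, 1.0, 1.0))  # 1.0
print(pore_volume(0.25, 8.0))                      # 2.0
```

Doubling the permeability doubles the flow rate for the same pressure drop, which is why permeability maps such as PERMX matter for where wells are placed.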
# Rock porosity
show_SPE9_geology(G['PORO'][0], 'PORO', which_norm='log', dotsize=50)
# Rock permeability along X
show_SPE9_geology(G['PERMX'][0], 'PERMX', which_norm='log', dotsize=50)
See the Part 2 - Data visualization notebook to explore Aspect 3 of the ORSD.
This notebook was created by the Center for Open-Source Data & AI Technologies.
Copyright © 2020 IBM. This notebook and its source code are released under the terms of the MIT License.