PubTabNet Dataset

PubTabNet is a large dataset for image-based table recognition, containing 568k+ images of tabular data annotated with the corresponding HTML representation of the tables.

The dataset is open-sourced by IBM Research Australia and is freely available for download from the IBM Developer Data Asset Exchange.

This notebook can be found on GitHub and Watson Studio.

In [1]:
# importing prerequisites
import sys
import requests
import tarfile
import json
import numpy as np
from os import path
from PIL import Image
from PIL import ImageFont, ImageDraw
from glob import glob
from matplotlib import pyplot as plt
%matplotlib inline

Download and Extract the Dataset

Since the full dataset is large (~12GB), we download and extract only a small subset of the data here.

In [2]:
fname = 'examples.tar.gz'
url = 'https://dax-cdn.cdn.appdomain.cloud/dax-pubtabnet/1.0.0/' + fname
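r = requests.get(url)
r.raise_for_status()  # fail early if the download did not succeed
with open(fname, 'wb') as f:
    n_bytes = f.write(r.content)  # the file is closed when the block exits
n_bytes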
Out[2]:
300
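The cell above holds the whole response in memory, which is fine for the small example archive. For the full ~12GB dataset, a chunked download is usually safer. The sketch below is one way to do that; it uses the standard library's `urllib.request` instead of `requests`, and `download_file` and `chunk_size` are hypothetical names introduced here, not part of the original notebook.

```python
import urllib.request

def download_file(url, dest, chunk_size=1 << 20):
    """Copy url to dest in fixed-size chunks and return the bytes written."""
    written = 0
    with urllib.request.urlopen(url) as response, open(dest, 'wb') as out:
        while True:
            chunk = response.read(chunk_size)  # read at most chunk_size bytes
            if not chunk:                      # empty read means end of stream
                break
            written += out.write(chunk)
    return written

# Example call (not executed here), reusing url and fname from the cell above:
# download_file(url, fname)
```

Only one chunk is held in memory at a time, so peak memory use stays near `chunk_size` regardless of the file size.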
In [3]:
# Extracting the dataset
with tarfile.open(fname) as tar:  # closes the archive even if extraction fails
    tar.extractall()
In [4]:
# Verifying the file was extracted properly
data_path = "examples/"
path.exists(data_path)
Out[4]:
True
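Beyond checking that the folder exists, it can be useful to see which files the archive actually contains. The following is a minimal sketch using only `tarfile`; `list_archive` is a hypothetical helper name, not something defined by the dataset or the original notebook.

```python
import tarfile

def list_archive(fname):
    """Return the names of the regular files stored in a tar archive."""
    with tarfile.open(fname) as tar:
        return [m.name for m in tar.getmembers() if m.isfile()]

# Example call (not executed here): list_archive('examples.tar.gz')
```

Because this reads only the archive index, it does not write anything to disk.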

Visualizing the Data

In this section, we display a raw table image and extract its HTML annotation from the JSON file. We then render the table using Jupyter Notebook's built-in HTML display.

In [5]:
# Helper function to read in tables from the annotations
import re
from bs4 import BeautifulSoup as bs

def format_html(img):
    ''' Formats HTML code from tokenized annotation of img
    '''
    html_string = '''<html>
                     <head>
                     <meta charset="UTF-8">
                     <style>
                     table, th, td {
                       border: 1px solid black;
                       font-size: 10px;
                     }
                     </style>
                     </head>
                     <body>
                     <table frame="hsides" rules="groups" width="100%%">
                         %s
                     </table>
                     </body>
                     </html>''' % ''.join(img['html']['structure']['tokens'])
    cell_nodes = list(re.finditer(r'(<td[^<>]*>)(</td>)', html_string))
    assert len(cell_nodes) == len(img['html']['cells']), 'Number of cells defined in tags does not match the length of cells'
    cells = [''.join(c['tokens']) for c in img['html']['cells']]
    offset = 0
    for n, cell in zip(cell_nodes, cells):
        html_string = html_string[:n.end(1) + offset] + cell + html_string[n.start(2) + offset:]
        offset += len(cell)
    # prettify the html
    # name the parser explicitly so BeautifulSoup does not have to guess one
    soup = bs(html_string, 'html.parser')
    html_string = soup.prettify()
    return html_string
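To make the cell-filling step in `format_html` easier to follow, here is a stripped-down sketch of the same idea on a toy annotation. The toy data and the name `fill_cells` are made up for illustration; only the token layout (`structure` tokens plus per-cell `tokens`) mirrors the annotations used in this notebook.

```python
import re

def fill_cells(structure_tokens, cells):
    """Insert each cell's joined tokens into the matching empty <td></td> pair."""
    html = ''.join(structure_tokens)
    # [^<>]* lets the opening tag carry attributes
    slots = list(re.finditer(r'(<td[^<>]*>)(</td>)', html))
    assert len(slots) == len(cells), 'structure and cells disagree'
    parts, prev = [], 0
    for slot, cell in zip(slots, cells):
        parts.append(html[prev:slot.end(1)])   # text up to and including <td>
        parts.append(''.join(cell['tokens']))  # the cell content
        prev = slot.start(2)                   # resume at the closing </td>
    parts.append(html[prev:])
    return ''.join(parts)

# Toy annotation: one row with two cells
toy = {
    'structure': {'tokens': ['<tr>', '<td>', '</td>', '<td>', '</td>', '</tr>']},
    'cells': [{'tokens': ['<b>', 'A', '</b>']}, {'tokens': ['1', '.', '0']}],
}
print(fill_cells(toy['structure']['tokens'], toy['cells']))
# <tr><td><b>A</b></td><td>1.0</td></tr>
```

Unlike `format_html`, this variant builds the result from slices of the original string, so it does not need to track a running offset.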
In [6]:
# Loading the json annotations
with open('examples/PubTabNet_Examples.json', 'r') as fp:
    annotations = json.load(fp)
In [7]:
# Inspecting the annotations
annotations.keys()
Out[7]:
dict_keys(['images', 'dataset'])
In [8]:
annotations['images'][0].keys()
Out[8]:
dict_keys(['filename', 'split', 'imgid', 'html', 'sentids'])
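Since every image record carries a `split` key, a quick tally per split is a natural next check. The sketch below runs on a toy dictionary with the same keys; `count_splits` is a hypothetical helper, and the `'val'` value in the toy data is an assumption (only `'train'` appears in the record shown in this notebook).

```python
from collections import Counter

def count_splits(annotations):
    """Count how many image records belong to each split."""
    return Counter(img['split'] for img in annotations['images'])

# Toy data shaped like the annotations above; 'val' is an assumed split name
toy_annotations = {'images': [
    {'filename': 'a.png', 'split': 'train'},
    {'filename': 'b.png', 'split': 'train'},
    {'filename': 'c.png', 'split': 'val'},
]}
print(count_splits(toy_annotations))
# Counter({'train': 2, 'val': 1})
```

The same call should work on the `annotations` dictionary loaded above.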
In [9]:
annotations['images'][0]
Out[9]:
{'filename': 'PMC4840965_004_00.png',
 'split': 'train',
 'imgid': 0,
 'html': {'cells': [{'tokens': ['<b>',
     'V',
     'a',
     'r',
     'i',
     'a',
     'b',
     'l',
     'e',
     '</b>']},
   {'tokens': ['<b>',
     'H',
     'a',
     'z',
     'a',
     'r',
     'd',
     ' ',
     'r',
     'a',
     't',
     'i',
     'o',
     '</b>']},
   {'tokens': ['<b>', '9', '5', ' ', '%', ' ', 'C', 'I', '</b>']},
   {'tokens': ['<b>',
     '<i>',
     'p',
     '</i>',
     ' ',
     'v',
     'a',
     'l',
     'u',
     'e',
     '*',
     '</b>']},
   {'tokens': ['A', 'g', 'e', ' ', '(', 'm', 'e', 'd', 'i', 'a', 'n', ')']},
   {'tokens': []},
   {'tokens': []},
   {'tokens': ['0', '.', '7', '1', '6']},
   {'tokens': [' ', '≤', '6', '9']},
   {'tokens': ['1', '.', '0', '0', '0']},
   {'tokens': []},
   {'tokens': []},
   {'tokens': [' ', '>', '6', '9']},
   {'tokens': ['0', '.', '8', '3', '9']},
   {'tokens': ['0', '.', '3', '1', '0', '–', '2', '.', '2', '6', '8']},
   {'tokens': []},
   {'tokens': ['G', 'e', 'n', 'd', 'e', 'r']},
   {'tokens': []},
   {'tokens': []},
   {'tokens': ['0', '.', '1', '4', '2']},
   {'tokens': [' ', 'M', 'a', 'l', 'e']},
   {'tokens': ['1', '.', '0', '0', '0']},
   {'tokens': []},
   {'tokens': []},
   {'tokens': [' ', 'F', 'e', 'm', 'a', 'l', 'e']},
   {'tokens': ['0', '.', '4', '2', '6']},
   {'tokens': ['0', '.', '1', '5', '2', '–', '1', '.', '1', '9', '0']},
   {'tokens': []},
   {'tokens': ['T',
     'y',
     'p',
     'e',
     ' ',
     'o',
     'f',
     ' ',
     's',
     'u',
     'r',
     'g',
     'e',
     'r',
     'y']},
   {'tokens': []},
   {'tokens': []},
   {'tokens': ['0', '.', '0', '1', '0']},
   {'tokens': [' ',
     'L',
     'o',
     'w',
     ' ',
     'a',
     'n',
     't',
     'e',
     'r',
     'i',
     'o',
     'r',
     ' ',
     'r',
     'e',
     's',
     'e',
     'c',
     't',
     'i',
     'o',
     'n']},
   {'tokens': ['1', '.', '0', '0', '0']},
   {'tokens': []},
   {'tokens': []},
   {'tokens': [' ',
     'A',
     'b',
     'd',
     'o',
     'm',
     'i',
     'n',
     'o',
     'p',
     'e',
     'r',
     'i',
     'n',
     'e',
     'a',
     'l',
     ' ',
     'r',
     'e',
     's',
     'e',
     'c',
     't',
     'i',
     'o',
     'n']},
   {'tokens': ['3', '.', '1', '4', '0']},
   {'tokens': ['0', '.', '9', '1', '9', '–', '1', '0', '.', '7', '2', '5']},
   {'tokens': []},
   {'tokens': ['T',
     'u',
     'm',
     'o',
     'r',
     ' ',
     'l',
     'o',
     'c',
     'a',
     't',
     'i',
     'o',
     'n']},
   {'tokens': []},
   {'tokens': []},
   {'tokens': ['0', '.', '7', '1', '0']},
   {'tokens': [' ',
     'U',
     'p',
     'p',
     'e',
     'r',
     ' ',
     'r',
     'e',
     'c',
     't',
     'u',
     'm']},
   {'tokens': ['1', '.', '0', '0', '0']},
   {'tokens': []},
   {'tokens': []},
   {'tokens': [' ',
     'M',
     'i',
     'd',
     'd',
     'l',
     'e',
     ' ',
     'r',
     'e',
     'c',
     't',
     'u',
     'm']},
   {'tokens': ['1', '.', '2', '6', '7']},
   {'tokens': ['0', '.', '3', '8', '1', '–', '4', '.', '2', '1', '3']},
   {'tokens': []},
   {'tokens': [' ', 'L', 'o', 'w', ' ', 'r', 'e', 'c', 't', 'u', 'm']},
   {'tokens': ['1', '.', '7', '1', '6']},
   {'tokens': ['0', '.', '4', '1', '9', '–', '7', '.', '0', '2', '6']},
   {'tokens': []},
   {'tokens': ['G',
     'r',
     'a',
     'd',
     'e',
     ' ',
     'o',
     'f',
     ' ',
     'd',
     'i',
     'f',
     'f',
     'e',
     'r',
     'e',
     'n',
     't',
     'i',
     'a',
     't',
     'i',
     'o',
     'n']},
   {'tokens': []},
   {'tokens': []},
   {'tokens': ['0', '.', '9', '3', '6']},
   {'tokens': [' ', 'G', '1']},
   {'tokens': ['1', '.', '0', '0', '0']},
   {'tokens': []},
   {'tokens': []},
   {'tokens': [' ', 'G', '2']},
   {'tokens': ['1', '.', '9', '3', '3']},
   {'tokens': ['0', '.', '4', '1', '6', '–', '3', '.', '4', '2', '3']},
   {'tokens': []},
   {'tokens': [' ', 'G', '3']},
   {'tokens': ['1', '.', '1', '1', '9']},
   {'tokens': ['0', '.', '1', '3', '7', '–', '9', '.', '1', '3', '7']},
   {'tokens': []},
   {'tokens': ['H',
     'i',
     's',
     't',
     'o',
     'l',
     'o',
     'g',
     'i',
     'c',
     ' ',
     't',
     'y',
     'p',
     'e']},
   {'tokens': []},
   {'tokens': []},
   {'tokens': ['0', '.', '2', '9', '9']},
   {'tokens': [' ',
     'A',
     'd',
     'e',
     'n',
     'o',
     'c',
     'a',
     'r',
     'c',
     'i',
     'n',
     'o',
     'm',
     'a']},
   {'tokens': ['1', '.', '0', '0', '0']},
   {'tokens': []},
   {'tokens': []},
   {'tokens': [' ',
     'A',
     'd',
     'e',
     'n',
     'o',
     'c',
     'a',
     'r',
     'c',
     'i',
     'n',
     'o',
     'm',
     'a',
     ' ',
     'w',
     'i',
     't',
     'h',
     ' ',
     'm',
     'u',
     'c',
     'i',
     'n',
     'o',
     'u',
     's',
     ' ',
     'f',
     'e',
     'a',
     't',
     'u',
     'r',
     'e',
     's']},
   {'tokens': ['0', '.', '3', '8', '1']},
   {'tokens': ['0', '.', '0', '9', '6', '–', '1', '.', '5', '1', '4']},
   {'tokens': []},
   {'tokens': ['D',
     'e',
     'p',
     't',
     'h',
     ' ',
     'o',
     'f',
     ' ',
     't',
     'u',
     'm',
     'o',
     'r',
     ' ',
     'i',
     'n',
     'v',
     'a',
     's',
     'i',
     'o',
     'n']},
   {'tokens': []},
   {'tokens': []},
   {'tokens': ['0', '.', '9', '2', '5']},
   {'tokens': [' ', 'T', '3']},
   {'tokens': ['1', '.', '0', '0', '0']},
   {'tokens': []},
   {'tokens': []},
   {'tokens': [' ', 'T', '4', 'a']},
   {'tokens': ['0', '.', '9', '1', '9']},
   {'tokens': ['0', '.', '3', '1', '6', '–', '2', '.', '6', '7', '3']},
   {'tokens': []},
   {'tokens': [' ', 'T', '4', 'b']},
   {'tokens': ['0', '.', '7', '4', '5']},
   {'tokens': ['0', '.', '1', '7', '2', '–', '3', '.', '2', '2', '3']},
   {'tokens': []},
   {'tokens': ['T', 'u', 'm', 'o', 'r', ' ', 's', 'i', 'z', 'e']},
   {'tokens': []},
   {'tokens': []},
   {'tokens': ['0', '.', '3', '2', '9']},
   {'tokens': [' ', '≤', '4', ' ', 'c', 'm']},
   {'tokens': ['1', '.', '0', '0', '0']},
   {'tokens': []},
   {'tokens': []},
   {'tokens': [' ', '>', '4', ' ', 'c', 'm']},
   {'tokens': ['0', '.', '5', '9', '4']},
   {'tokens': ['0', '.', '2', '1', '4', '–', '1', '.', '6', '5', '1']},
   {'tokens': []}],
  'imgid': 0,
  'structure': {'tokens': ['<thead>',
    '<tr>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '</tr>',
    '</thead>',
    '<tbody>',
    '<tr>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '</tr>',
    '<tr>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '</tr>',
    '<tr>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '</tr>',
    '<tr>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '</tr>',
    '<tr>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '</tr>',
    '<tr>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '</tr>',
    '<tr>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '</tr>',
    '<tr>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '</tr>',
    '<tr>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '</tr>',
    '<tr>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '</tr>',
    '<tr>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '</tr>',
    '<tr>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '</tr>',
    '<tr>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '</tr>',
    '<tr>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '</tr>',
    '<tr>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '</tr>',
    '<tr>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '</tr>',
    '<tr>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '</tr>',
    '<tr>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '</tr>',
    '<tr>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '</tr>',
    '<tr>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '</tr>',
    '<tr>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '</tr>',
    '<tr>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '</tr>',
    '<tr>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '</tr>',
    '<tr>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '</tr>',
    '<tr>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '</tr>',
    '<tr>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '</tr>',
    '<tr>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '<td>',
    '</td>',
    '</tr>',
    '</tbody>']},
  'sentid': 0},
 'sentids': [0]}
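The record above is hard to read because cell content is stored one character per token, mixed with inline tags such as `<b>` and `<i>`. The sketch below shows one way to recover plain cell text and per-row cell counts from such a record. `cell_text` and `row_lengths` are hypothetical helper names, and the toy tokens are made up to resemble the record.

```python
import re

TAG = re.compile(r'<[^<>]+>')  # a whole token that is an HTML tag

def cell_text(cell):
    """Join character tokens, dropping inline tags such as <b> or <i>."""
    return ''.join(t for t in cell['tokens'] if not TAG.fullmatch(t)).strip()

def row_lengths(structure_tokens):
    """Return the number of <td> cells in each <tr> row."""
    rows = []
    for tok in structure_tokens:
        if tok == '<tr>':
            rows.append(0)
        elif tok.startswith('<td') and rows:
            rows[-1] += 1
    return rows

toy_cells = [
    {'tokens': ['<b>', '<i>', 'p', '</i>', ' ', 'v', 'a', 'l', 'u', 'e', '*', '</b>']},
    {'tokens': [' ', '>', '6', '9']},
    {'tokens': []},
]
print([cell_text(c) for c in toy_cells])
# ['p value*', '>69', '']

toy_structure = ['<thead>', '<tr>', '<td>', '</td>', '<td>', '</td>', '</tr>', '</thead>',
                 '<tbody>', '<tr>', '<td>', '</td>', '<td>', '</td>', '</tr>', '</tbody>']
print(row_lengths(toy_structure))
# [2, 2]
```

A lone `'>'` token is kept as text because the tag pattern requires an opening `<`. Applied to `annotations['images'][0]`, `row_lengths` should report 28 rows of four cells each (one header row plus 27 body rows).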
In [10]:
# Showing the raw image
from IPython.display import Image as displayImage
filename = annotations['images'][0]['filename']
displayImage("examples/"+filename)
Out[10]:
[Image: raw table image PMC4840965_004_00.png]
In [11]:
# Extracting the HTML for the table from the annotation
html_string = format_html(annotations['images'][0])
print(html_string)
<html>
 <head>
  <meta charset="utf-8"/>
  <style>
   table, th, td {
                       border: 1px solid black;
                       font-size: 10px;
                     }
  </style>
 </head>
 <body>
  <table frame="hsides" rules="groups" width="100%">
   <thead>
    <tr>
     <td>
      <b>
       Variable
      </b>
     </td>
     <td>
      <b>
       Hazard ratio
      </b>
     </td>
     <td>
      <b>
       95 % CI
      </b>
     </td>
     <td>
      <b>
       <i>
        p
       </i>
       value*
      </b>
     </td>
    </tr>
   </thead>
   <tbody>
    <tr>
     <td>
      Age (median)
     </td>
     <td>
     </td>
     <td>
     </td>
     <td>
      0.716
     </td>
    </tr>
    <tr>
     <td>
      ≤69
     </td>
     <td>
      1.000
     </td>
     <td>
     </td>
     <td>
     </td>
    </tr>
    <tr>
     <td>
      &gt;69
     </td>
     <td>
      0.839
     </td>
     <td>
      0.310–2.268
     </td>
     <td>
     </td>
    </tr>
    <tr>
     <td>
      Gender
     </td>
     <td>
     </td>
     <td>
     </td>
     <td>
      0.142
     </td>
    </tr>
    <tr>
     <td>
      Male
     </td>
     <td>
      1.000
     </td>
     <td>
     </td>
     <td>
     </td>
    </tr>
    <tr>
     <td>
      Female
     </td>
     <td>
      0.426
     </td>
     <td>
      0.152–1.190
     </td>
     <td>
     </td>
    </tr>
    <tr>
     <td>
      Type of surgery
     </td>
     <td>
     </td>
     <td>
     </td>
     <td>
      0.010
     </td>
    </tr>
    <tr>
     <td>
      Low anterior resection
     </td>
     <td>
      1.000
     </td>
     <td>
     </td>
     <td>
     </td>
    </tr>
    <tr>
     <td>
      Abdominoperineal resection
     </td>
     <td>
      3.140
     </td>
     <td>
      0.919–10.725
     </td>
     <td>
     </td>
    </tr>
    <tr>
     <td>
      Tumor location
     </td>
     <td>
     </td>
     <td>
     </td>
     <td>
      0.710
     </td>
    </tr>
    <tr>
     <td>
      Upper rectum
     </td>
     <td>
      1.000
     </td>
     <td>
     </td>
     <td>
     </td>
    </tr>
    <tr>
     <td>
      Middle rectum
     </td>
     <td>
      1.267
     </td>
     <td>
      0.381–4.213
     </td>
     <td>
     </td>
    </tr>
    <tr>
     <td>
      Low rectum
     </td>
     <td>
      1.716
     </td>
     <td>
      0.419–7.026
     </td>
     <td>
     </td>
    </tr>
    <tr>
     <td>
      Grade of differentiation
     </td>
     <td>
     </td>
     <td>
     </td>
     <td>
      0.936
     </td>
    </tr>
    <tr>
     <td>
      G1
     </td>
     <td>
      1.000
     </td>
     <td>
     </td>
     <td>
     </td>
    </tr>
    <tr>
     <td>
      G2
     </td>
     <td>
      1.933
     </td>
     <td>
      0.416–3.423
     </td>
     <td>
     </td>
    </tr>
    <tr>
     <td>
      G3
     </td>
     <td>
      1.119
     </td>
     <td>
      0.137–9.137
     </td>
     <td>
     </td>
    </tr>
    <tr>
     <td>
      Histologic type
     </td>
     <td>
     </td>
     <td>
     </td>
     <td>
      0.299
     </td>
    </tr>
    <tr>
     <td>
      Adenocarcinoma
     </td>
     <td>
      1.000
     </td>
     <td>
     </td>
     <td>
     </td>
    </tr>
    <tr>
     <td>
      Adenocarcinoma with mucinous features
     </td>
     <td>
      0.381
     </td>
     <td>
      0.096–1.514
     </td>
     <td>
     </td>
    </tr>
    <tr>
     <td>
      Depth of tumor invasion
     </td>
     <td>
     </td>
     <td>
     </td>
     <td>
      0.925
     </td>
    </tr>
    <tr>
     <td>
      T3
     </td>
     <td>
      1.000
     </td>
     <td>
     </td>
     <td>
     </td>
    </tr>
    <tr>
     <td>
      T4a
     </td>
     <td>
      0.919
     </td>
     <td>
      0.316–2.673
     </td>
     <td>
     </td>
    </tr>
    <tr>
     <td>
      T4b
     </td>
     <td>
      0.745
     </td>
     <td>
      0.172–3.223
     </td>
     <td>
     </td>
    </tr>
    <tr>
     <td>
      Tumor size
     </td>
     <td>
     </td>
     <td>
     </td>
     <td>
      0.329
     </td>
    </tr>
    <tr>
     <td>
      ≤4 cm
     </td>
     <td>
      1.000
     </td>
     <td>
     </td>
     <td>
     </td>
    </tr>
    <tr>
     <td>
      &gt;4 cm
     </td>
     <td>
      0.594
     </td>
     <td>
      0.214–1.651
     </td>
     <td>
     </td>
    </tr>
   </tbody>
  </table>
 </body>
</html>
In [12]:
# Rendering the above HTML in Jupyter Notebook for a more readable format
from IPython.display import display, HTML
display(HTML(html_string))
Variable                                Hazard ratio  95 % CI       p value*
Age (median)                                                        0.716
≤69                                     1.000
>69                                     0.839         0.310–2.268
Gender                                                              0.142
Male                                    1.000
Female                                  0.426         0.152–1.190
Type of surgery                                                     0.010
Low anterior resection                  1.000
Abdominoperineal resection              3.140         0.919–10.725
Tumor location                                                      0.710
Upper rectum                            1.000
Middle rectum                           1.267         0.381–4.213
Low rectum                              1.716         0.419–7.026
Grade of differentiation                                            0.936
G1                                      1.000
G2                                      1.933         0.416–3.423
G3                                      1.119         0.137–9.137
Histologic type                                                     0.299
Adenocarcinoma                          1.000
Adenocarcinoma with mucinous features   0.381         0.096–1.514
Depth of tumor invasion                                             0.925
T3                                      1.000
T4a                                     0.919         0.316–2.673
T4b                                     0.745         0.172–3.223
Tumor size                                                          0.329
≤4 cm                                   1.000
>4 cm                                   0.594         0.214–1.651
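Besides rendering the table, it is often handy to have it as plain Python rows, for example to compare against a model's prediction. The sketch below is one possible approach using the standard library's `html.parser` rather than BeautifulSoup; `TableRows` and `html_to_rows` are hypothetical names, and the sample HTML is a made-up fragment in the style of the output above.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # list of text pieces while inside a cell

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.rows.append([])
        elif tag in ('td', 'th'):
            if not self.rows:  # tolerate a cell that appears before any <tr>
                self.rows.append([])
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._cell is not None:
            # collapse the whitespace that prettify() adds around the text
            self.rows[-1].append(' '.join(''.join(self._cell).split()))
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

def html_to_rows(html):
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows

sample = ('<table><tr><td><b>Variable</b></td><td>p value</td></tr>'
          '<tr><td>&gt;69</td><td></td></tr></table>')
print(html_to_rows(sample))
# [['Variable', 'p value'], ['>69', '']]
```

Empty cells come back as empty strings, so each row keeps its column positions, which the rendered text alone does not preserve.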