FinTabNet Dataset

FinTabNet is a large dataset of document images from financial earnings of Fortune500 companies.

The dataset is open sourced by IBM Research and is available to download freely on the IBM Developer Data Asset Exchange.

This notebook can be found on Watson Studio.

Download and Extract the Dataset

In this section, we will download and extract the dataset. We will also install dependencies as needed.

Installing PDF-to-image utils

For this notebook we will install 2 PyPi packages pdf2image and pypdf2 as shown in the cell below. Additionally for all cells to work, we will also need a system wide package called poppler. This needs to be installed for the Watson Studio environment. Please follow the instructions here to install a custom conda package. Specifically we need to install the conda package poppler.

In [1]:
!pip install pdf2image
!pip install pypdf2
Requirement already satisfied: pdf2image in /opt/conda/envs/Python-3.7-main/lib/python3.7/site-packages (1.14.0)
Requirement already satisfied: pillow in /opt/conda/envs/Python-3.7-main/lib/python3.7/site-packages (from pdf2image) (7.2.0)
Requirement already satisfied: pypdf2 in /opt/conda/envs/Python-3.7-main/lib/python3.7/site-packages (1.26.0)
In [2]:
# importing prerequisites
import sys
import requests
import tarfile
import json
import numpy as np
import pdf2image
from os import path
from PIL import Image
from PIL import ImageFont, ImageDraw
from glob import glob
from matplotlib import pyplot as plt
from pdf2image import convert_from_path
from PyPDF2 import PdfFileReader
from IPython.core.display import display, HTML
import pdb
import copy

%matplotlib inline
In [3]:
fname = 'examples.tar.gz'
url = 'https://dax-cdn.cdn.appdomain.cloud/dax-fintabnet/1.0.0/' + fname
r = requests.get(url)
open(fname , 'wb').write(r.content)
Out[3]:
639829
In [4]:
#Extracting the dataset
tar = tarfile.open(fname)
tar.extractall()
tar.close()
In [5]:
# Verifying the file was extracted properly
data_path = "examples/"
path.exists(data_path)
Out[5]:
True

Visualizing the Data

In this section, we visualize the annotations file by overlaying it on the underlying image. We also display the HTML structure for each table to compare.

In [6]:
# Define color code
colors = [(255, 0, 0),(0, 255, 0)]
categories = ["table", "cell"]
In [7]:
# Function to viz the annotation
def markup(image, annotations, pdf_height):
    ''' Draws the segmentation, bounding box, and label of each annotation
    '''
    draw = ImageDraw.Draw(image, 'RGBA')
    for annotation in annotations:
        # Draw bbox
        orig_annotation = copy.copy(annotation['bbox'])
        annotation['bbox'][3] = pdf_height-orig_annotation[1]
        annotation['bbox'][1] = pdf_height-orig_annotation[3]
        draw.rectangle(
            (annotation['bbox'][0],
             annotation['bbox'][1],
             annotation['bbox'][2],
             annotation['bbox'][3]),
            outline=colors[annotation['category_id'] - 1] + (255,),
            width=2
        )
        # Draw label
        w, h = draw.textsize(text=categories[annotation['category_id'] - 1])
        if annotation['bbox'][3] < h:
            draw.rectangle(
                (annotation['bbox'][2],
                 annotation['bbox'][1],
                 annotation['bbox'][2] + w,
                 annotation['bbox'][1] + h),
                fill=(64, 64, 64, 255)
            )
            draw.text(
                (annotation['bbox'][2],
                 annotation['bbox'][1]),
                text=categories[annotation['category_id'] - 1],
                fill=(255, 255, 255, 255)
            )
        else:
            draw.rectangle(
                (annotation['bbox'][0]-w,
                 annotation['bbox'][1]-h,
                 annotation['bbox'][0],
                 annotation['bbox'][1]),
                fill=(64, 64, 64, 255)
            )
            draw.text(
                (annotation['bbox'][0]-w,
                 annotation['bbox'][1]-h),
                text=categories[annotation['category_id'] - 1],
                fill=(255, 255, 255, 255)
            )
    return np.array(image)
In [8]:
# Parse the JSON file and read all the images and labels
with open('examples/FinTabNet_1.0.0_table_example.jsonl', 'r') as fp:
    images = {}
    for line in fp:
        sample = json.loads(line)
        # Index images
        if sample['filename'] in images:
            annotations = images[sample['filename']]["annotations"]
            html = images[sample['filename']]["html"]
        else:
            annotations = []
            html = ""
        for t, token in enumerate(sample["html"]["cells"]):
            if "bbox" in token:
                annotations.append({"category_id":2, "bbox": token["bbox"]})
        #Build html table
        cnt = 0
        for t, token in enumerate(sample["html"]["structure"]["tokens"]):
            html += token
            if token=="<td>":
                html += "".join(sample["html"]["cells"][cnt]["tokens"])
                cnt += 1
        annotations.append({"category_id": 1, "bbox": sample["bbox"]})
        images[sample['filename']] = {'filepath': 'examples/pdf/' + sample["filename"], 'html': html, 'annotations': annotations}
In [9]:
# Visualize annotations and print HTML tables
import matplotlib
matplotlib.rcParams['figure.dpi'] = 250
for i, (filename, image) in enumerate(images.items()):
    pdf_page = PdfFileReader(open(image["filepath"], 'rb')).getPage(0)
    pdf_shape = pdf_page.mediaBox
    pdf_height = pdf_shape[3]-pdf_shape[1]
    pdf_width = pdf_shape[2]-pdf_shape[0]
    converted_images = convert_from_path(image["filepath"], size=(pdf_width, pdf_height))
    img = converted_images[0]
    print("Table HTML for page #{}".format(i))
    display(HTML(image['html']))
    plt.figure()
    plt.imshow(markup(img, image['annotations'], pdf_height))
    plt.title("Page # {}".format(i))
    plt.axis('off')
    
Table HTML for page #0
201720162015
Minimum rentals$2,814$2,394$2,249
Contingent rentals(1)178214194
$2,992$2,608$2,443
Operating LeasesAircraftand RelatedEquipmentFacilitiesand Other
TotalOperatingLeases2018$398$2,047
$2,44520193431,887
2,23020202611,670
1,93120212031,506
1,70920221851,355
1,540Thereafter1757,844
8,019Total$1,565$16,309
Table HTML for page #1
Amount Reclassified from AOCI
Affected Line Item in theIncome Statement201720162015
Amortization of retirement plans prior servicecredits, before tax$120$121$115
Salaries and employee benefitsIncome tax benefit(44)(45)(43)
Provision for income taxesAOCI reclassifications, net of tax$76$76$72
201720162015
Foreign currency translation gain (loss):
Balance at beginning of period$(514)$(253)$81
Translation adjustments(171)(261)(334)
Balance at end of period(685)(514)(253)
Retirement plans adjustments:
Balance at beginning of period345425425
Prior service credit and other arising during period1(4)72
Reclassifications from AOCI(76)(76)(72)
Balance at end of period270345425
Accumulated other comprehensive (loss) income at end of period$(415)$(169)$172
Table HTML for page #2
2018$81
201971
202055
202144
202241
20172016GrossCarryingAmountAccumulated AmortizationNet BookValueGrossCarryingAmount
Accumulated AmortizationNet BookValueCustomer relationships$656$(203)$453$912
$(156)$756Technology54(26)28123
(16)107Trademarks and other136(88)48202
(57)145Total$846$(317)$529$1,237
20172016
Accrued Salaries and Employee Benefits
Salaries$431$478
Employee benefits, including variable compensation781804
Compensated absences702690
$1,914$1,972
Accrued Expenses
Self-insurance accruals$976$837
Taxes other than income taxes283311
Other1,9711,915
$3,230$3,063
Table HTML for page #3
(in millions, except per share amounts)FirstQuarterSecondQuarterThirdQuarterFourth Quarter
2017(1)
Revenues$14,663$14,931$14,997$15,728
Operating income1,2641,1671,0251,581
Net income7157005621,020
Basic earnings per common share(2)2.692.632.113.81
Diluted earnings per common share(2)2.652.592.073.75
2016(3)
Revenues$12,279$12,453$12,654$12,979
Operating income (loss)1,1441,137864(68)
Net income (loss)692691507(70)
Basic earnings (loss) per common share(2)2.452.471.86(0.26)
Diluted earnings (loss) per common share(2)2.422.441.84(0.26)
Table HTML for page #4
201720162015
Low3.25%2.75%4.50%
High4.504.507.00
Weighted-average4.033.825.90
Table HTML for page #5
Aircraft andAircraft RelatedOther(1)Total
2018$1,777$1,440$3,217
20191,7295082,237
20201,9334002,333
20211,3413091,650
20221,2761981,474
Thereafter2,8954993,394
Total$10,951$3,354$14,305
B767FB777FTotal
201814418
201915217
202016319
202110313
202210414
Thereafter6-6
Total711687
Table HTML for page #6
2017Percent of Revenue 2017
Revenues$7,401100.0%
Operating expenses:
Salaries and employee benefits2,07728.1
Purchased transportation3,04941.2
Rentals3534.8
Depreciation and amortization2393.2
Fuel2253.1
Maintenance and repairs1431.9
Intercompany charges170.2
Other1,21416.4
Total operating expenses7,31798.9%
Operating income$84
Operating margin1.1%
Package:
Average daily packages1,022
Revenue per package (yield)$24.77
Freight:
Average daily pounds3,608
Revenue per pound (yield)$0.56
Table HTML for page #7
20172016TotalNumber ofSharesPurchasedAveragePrice Paidper ShareTotalPurchasePriceTotalNumber ofSharesPurchased
AveragePrice Paidper ShareTotalPurchasePriceCommon stock repurchases2,955,000$172.13$50918,225,000
Table HTML for page #8
20172016
Funded Status of Plans:
Projected benefit obligation (PBO)$29,913$29,602
Fair value of plan assets26,31224,271
Funded status of the plans$(3,601)$(5,331)
Cash Amounts:
Cash contributions during the year$2,115$726
Benefit payments during the year$2,310$912
MeasurementDateDiscount Rate
5/31/20174.08%
5/31/20164.13
5/31/20154.42
5/31/20144.60
Table HTML for page #9
Net Book Value at May 31,Range2017
2016Wide-body aircraft and related equipment15 to 30 years$9,103
$8,356Narrow-body and feeder aircraft and related equipment5 to 18 years3,099
3,180Package handling and ground support equipment3 to 30 years3,862
3,249Information technology2 to 10 years1,114
1,051Vehicles3 to 15 years3,400
3,084Facilities and other2 to 40 years5,403
In [ ]: