FinTabNet Dataset

FinTabNet is a large dataset of document images taken from the financial earnings reports of Fortune 500 companies.

The dataset is open-sourced by IBM Research and is freely available for download from the IBM Developer Data Asset Exchange.

This notebook can be found on Watson Studio.

Download and Extract the Dataset

In this section, we will download and extract the dataset. We will also install dependencies as needed.

Installing PDF-to-image utils

For this notebook, we install two PyPI packages, pdf2image and pypdf2, as shown in the cell below. For all cells to work, we also need a system-wide package called poppler, which must be installed in the Watson Studio environment. Please follow the Watson Studio instructions for installing a custom conda package; the conda package we need is poppler.
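
Because pdf2image relies on Poppler's command-line tools, it can help to confirm that they are visible before running the later cells. The following is a minimal sketch, assuming that finding the `pdftoppm` executable on the PATH is a good enough signal; the helper name `poppler_available` is hypothetical and not part of pdf2image.

```python
import shutil

def poppler_available():
    """Return True if Poppler's pdftoppm executable is on the PATH."""
    # pdf2image shells out to Poppler tools such as pdftoppm
    return shutil.which("pdftoppm") is not None

if not poppler_available():
    print("Poppler not found: install the conda package 'poppler' first.")
```

If the message is printed, the PDF-to-image conversion later in the notebook will most likely fail until poppler is installed.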

!pip install pdf2image
!pip install pypdf2

Requirement already satisfied: pdf2image in /opt/conda/envs/Python-3.7-main/lib/python3.7/site-packages (1.14.0)
Requirement already satisfied: pillow in /opt/conda/envs/Python-3.7-main/lib/python3.7/site-packages (from pdf2image) (7.2.0)
Requirement already satisfied: pypdf2 in /opt/conda/envs/Python-3.7-main/lib/python3.7/site-packages (1.26.0)

# importing prerequisites
import sys
import requests
import tarfile
import json
import numpy as np
import pdf2image
from os import path
from PIL import Image
from PIL import ImageFont, ImageDraw
from glob import glob
from matplotlib import pyplot as plt
from pdf2image import convert_from_path
from PyPDF2 import PdfFileReader
from IPython.display import display, HTML
import copy

%matplotlib inline

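fname = 'examples.tar.gz'
url = 'https://dax-cdn.cdn.appdomain.cloud/dax-fintabnet/1.0.0/' + fname
r = requests.get(url)
r.raise_for_status()  # stop early if the download failed
with open(fname, 'wb') as f:
    print(f.write(r.content))  # number of bytes written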

639829

# Extracting the dataset
with tarfile.open(fname) as tar:
    tar.extractall()
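
Before or after extracting, it can be useful to look at what the archive contains. This is a minimal sketch using only the standard library; the helper name `list_archive` and its `limit` parameter are hypothetical.

```python
import tarfile

def list_archive(archive_path, limit=10):
    """Return up to `limit` member names from a tar archive without extracting it."""
    with tarfile.open(archive_path) as tar:
        return tar.getnames()[:limit]

# Example call, assuming the archive has already been downloaded:
# list_archive('examples.tar.gz')
```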

# Verifying the file was extracted properly
data_path = "examples/"
path.exists(data_path)

True

Visualizing the Data

In this section, we visualize the annotations by overlaying them on the underlying page image. We also display the HTML structure of each table for comparison.

# Define color code
colors = [(255, 0, 0),(0, 255, 0)]
categories = ["table", "cell"]

# Function to visualize the annotations
def markup(image, annotations, pdf_height):
    ''' Draws the bounding box and label of each annotation
    '''
    draw = ImageDraw.Draw(image, 'RGBA')
    for annotation in annotations:
        # Flip the y-axis: PDF coordinates start at the bottom-left corner,
        # image coordinates at the top-left. This updates the bbox in place.
        orig_annotation = copy.copy(annotation['bbox'])
        annotation['bbox'][3] = pdf_height-orig_annotation[1]
        annotation['bbox'][1] = pdf_height-orig_annotation[3]
        # Draw bbox
        draw.rectangle(
            (annotation['bbox'][0],
             annotation['bbox'][1],
             annotation['bbox'][2],
             annotation['bbox'][3]),
            outline=colors[annotation['category_id'] - 1] + (255,),
            width=2
        )
        # Draw label
        w, h = draw.textsize(text=categories[annotation['category_id'] - 1])
        if annotation['bbox'][3] < h:
            draw.rectangle(
                (annotation['bbox'][2],
                 annotation['bbox'][1],
                 annotation['bbox'][2] + w,
                 annotation['bbox'][1] + h),
                fill=(64, 64, 64, 255)
            )
            draw.text(
                (annotation['bbox'][2],
                 annotation['bbox'][1]),
                text=categories[annotation['category_id'] - 1],
                fill=(255, 255, 255, 255)
            )
        else:
            draw.rectangle(
                (annotation['bbox'][0]-w,
                 annotation['bbox'][1]-h,
                 annotation['bbox'][0],
                 annotation['bbox'][1]),
                fill=(64, 64, 64, 255)
            )
            draw.text(
                (annotation['bbox'][0]-w,
                 annotation['bbox'][1]-h),
                text=categories[annotation['category_id'] - 1],
                fill=(255, 255, 255, 255)
            )
    return np.array(image)
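
The y-axis flip inside `markup` can also be written as a small pure function. This is a minimal sketch, assuming boxes are stored as [x0, y0, x1, y1] with the origin at the bottom-left of the page; the name `pdf_bbox_to_image` is hypothetical. Unlike `markup`, it returns a new list and leaves its input untouched.

```python
def pdf_bbox_to_image(bbox, page_height):
    """Convert [x0, y0, x1, y1] from PDF space (origin bottom-left)
    to image space (origin top-left) without modifying the input."""
    x0, y0, x1, y1 = bbox
    # The top edge in image space comes from the larger PDF y value
    return [x0, page_height - y1, x1, page_height - y0]

print(pdf_bbox_to_image([100, 500, 300, 700], 792))  # [100, 92, 300, 292]
```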

# Parse the JSON file and read all the images and labels
with open('examples/FinTabNet_1.0.0_table_example.jsonl', 'r') as fp:
    images = {}
    for line in fp:
        sample = json.loads(line)
        # Index images
        if sample['filename'] in images:
            annotations = images[sample['filename']]["annotations"]
            html = images[sample['filename']]["html"]
        else:
            annotations = []
            html = ""
        for t, token in enumerate(sample["html"]["cells"]):
            if "bbox" in token:
                annotations.append({"category_id":2, "bbox": token["bbox"]})
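        # Build the HTML table: cell contents follow "<td>" or, for cells with
        # colspan/rowspan attributes, the ">" that closes the opening tag
        cnt = 0
        for t, token in enumerate(sample["html"]["structure"]["tokens"]):
            html += token
            if token in ("<td>", ">"):
                html += "".join(sample["html"]["cells"][cnt]["tokens"])
                cnt += 1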
        annotations.append({"category_id": 1, "bbox": sample["bbox"]})
        images[sample['filename']] = {'filepath': 'examples/pdf/' + sample["filename"], 'html': html, 'annotations': annotations}
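
The HTML reconstruction in the cell above interleaves structure tokens with cell contents. Here is a minimal standalone sketch of that idea. It assumes a PubTabNet-style token layout in which a plain cell opens with the single token `<td>`, while a spanning cell is split into `<td`, an attribute string, and `>`. The function name `build_table_html` and the miniature sample are hypothetical.

```python
def build_table_html(structure_tokens, cells):
    """Interleave structure tokens with cell contents to produce an HTML string."""
    parts = []
    cell_iter = iter(cells)
    for token in structure_tokens:
        parts.append(token)
        # Content starts right after "<td>" or, for a spanning cell,
        # after the ">" that closes "<td colspan=..." (assumed token layout)
        if token in ("<td>", ">"):
            parts.append("".join(next(cell_iter)["tokens"]))
    return "".join(parts)

# Hypothetical miniature sample: one spanning header cell and two plain cells
structure = ["<tr>", "<td", ' colspan="2"', ">", "</td>", "</tr>",
             "<tr>", "<td>", "</td>", "<td>", "</td>", "</tr>"]
cells = [{"tokens": list("Year")}, {"tokens": list("2017")}, {"tokens": list("2016")}]
print(build_table_html(structure, cells))
```

If spanning cells are skipped when inserting contents, every later cell shifts by one position, which shows up as misaligned columns in the rendered table.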

# Visualize annotations and print HTML tables
import matplotlib
matplotlib.rcParams['figure.dpi'] = 250
for i, (filename, image) in enumerate(images.items()):
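    # Read the page size in PDF points so that the rendered image
    # uses the same coordinate scale as the annotations
    with open(image["filepath"], 'rb') as pdf_file:
        pdf_shape = PdfFileReader(pdf_file).getPage(0).mediaBox
        pdf_height = int(pdf_shape[3]-pdf_shape[1])
        pdf_width = int(pdf_shape[2]-pdf_shape[0])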
    converted_images = convert_from_path(image["filepath"], size=(pdf_width, pdf_height))
    img = converted_images[0]
    print("Table HTML for page #{}".format(i))
    display(HTML(image['html']))
    plt.figure()
    plt.imshow(markup(img, image['annotations'], pdf_height))
    plt.title("Page # {}".format(i))
    plt.axis('off')
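
The loop above renders each page at a pixel size equal to its size in PDF points, so annotation coordinates map one-to-one onto pixels. If a page were instead rendered at a higher resolution (for example with the `dpi` argument of `convert_from_path`), the boxes would need to be scaled as well. This is a minimal sketch, assuming boxes already converted to image space and expressed in points (1/72 inch); the name `scale_bbox` is hypothetical.

```python
def scale_bbox(bbox, dpi):
    """Scale a box given in PDF points (1/72 inch) to pixels at `dpi`."""
    factor = dpi / 72.0
    return [coord * factor for coord in bbox]

print(scale_bbox([100, 92, 300, 292], dpi=144))  # [200.0, 184.0, 600.0, 584.0]
```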

Table HTML for page #0

Table HTML for page #1

Table HTML for page #2

Table HTML for page #3

Table HTML for page #4

Table HTML for page #5

Table HTML for page #6

Table HTML for page #7

Table HTML for page #8

Table HTML for page #9

	2017	2016	2015
Minimum rentals	$2,814	$2,394	$2,249
Contingent rentals⁽¹⁾	178	214	194
	$2,992	$2,608	$2,443


	Operating Leases
	Aircraft and Related Equipment	Facilities and Other	Total Operating Leases
2018	$398	$2,047	$2,445
2019	343	1,887	2,230
2020	261	1,670	1,931
2021	203	1,506	1,709
2022	185	1,355	1,540
Thereafter	175	7,844	8,019
Total	$1,565	$16,309

	Amount Reclassified from AOCI			Affected Line Item in the Income Statement
	2017	2016	2015
Amortization of retirement plans prior service credits, before tax	$120	$121	$115	Salaries and employee benefits
Income tax benefit	(44)	(45)	(43)	Provision for income taxes
AOCI reclassifications, net of tax	$76	$76	$72

	2017	2016	2015
Foreign currency translation gain (loss):
Balance at beginning of period	$(514)	$(253)	$81
Translation adjustments	(171)	(261)	(334)
Balance at end of period	(685)	(514)	(253)
Retirement plans adjustments:
Balance at beginning of period	345	425	425
Prior service credit and other arising during period	1	(4)	72
Reclassifications from AOCI	(76)	(76)	(72)
Balance at end of period	270	345	425
Accumulated other comprehensive (loss) income at end of period	$(415)	$(169)	$172

2018	$81
2019	71
2020	55
2021	44
2022	41


	2017			2016
	Gross Carrying Amount	Accumulated Amortization	Net Book Value	Gross Carrying Amount	Accumulated Amortization	Net Book Value
Customer relationships	$656	$(203)	$453	$912	$(156)	$756
Technology	54	(26)	28	123	(16)	107
Trademarks and other	136	(88)	48	202	(57)	145
Total	$846	$(317)	$529	$1,237

	2017	2016
Accrued Salaries and Employee Benefits
Salaries	$431	$478
Employee benefits, including variable compensation	781	804
Compensated absences	702	690
	$1,914	$1,972
Accrued Expenses
Self-insurance accruals	$976	$837
Taxes other than income taxes	283	311
Other	1,971	1,915
	$3,230	$3,063

(in millions, except per share amounts)	First Quarter	Second Quarter	Third Quarter	Fourth Quarter
2017⁽¹⁾
Revenues	$14,663	$14,931	$14,997	$15,728
Operating income	1,264	1,167	1,025	1,581
Net income	715	700	562	1,020
Basic earnings per common share⁽²⁾	2.69	2.63	2.11	3.81
Diluted earnings per common share⁽²⁾	2.65	2.59	2.07	3.75
2016⁽³⁾
Revenues	$12,279	$12,453	$12,654	$12,979
Operating income (loss)	1,144	1,137	864	(68)
Net income (loss)	692	691	507	(70)
Basic earnings (loss) per common share⁽²⁾	2.45	2.47	1.86	(0.26)
Diluted earnings (loss) per common share⁽²⁾	2.42	2.44	1.84	(0.26)

	Aircraft and Aircraft Related	Other⁽¹⁾	Total
2018	$1,777	$1,440	$3,217
2019	1,729	508	2,237
2020	1,933	400	2,333
2021	1,341	309	1,650
2022	1,276	198	1,474
Thereafter	2,895	499	3,394
Total	$10,951	$3,354	$14,305

	B767F	B777F	Total
2018	14	4	18
2019	15	2	17
2020	16	3	19
2021	10	3	13
2022	10	4	14
Thereafter	6	-	6
Total	71	16	87

	2017	Percent of Revenue 2017
Revenues	$7,401	100.0%
Operating expenses:
Salaries and employee benefits	2,077	28.1
Purchased transportation	3,049	41.2
Rentals	353	4.8
Depreciation and amortization	239	3.2
Fuel	225	3.1
Maintenance and repairs	143	1.9
Intercompany charges	17	0.2
Other	1,214	16.4
Total operating expenses	7,317	98.9%
Operating income	$84
Operating margin	1.1%
Package:
Average daily packages	1,022
Revenue per package (yield)	$24.77
Freight:
Average daily pounds	3,608
Revenue per pound (yield)	$0.56


	2017			2016
	Total Number of Shares Purchased	Average Price Paid per Share	Total Purchase Price	Total Number of Shares Purchased	Average Price Paid per Share	Total Purchase Price
Common stock repurchases	2,955,000	$172.13	$509	18,225,000

	2017	2016
Funded Status of Plans:
Projected benefit obligation (PBO)	$29,913	$29,602
Fair value of plan assets	26,312	24,271
Funded status of the plans	$(3,601)	$(5,331)
Cash Amounts:
Cash contributions during the year	$2,115	$726
Benefit payments during the year	$2,310	$912

Measurement Date	Discount Rate
5/31/2017	4.08%
5/31/2016	4.13
5/31/2015	4.42
5/31/2014	4.60


		Net Book Value at May 31,
	Range	2017	2016
Wide-body aircraft and related equipment	15 to 30 years	$9,103	$8,356
Narrow-body and feeder aircraft and related equipment	5 to 18 years	3,099	3,180
Package handling and ground support equipment	3 to 30 years	3,862	3,249
Information technology	2 to 10 years	1,114	1,051
Vehicles	3 to 15 years	3,400	3,084
Facilities and other	2 to 40 years	5,403

	2017	2016	2015
Low	3.25%	2.75%	4.50%
High	4.50	4.50	7.00
Weighted-average	4.03	3.82	5.90