FinTabNet

Format	PDF, JSON
License	CDLA-Permissive
Domain	Computer Vision
Number of Records	89,646 pages comprising 112,887 tables with cell structure
Size	16 GB
Data Origin	Publicly available earnings reports from S&P 500 companies, with annotations from IBM Research.
Dataset Version	Version 1 - November 2020
Dataset Coverage	The dataset consists of annotated earnings reports from S&P 500 companies.
Business Use Case	Table data extraction: the dataset can be used to train a model that extracts data from complex tables. This helps businesses that handle large numbers of tabular documents extract information and load it into their databases.

Image (screenshot of PDF; see the notebook for details on how to extract an image from a PDF)	Annotated Image (generated with the notebook)	JSON label

{
  "bbox": [
    50.3946,
    516.4759,
    302.2046000000001,
    569.906
  ],
  "filename": "FDX/2017/page_64.pdf",
  "html": {
    "cells": [
      {
        "tokens": []
      },
      {
        "bbox": [
          171.06,
          557.51,
          188.23,
          567.52
        ],
        "tokens": [
          "2",
          "0",
          "1",
          "7"
        ]
      },
      {
        "bbox": [
          227.39,
          557.51,
          244.56,
          567.52
        ],
        "tokens": [
          "2",
          "0",
          "1",
          "6"
        ]
      },
      {
        "bbox": [
          282.82,
          557.51,
          299.99,
          567.52
        ],
        "tokens": [
          "2",
          "0",
          "1",
          "5"
        ]
      },
      {
        "bbox": [
          50.39,
          543.83,
          109.25,
          553.84
        ],
        "tokens": [
          "M",
          "i",
          "n",
          "i",
          "m",
          "u",
          "m",
          " ",
          "r",
          "e",
          "n",
          "t",
          "a",
          "l",
          "s"
        ]
      },
      {
        "bbox": [
          162.78,
          543.83,
          188.23,
          553.84
        ],
        "tokens": [
          "$",
          "2",
          ",",
          "8",
          "1",
          "4"
        ]
      },
      {
        "bbox": [
          219.11,
          543.83,
          244.56,
          553.84
        ],
        "tokens": [
          "$",
          "2",
          ",",
          "3",
          "9",
          "4"
        ]
      },
      {
        "bbox": [
          274.54,
          543.83,
          299.99,
          553.84
        ],
        "tokens": [
          "$",
          "2",
          ",",
          "2",
          "4",
          "9"
        ]
      },
      {
        "bbox": [
          50.39,
          530.15,
          118.38,
          540.36
        ],
        "tokens": [
          "C",
          "o",
          "n",
          "t",
          "i",
          "n",
          "g",
          "e",
          "n",
          "t",
          " ",
          "r",
          "e",
          "n",
          "t",
          "a",
          "l",
          "s",
          "^{",
          "(",
          "1",
          ")",
          "}"
        ]
      },
      {
        "bbox": [
          175.3,
          530.15,
          188.23,
          540.16
        ],
        "tokens": [
          "1",
          "7",
          "8"
        ]
      },
      {
        "bbox": [
          231.63,
          530.15,
          244.56,
          540.16
        ],
        "tokens": [
          "2",
          "1",
          "4"
        ]
      },
      {
        "bbox": [
          287.06,
          530.15,
          299.99,
          540.16
        ],
        "tokens": [
          "1",
          "9",
          "4"
        ]
      },
      {
        "tokens": []
      },
      {
        "bbox": [
          162.78,
          516.47,
          188.23,
          526.48
        ],
        "tokens": [
          "$",
          "2",
          ",",
          "9",
          "9",
          "2"
        ]
      },
      {
        "bbox": [
          219.11,
          516.47,
          244.56,
          526.48
        ],
        "tokens": [
          "$",
          "2",
          ",",
          "6",
          "0",
          "8"
        ]
      },
      {
        "bbox": [
          274.54,
          516.47,
          299.99,
          526.48
        ],
        "tokens": [
          "$",
          "2",
          ",",
          "4",
          "4",
          "3"
        ]
      }
    ],
    "structure": {
      "tokens": [
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        " ",
        " ",
        " ",
        "\n",
        " ",
        " ",
        " ",
        "\n",
        " ",
        " ",
        " ",
        "\n",
        " ",
        " ",
        " ",
        ""
      ]
    }
  },
  "split": "val",
  "table_id": 55224
}
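
In the sample record above, each cell stores its text as a list of single-character tokens (with a few multi-character markers such as "^{"), and empty cells carry no "bbox" key. A minimal sketch of how the cell text could be reassembled, assuming that plain concatenation is the intended reading of the tokens; the helper name cell_text is hypothetical and not part of the dataset or its notebook:

```python
from typing import Dict, List


def cell_text(cell: Dict) -> str:
    """Join a cell's tokens into one string; empty cells yield ""."""
    return "".join(cell.get("tokens", []))


# Three cells copied from the sample record above.
cells: List[Dict] = [
    {"tokens": []},
    {"bbox": [171.06, 557.51, 188.23, 567.52],
     "tokens": ["2", "0", "1", "7"]},
    {"bbox": [162.78, 543.83, 188.23, 553.84],
     "tokens": ["$", "2", ",", "8", "1", "4"]},
]

texts = [cell_text(c) for c in cells]
# Cells without a "bbox" key are the empty ones in this sample.
has_box = ["bbox" in c for c in cells]

print(texts)
print(has_box)
```

Markers such as "^{" and "}" are kept verbatim by this sketch; whether they should be converted to superscript markup depends on the downstream use.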

Feature	Description
pdf	Folder containing all the PDFs, organized into subfolders by company stock ticker and year.
annotations	The JSON file of table annotations. Each table instance annotation contains a series of fields, including the HTML structure and the bounding box coordinates of the table. The top-level list in this file is referred to below as annotations.
annotations -> table_id	A unique ID assigned to each table in the dataset.
annotations -> html -> structure	Describes the HTML structure of the table.
annotations -> html -> cells	Contains the contents of the table cells: each cell's tokens and, for non-empty cells, its bounding box.
annotations -> split	The split the table belongs to: test, train, or val.
annotations -> filename	The name of the underlying PDF file that contains the table.
annotations -> bbox	The bounding box coordinates (in the PDF file) of the table.
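
The fields above can be combined to select tables by split and to map a table's bounding box onto a rendered page image. The following is a rough sketch under several assumptions that should be checked against the dataset and its notebook: that the annotations are stored as one JSON object per line, that bbox is [x0, y0, x1, y1] in PDF points with the origin at the bottom-left of the page (the sample record is consistent with this, since its header row has larger y values than the rows below it), and that the page height is known. The function names, the second record, the page height of 792 points, and the scale factor are all hypothetical.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Sequence, Tuple


def iter_annotations(lines: Iterable[str]) -> Iterator[Dict]:
    """Parse one annotation per non-empty line (assumed JSON Lines layout)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def group_by_split(records: Iterable[Dict]) -> Dict[str, List[int]]:
    """Map each split name to the table_ids assigned to it."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for record in records:
        groups[record["split"]].append(record["table_id"])
    return dict(groups)


def pdf_bbox_to_image(bbox: Sequence[float], page_height: float,
                      scale: float = 1.0) -> Tuple[float, float, float, float]:
    """Flip the y axis (bottom-left origin to top-left) and scale to pixels."""
    x0, y0, x1, y1 = bbox
    return (x0 * scale, (page_height - y1) * scale,
            x1 * scale, (page_height - y0) * scale)


# First line: values from the sample record (bbox rounded).
# Second line: a made-up record, included only to show grouping.
sample_lines = [
    json.dumps({"table_id": 55224, "split": "val",
                "filename": "FDX/2017/page_64.pdf",
                "bbox": [50.3946, 516.4759, 302.2046, 569.906]}),
    json.dumps({"table_id": 1, "split": "train",
                "filename": "XYZ/2000/page_1.pdf",
                "bbox": [10.0, 20.0, 30.0, 40.0]}),
]

records = list(iter_annotations(sample_lines))
by_split = group_by_split(records)

# Assumed page height of 792 pt and a 2x render scale, for illustration only.
image_box = pdf_bbox_to_image([50.0, 500.0, 300.0, 570.0],
                              page_height=792.0, scale=2.0)
print(by_split)
print(image_box)
```

In practice the page height should be read from the PDF itself instead of being hard-coded, and the scale should match the resolution at which the page image was rendered.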