Data Asset Exchange - PubLayNet

Sample Records

[Image: example of annotated images]


Format: JPG images with JSON label
License: CDLA-Permissive
Domain: Computer Vision
Number of Records: 340,000 images
Size: 102 GB
Origin: Images of research papers from PubMed and annotations from IBM Research Australia.
Dataset Version Update: Version 1 – August 07, 2019
Dataset Coverage: The dataset contains images of research papers from the medical domain.
Business Use Case

    Document Understanding

  • The dataset can be used to train a model that extracts document elements such as tables, figures, and text. This can help businesses that handle large numbers of documents categorize the elements in those documents.

Dataset Feature Definition

Feature: Description
images: JSON field containing a list of images and their metadata (size, ID, name).
annotations: Each object instance annotation contains a series of fields, including the category ID and segmentation mask of the object.
annotations -> segmentations: Contains the polygon coordinates of the segmentation mask for the specific class instance (table, list, text, etc.).
annotations -> bbox: Contains the bounding box coordinates for the specific class instance (table, list, text, etc.).
annotations -> is_crowd: Indicates whether the class instance is a single object (is_crowd=0) or multiple objects (is_crowd=1). This dataset contains only single objects, so the field is always 0.
annotations -> category_id: The class label for the current class instance. It indicates what the current bbox/segmentation mask encloses (table, list, text, etc.).
categories: JSON field containing a list of classes and their metadata (ID, name). This dataset has 5 categories, with the following IDs: text ("1"), title ("2"), list ("3"), table ("4"), figure ("5").
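
The fields described above can be explored with a few lines of Python. The following is a minimal sketch, not the dataset's official tooling. It runs on a small hand-written dictionary rather than a real label file. It also assumes COCO-style conventions that this page does not confirm: the key names `file_name`, `width`, `height`, and `image_id`, integer category IDs, and a bbox layout of [x, y, width, height]. The helper names `count_by_category` and `bbox_to_corners` are hypothetical. Check these assumptions against the actual JSON before relying on them.

```python
from collections import Counter

# Hypothetical miniature label structure mirroring the fields described above.
# Key names other than images/annotations/categories/bbox/category_id are
# assumptions based on COCO-style conventions.
labels = {
    "images": [
        {"id": 1, "file_name": "page_0001.jpg", "width": 612, "height": 792},
    ],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 2, "bbox": [50.0, 20.0, 300.0, 30.0]},
        {"id": 11, "image_id": 1, "category_id": 1, "bbox": [50.0, 60.0, 500.0, 120.0]},
        {"id": 12, "image_id": 1, "category_id": 4, "bbox": [50.0, 200.0, 500.0, 250.0]},
        {"id": 13, "image_id": 1, "category_id": 1, "bbox": [50.0, 470.0, 500.0, 100.0]},
    ],
    "categories": [
        {"id": 1, "name": "text"},
        {"id": 2, "name": "title"},
        {"id": 3, "name": "list"},
        {"id": 4, "name": "table"},
        {"id": 5, "name": "figure"},
    ],
}


def count_by_category(labels):
    """Return the number of annotations per category name (zero if unused)."""
    id_to_name = {c["id"]: c["name"] for c in labels["categories"]}
    counts = Counter(id_to_name[a["category_id"]] for a in labels["annotations"])
    return {name: counts.get(name, 0) for name in id_to_name.values()}


def bbox_to_corners(bbox):
    """Convert an assumed [x, y, width, height] box to (x_min, y_min, x_max, y_max)."""
    x, y, w, h = bbox
    return (x, y, x + w, y + h)


category_counts = count_by_category(labels)
print(category_counts)

for ann in labels["annotations"]:
    print(ann["id"], bbox_to_corners(ann["bbox"]))
```

For a real label file, the `labels` dictionary would instead be loaded with `json.load`. The label files for a dataset of this size may be large, so loading one fully into memory is a design choice worth revisiting.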