GitHub - IBM/SynthTabNet: Dataset of PNG images from synthetically generated table layouts with annotations in JSONL files

SynthTabNet is a dataset of 600k PNG images of synthetically generated table layouts, with annotations in JSONL files.

Overview

SynthTabNet is a synthetically generated dataset that contains annotated images of data in tabular layouts.

It has been shown that other non-synthetic datasets like PubTabNet, FinTabNet and TableBank suffer from many limitations:

SynthTabNet aims to overcome these limitations by providing:

SynthTabNet is organized into 4 parts of 150k tables each (600k in total). Each part contains tables with a different appearance in terms of size, structure, style and content. Every part is divided into Train, Test and Val splits (80%, 10%, 10%). The tables are delivered as PNG images and the annotations are in JSONL format.

A detailed description of the data synthesis process can be found in the paper.

Download

v2.0.0

| Appearance style | Records | Size (GB) | URL v2.0.0 |
|---|---|---|---|
| Fintabnet | 150k | 10 | SynthTabNet-part1 |
| Marketing | 150k | 8 | SynthTabNet-part2 |
| PubTabNet | 150k | 6 | SynthTabNet-part3 |
| Sparse | 150k | 3 | SynthTabNet-part4 |

v2.0.0 MD5 checksums

v2.0.0 SHA1 checksums

v1.0.0

| Appearance style | Records | Size (GB) | URL v1.0.0 |
|---|---|---|---|
| Fintabnet | 150k | 10 | SynthTabNet-part1 |
| Marketing | 150k | 8 | SynthTabNet-part2 |
| PubTabNet | 150k | 6 | SynthTabNet-part3 |
| Sparse | 150k | 3 | SynthTabNet-part4 |

v1.0.0 MD5 checksums

v1.0.0 SHA1 checksums
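
A downloaded archive can be checked against the published MD5 or SHA1 checksums before it is unpacked. The following is a minimal sketch using Python's standard `hashlib`; the helper names `file_digest` and `matches_checksum` are hypothetical and not part of the dataset tooling, and the expected hex string is assumed to be copied from the corresponding checksum file.

```python
import hashlib


def file_digest(path, algorithm="md5", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.new(algorithm)  # accepts "md5" or "sha1", among others
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex, algorithm="md5"):
    """Compare a file's digest with a published checksum, ignoring case and whitespace."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()
```

Reading in chunks matters here because the archives are several gigabytes each, so loading one fully into memory would be wasteful.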

Data format

Each part of the dataset corresponds to a top level directory (fintabnet, marketing, pubtabnet, sparse) and has the following structure:

├── images
│   ├── test
│   ├── train
│   └── val
├── synthetic_data.jsonl

The annotations for each part are in the synthetic_data.jsonl file. Each line is a JSON object that corresponds to one PNG image and has the following structure:

"filename": "png image filename inside one of the 'test', 'train', 'val' directories",
"split": "One of 'test', 'train', 'val'",
"html": "Table structure and content",
    "cells": "Array with all table cells",
        "cell_id": "Zero based cell counter",
        "is_header": "true if that cell is part of the table header",
        "span": "In case there is a rowspan / columnspan",
            "spantype": "One of 'rowspan', 'colspan', '2dspan'. The '2dspan' is used in case there is a rowspan and colspan in the same cell",
            "rowspan": "Number of rowspans for this cell",
            "colspan": "Number of colspans for this cell"
        "tokens": "Array with the tokenized content of the cell",
        "bbox": "The bounding bbox and the class of the cell in [x1, y1, x2, y2, class] format"
    "structure":
        "tokens": "Array with html tags that describe the table structure"

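As an illustration of the layout and annotation structure described above, the sketch below reads one part's synthetic_data.jsonl line by line and pairs each record with its image path. It assumes that "cells" and "structure" are nested under "html", as the indentation of the structure suggests; the function names `iter_annotations` and `cell_boxes` are hypothetical.

```python
import json
from pathlib import Path


def iter_annotations(part_dir, split=None):
    """Yield (image_path, record) pairs from a part's synthetic_data.jsonl.

    If split is given ('train', 'test' or 'val'), other records are skipped.
    """
    part_dir = Path(part_dir)
    with open(part_dir / "synthetic_data.jsonl", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            if split is not None and record["split"] != split:
                continue
            # Images are stored under images/<split>/<filename>.
            image_path = part_dir / "images" / record["split"] / record["filename"]
            yield image_path, record


def cell_boxes(record):
    """Return [x1, y1, x2, y2, class] entries as tuples, one per cell with a bbox.

    Assumption: cells live in record["html"]["cells"]; cells lacking a "bbox"
    key are skipped in case some records omit it.
    """
    return [tuple(cell["bbox"]) for cell in record["html"]["cells"] if "bbox" in cell]
```

For example, `for path, rec in iter_annotations("fintabnet", split="val")` would iterate over the validation tables of the fintabnet part without loading the whole file into memory.
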
Regarding the bbox parameter notice that:

The structure tokens can be one of:

" colspan=\"10\"", " colspan=\"2\"", " colspan=\"3\"", " colspan=\"4\"", " colspan=\"5\"",
" colspan=\"6\"", " colspan=\"7\"", " colspan=\"8\"", " colspan=\"9\"", " rowspan=\"10\"",
" rowspan=\"2\"", " rowspan=\"3\"", " rowspan=\"4\"", " rowspan=\"5\"", " rowspan=\"6\"",
" rowspan=\"7\"", " rowspan=\"8\"", " rowspan=\"9\"", "</tbody>", "</td>", "</thead>",
"</tr>", "<end>", "<pad>", "<start>", "<tbody>", "<td", "<td>", "<thead>", "<tr>", "<unk>", ">"

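To show how the structure tokens and the cell tokens fit together, here is a hedged sketch that rebuilds an HTML string from one annotation. It assumes the PubTabNet-style convention that cells are listed in the same order as the `<td>` elements in the structure, and that cell content belongs right after a `"<td>"` token or after the `">"` token that closes a `"<td"` with span attributes. The function name `build_html` is hypothetical, and the special tokens `<start>`, `<end>`, `<pad>` and `<unk>` are simply dropped.

```python
SPECIAL_TOKENS = {"<start>", "<end>", "<pad>", "<unk>"}


def build_html(structure_tokens, cells):
    """Interleave structure tokens with cell contents into one HTML string.

    Assumption: the i-th cell fills the i-th <td> opened in the structure.
    """
    parts = []
    cell_iter = iter(cells)
    for token in structure_tokens:
        if token in SPECIAL_TOKENS:
            continue  # sequence markers carry no HTML
        parts.append(token)
        # "<td>" opens a plain cell; ">" closes a "<td" that carried span attributes.
        if token in ("<td>", ">"):
            cell = next(cell_iter, None)
            if cell is not None:
                parts.append("".join(cell["tokens"]))
    return "".join(parts)
```

With a record loaded from synthetic_data.jsonl, and under the same nesting assumption as in the structure above, the call would look like `build_html(record["html"]["structure"]["tokens"], record["html"]["cells"])`.
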
Example data

pubtabnet

sparse

fintabnet

marketing

Jupyter notebook

Here is a Jupyter notebook that demonstrates how to download and use the dataset:

Demo Notebook

Paper

"TableFormer: Table Structure Understanding with Transformers" (CVPR 2022).

ArXiv link: https://arxiv.org/abs/2203.01017

Citation:

@article{nassar2022tableformer,
  title={TableFormer: Table Structure Understanding with Transformers},
  author={Nassar, Ahmed and Livathinos, Nikolaos and Lysak, Maksym and Staar, Peter},
  journal={arXiv preprint arXiv:2203.01017},
  year={2022}
}