PDF-TREX: An Approach for Recognizing and Extracting Tables from PDF Documents

TAO: System for Table Detection and Extraction from PDF Documents

2016

Digital documents present knowledge in most areas of study, exchanging and communicating information in a portable way. To make better use of the knowledge embedded in this ever-growing information source, effective tools for automatic information extraction are needed. Tables are crucial information elements in scientific documents; most publications use tables to represent and report concrete research findings. Current methods for extracting table data from PDF documents lack precision in detecting, extracting, and representing data from diverse layouts. We present the TAble Organization (TAO) system to automatically detect, extract and organize information from tables in PDF documents. TAO uses a processing pipeline based on the k-nearest neighbor method and layout heuristics to detect tables within a document and to extract table information. The system generates an enriched representation of the data extracted from tables in the PDF documents. TAO's performance is comparable to o...
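
As an illustration of the detection step described above, the following minimal Python sketch classifies page text blocks as table or body text with a hand-rolled k-nearest-neighbour vote. The layout features (numeric-token ratio, whitespace-gap count, relative line spacing) and the toy training data are assumptions made for illustration, not TAO's actual pipeline.

# Minimal sketch (not TAO's actual implementation): label page text blocks
# as "table" or "text" by a k-nearest-neighbour vote over a few hypothetical
# layout features.
from collections import Counter
import math

def knn_label(features, training_set, k=3):
    """training_set: list of (feature_vector, label) pairs."""
    dists = sorted(
        (math.dist(features, vec), label) for vec, label in training_set
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy training data: (numeric-token ratio, gap count, line spacing), label.
training = [
    ((0.70, 5.0, 1.0), "table"),
    ((0.55, 4.0, 1.1), "table"),
    ((0.05, 0.0, 1.4), "text"),
    ((0.10, 1.0, 1.5), "text"),
]

print(knn_label((0.60, 4.5, 1.05), training))  # -> "table"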

TabbyPDF: Web-Based System for PDF Table Extraction

Communications in Computer and Information Science, 2018

PDF is one of the most widespread formats for representing non-editable documents. Many PDF documents are machine-readable but remain untagged: they carry no tags identifying layout items such as paragraphs, columns, or tables. An important challenge with these documents is how to extract tabular data from them. The paper presents a novel web-based system for extracting tables located in untagged PDF documents with a complex layout, for recovering their cell structures, and for exporting them into a tagged form (e.g. CSV or HTML). The system uses a heuristic-based approach to table detection and structure recognition. It mainly relies on recovering the human reading order of the text, including document paragraphs and table cells. A prototype of the system was evaluated using the methodology and dataset of the ICDAR 2013 Table Competition. The standard F-score metric is 93.64% for the structure recognition phase and 83.18% for table extraction with automatic table detection. These results are comparable with state-of-the-art academic solutions.
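
The reading-order recovery that TabbyPDF relies on can be pictured with a small sketch like the one below, which sorts positioned text chunks top-to-bottom and then left-to-right. The Chunk structure and the line tolerance are assumptions for illustration, not the system's actual heuristics.

# Minimal sketch: order positioned text chunks into a rough human reading
# order by grouping chunks whose baselines fall on (roughly) the same line.
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    x: float  # left edge
    y: float  # baseline, measured from the top of the page

def reading_order(chunks, line_tolerance=2.0):
    ordered = sorted(chunks, key=lambda c: (c.y, c.x))
    lines, current = [], []
    for chunk in ordered:
        if current and abs(chunk.y - current[-1].y) > line_tolerance:
            lines.append(current)
            current = []
        current.append(chunk)
    if current:
        lines.append(current)
    # Within a line, left-to-right order gives the word/cell sequence.
    return [[c.text for c in sorted(line, key=lambda c: c.x)] for line in lines]

chunks = [Chunk("Total", 10, 50), Chunk("42", 120, 50.5),
          Chunk("Item", 10, 30), Chunk("Qty", 120, 30)]
print(reading_order(chunks))  # -> [['Item', 'Qty'], ['Total', '42']]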

pdf2table: A Method to Extract Table Information from PDF Files

2005

Tables are a common structuring element in many documents, such as PDF files. To reuse such tables, appropriate methods need to be developed that capture both their structure and their content. We have developed several heuristics which together recognize and decompose tables in PDF files and store the extracted data in a structured format (XML) for easier reuse. Additionally, we implemented a prototype that allows the user to make adjustments to the extracted data. Our work shows that purely heuristic-based approaches can achieve good results, especially for lucid tables.
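
A minimal sketch of the export step, assuming a hypothetical XML schema (pdf2table's real output format is not reproduced here):

# Serialize extracted table cells into XML with the standard library.
# The <table>/<row>/<cell> element names are an assumed, illustrative schema.
import xml.etree.ElementTree as ET

def table_to_xml(rows):
    """rows: list of lists of cell strings for one recognized table."""
    table = ET.Element("table")
    for cells in rows:
        row_el = ET.SubElement(table, "row")
        for text in cells:
            ET.SubElement(row_el, "cell").text = text
    return ET.tostring(table, encoding="unicode")

print(table_to_xml([["Year", "Revenue"], ["2004", "1.2M"]]))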

A methodology for evaluating algorithms for table understanding in PDF documents

2012

This paper presents a methodology for the evaluation of table understanding algorithms for PDF documents. The evaluation takes into account three major tasks: table detection, table structure recognition and functional analysis. We provide a general and flexible output model for each task along with corresponding evaluation metrics and methods. We also present a methodology for collecting and ground-truthing PDF documents based on consensus-reaching principles and provide a publicly available ground-truthed dataset.
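
One common way to score such tasks against ground truth is precision, recall and F-score over sets of detected versus ground-truth items (for structure recognition, often cell adjacency relations). The sketch below illustrates that computation; it is not the paper's exact metric definition.

# Precision, recall and F-score over sets of detected vs. ground-truth items.
def precision_recall_f1(detected, ground_truth):
    detected, ground_truth = set(detected), set(ground_truth)
    true_positives = len(detected & ground_truth)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Adjacency relations encoded as (cell_a, cell_b, direction) triples.
gt = {("A1", "B1", "h"), ("A1", "A2", "v"), ("B1", "B2", "v")}
det = {("A1", "B1", "h"), ("A1", "A2", "v"), ("B1", "C1", "h")}
print(precision_recall_f1(det, gt))  # -> (0.666..., 0.666..., 0.666...)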

Detection, extraction and representation of tables

Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings., 2003

We are concerned with the extraction of tables from exchange format representations of very diverse composite documents. We put forward a flexible representation scheme for complex tables, based on a clear distinction between the physical layout of a table and its logical structure. Relying on this scheme, we develop a new method for the detection and the extraction of tables by an analysis of the graphic lines.
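
A minimal sketch of the general idea of grid detection from graphic ruling lines, with a simplified line representation assumed for illustration (the paper's actual analysis is more involved):

# Infer a candidate table region from ruling lines by intersecting the spans
# of horizontal and vertical lines. Lines are (x0, y0, x1, y1) segments.
def grid_region(h_lines, v_lines):
    ys = sorted({y0 for _, y0, _, _ in h_lines})
    xs = sorted({x0 for x0, _, _, _ in v_lines})
    if len(ys) < 2 or len(xs) < 2:
        return None  # fewer than 2x2 rulings: no grid-like structure
    # Bounding box of the ruling grid and the number of cells it encloses.
    return (xs[0], ys[0], xs[-1], ys[-1]), (len(xs) - 1) * (len(ys) - 1)

h = [(10, 20, 200, 20), (10, 40, 200, 40), (10, 60, 200, 60)]
v = [(10, 20, 10, 60), (100, 20, 100, 60), (200, 20, 200, 60)]
print(grid_region(h, v))  # -> ((10, 20, 200, 60), 4), i.e. a 2x2 cell grid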

Automatic Selection of Table Areas in Documents for Information Extraction

Lecture Notes in Computer Science, 2003

Information contained in companies' financial statements is valuable for decision making at various levels. Much of the relevant information in such documents is contained in tables and is currently extracted mainly by hand. We propose a method that accomplishes a preliminary step of the task of automatically extracting information from tables in documents: selecting the lines of the document which are likely to belong to the tables containing the information to be extracted. Our method has been developed by empirically analyzing a set of Portuguese companies' financial statements, using statistical and data mining techniques. Empirical evaluation indicates that more than 99% of the table lines are selected while at least 50% of the document's lines are discarded. The method can cope with the complexity of styles used in assembling information on paper and adapt its performance accordingly, thus maximizing its results.
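
The selection step can be pictured with a simple heuristic filter like the one below. The features (wide whitespace gaps, numeric-token density) are illustrative assumptions; the paper derives its selector empirically with statistical and data mining techniques.

# Flag a text line as a candidate table line when it has several wide
# whitespace gaps or is dominated by numeric tokens (illustrative rule only).
import re

def is_candidate_table_line(line, min_gaps=2, numeric_ratio=0.5):
    gaps = len(re.findall(r" {2,}", line))
    tokens = line.split()
    numeric = sum(bool(re.fullmatch(r"[\d.,()%-]+", t)) for t in tokens)
    return gaps >= min_gaps or (tokens and numeric / len(tokens) >= numeric_ratio)

print(is_candidate_table_line("Net sales      1,234.5   1,100.2"))     # True
print(is_candidate_table_line("The company reported a strong year."))  # False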

Notes on Contemporary Table Recognition

The shift of interest to web tables in HTML and PDF files, coupled with the incorporation of table analysis and conversion routines in commercial desktop document processing software, is likely to turn table recognition into more of a systems issue than an algorithmic one. We illustrate the transition with some actual examples of web table conversion. We then suggest that the appropriate target format for table analysis, whether performed by conventional customized programs or by off-the-shelf software, is a representation based on the abstract table introduced by X. Wang in 1996. We show that the Wang model is adequate for some useful tasks that prove elusive for less explicit representations, and outline our plans to develop a semi-automated table processing system to demonstrate this approach. Screen snapshots of a prototype tool that allows table mark-up in the style of Wang are also presented.
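
A minimal sketch of the idea behind Wang's abstract table model: the logical table is a mapping from one label per category to an entry value, independent of any physical row/column layout. The category names and values below are made up for illustration.

# Abstract table in the spirit of Wang: categories of labels plus a mapping
# (delta) from a label per category to an entry.
from itertools import product

categories = {
    "Year": ["2015", "2016"],
    "Measure": ["Revenue", "Profit"],
}

# delta: frozenset of (category, label) pairs -> entry value.
delta = {
    frozenset({("Year", "2015"), ("Measure", "Revenue")}): 120,
    frozenset({("Year", "2015"), ("Measure", "Profit")}): 15,
    frozenset({("Year", "2016"), ("Measure", "Revenue")}): 140,
    frozenset({("Year", "2016"), ("Measure", "Profit")}): 21,
}

def lookup(**labels):
    return delta[frozenset(labels.items())]

print(lookup(Year="2016", Measure="Profit"))  # -> 21

# Any physical layout (years as rows or as columns) can be generated by
# enumerating the label combinations:
for combo in product(*([(c, l) for l in labels] for c, labels in categories.items())):
    print(dict(combo), "->", delta[frozenset(combo)])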

Locating Tables in Scanned Documents for Reconstructing and Republishing

The pool of knowledge available to mankind depends on learning resources that range from ancient printed documents to present-day electronic material. The rapid conversion of material held in traditional libraries into digital form requires significant work if the electronic documents are to preserve the format and look of their printed counterparts. Most printed documents contain not only characters and their formatting but also associated non-text objects such as tables, charts and graphical objects. Detecting these objects and preserving their formatting during reproduction is challenging. To address this issue, we propose an algorithm that uses local thresholds for word space and line height to locate and extract all categories of tables from scanned document images. From experiments performed on 298 documents, we conclude that our algorithm has an overall accuracy of about 75% in detecting tables in scanned document images. Since the algorithm does not depend entirely on ruling lines, it can detect all categories of tables in a range of scanned documents with different font types, styles and sizes and extract their formatting features. Moreover, the algorithm can be applied to locate tables in multi-column layouts with a small modification to the layout analysis. Treating tables together with their existing formatting features greatly helps in reproducing printed documents for reprinting and updating purposes.
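
A minimal sketch of thresholding in this spirit, applied to the word boxes of a single text line. The median-based thresholds and the two-gap rule are assumptions for illustration, not the paper's exact rules.

# Derive local thresholds from a line's word boxes, then flag the line as a
# likely table line when enough inter-word gaps exceed the word-space threshold.
from statistics import median

def local_thresholds(word_boxes, factor=2.0):
    """word_boxes: list of (x0, y0, x1, y1) per word in one text line."""
    boxes = sorted(word_boxes)
    gaps = [b[0] - a[2] for a, b in zip(boxes, boxes[1:]) if b[0] > a[2]]
    heights = [y1 - y0 for _, y0, _, y1 in boxes]
    word_space = factor * median(gaps) if gaps else 0.0
    line_height = median(heights) if heights else 0.0
    return word_space, line_height

def looks_like_table_line(word_boxes):
    word_space, _ = local_thresholds(word_boxes)
    boxes = sorted(word_boxes)
    wide_gaps = sum(1 for a, b in zip(boxes, boxes[1:]) if b[0] - a[2] > word_space)
    return wide_gaps >= 2  # at least two column-like separations

line = [(10, 100, 60, 112), (70, 100, 110, 112), (122, 100, 160, 112),
        (250, 100, 290, 112), (301, 100, 340, 112), (435, 100, 470, 112)]
print(looks_like_table_line(line))  # two column-sized gaps -> True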

Table Extraction from PDF Files using combination of Heuristics and Machine Learning Techniques

IICAI, 2009

This paper presents a decomposition algorithm for solving the uniform Travelling Salesman Problem (TSP). The algorithm assumes that the initial set of points is divided into subsets with overlapping regions between them. The algorithm starts by considering one subset and finding the solution within it. The solution for the next subset is calculated by expanding the solution of the current subset, and this process continues until all the subsets have been considered. In this way, a given problem with n points is replaced by m problems, each having k (k << n) points. The algorithm is tested with 10,000 points and the results show that it considerably reduces the computational time. In addition, its solution quality is close to that of the best known heuristic algorithm.
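
An illustrative sketch of the decomposition idea described in the abstract, using x-axis slabs and a nearest-neighbour tour seeded from the previous slab's endpoint. The slab partitioning and the omission of the overlapping regions are simplifications for illustration, not the authors' exact algorithm.

# Split the points into slabs along the x-axis, build a nearest-neighbour tour
# inside each slab starting from the end of the previous slab's tour, and
# concatenate the partial tours.
import math

def nearest_neighbour_tour(points, start):
    tour, remaining, current = [], set(points), start
    while remaining:
        nxt = min(remaining, key=lambda p: math.dist(current, p))
        tour.append(nxt)
        remaining.remove(nxt)
        current = nxt
    return tour

def decomposed_tour(points, slabs=4):
    points = sorted(points)                      # sort by x coordinate
    size = math.ceil(len(points) / slabs)
    tour, start = [], points[0]
    for i in range(0, len(points), size):
        part = nearest_neighbour_tour(points[i:i + size], start)
        tour.extend(part)
        start = tour[-1]                         # expand from the previous end
    return tour

pts = [(x % 7, x % 13) for x in range(40)]
print(len(decomposed_tour(pts)) == len(set(pts)))  # every point visited once -> True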