Using natural language processing for identifying and interpreting tables in plain text

Recognizing and Interpreting Tables

2000

Tables are a pervasive problem in the real-world corpora that are the focus of information extraction applications. Moreover, they constitute a problem of significant linguistic interest. In this paper we present a general method for recognizing and interpreting tables in text, and describe its implementation in a particular application.
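As an illustration of what recognizing a table in plain text can involve, here is a minimal sketch. It is not the method of the paper above, whose details the abstract does not give. It assumes that cells are separated by runs of two or more spaces and that a table is a run of consecutive lines with the same number of cells; the names `split_cells` and `find_table_blocks` are hypothetical.

```python
import re

# Assumption: two or more whitespace characters separate adjacent cells.
CELL_SEP = re.compile(r"\s{2,}")


def split_cells(line):
    """Split one text line into candidate cells."""
    return [cell for cell in CELL_SEP.split(line.strip()) if cell]


def find_table_blocks(lines, min_rows=2, min_cols=2):
    """Group consecutive lines with the same cell count into table candidates."""
    blocks = []
    current = []
    for line in lines:
        cells = split_cells(line)
        fits = len(cells) >= min_cols and (
            not current or len(cells) == len(current[0])
        )
        if fits:
            current.append(cells)
            continue
        # The current run ends here; keep it only if it is long enough.
        if len(current) >= min_rows:
            blocks.append(current)
        # The line that broke the run may itself start a new candidate.
        current = [cells] if len(cells) >= min_cols else []
    if len(current) >= min_rows:
        blocks.append(current)
    return blocks
```

A heuristic like this only covers recognition. Interpretation, that is, deciding which cells are headers and what the values mean, needs further analysis of the kind the paper addresses.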

Natural Computing: Analysis of Tables for Computer Representation

Some fundamental objects in practical documents have not been implemented in software so that they can be used easily for calculating. One such object is the table, despite the mistaken view that databases are adequate representations of tables. A survey of practical tables found in a variety of real-world documents reveals that many of their useful features are not captured in software. This report proposes data structures and computer representation for table objects. Through the adoption of such structures and representations, practical table objects can be developed for use by domain specialists. Such tables embedded in electronic documents can be used in interactive applications to retrieve data, but most importantly, they can be used as functional representations for copying and pasting into procedures and programs.

Tabular Representations in Relational Documents

Relational Methods in Computer Science, 1997

The use of relations, represented as tables, for documenting the requirements and behaviour of software is motivated and explained. A formal model of tabular expressions, defining the meaning of a large class of tabular forms, is presented. Finally, we discuss the transformation of tabular expressions from one form to another, and illustrate some useful transformations.

A Tabular Survey of Automated Table Processing

Lecture Notes in Computer Science, 2000

Tables are the only acceptable means of communicating certain types of structured data. A precise definition of "tabularity" remains elusive because some bureaucratic forms, multicolumn text layouts, and schematic drawings share many characteristics of tables. There are significant differences between typeset tables, electronic files designed for display of tables, and tables in symbolic form intended for information retrieval. Although most research to date has addressed the extraction of low-level geometric information from scanned raster images of paper tables, the recent trend toward the analysis of tables in electronic form may pave the way to a higher level of table understanding. Recent research on table composition and table analysis has improved our understanding of the distinction between the logical and physical structures of tables, and has led to improved formalisms for modeling tables. The present study indicates that progress on half-a-dozen specific research issues would open the door to using existing paper and electronic tables for database update, tabular browsing, structured information retrieval through graphical and audio interfaces, multimedia table editing, and platform-independent display. Although tables are not a conventional format for conveying the primary content of technical papers, here we attempt to subdue our natural garrulity by adopting this genre to communicate what we have to say about tables entirely in tabular form.

Detection, extraction and representation of tables

Proceedings of the Seventh International Conference on Document Analysis and Recognition, 2003

We are concerned with the extraction of tables from exchange format representations of very diverse composite documents. We put forward a flexible representation scheme for complex tables, based on a clear distinction between the physical layout of a table and its logical structure. Relying on this scheme, we develop a new method for the detection and the extraction of tables by an analysis of the graphic lines.
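The distinction between physical layout and logical structure can be sketched as follows. This is an illustrative data structure, not the representation scheme of the paper: `PhysicalCell` and `to_logical` are hypothetical names, and the sketch assumes that row 0 holds the column headers and column 0 holds the row headers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhysicalCell:
    """A cell as it is laid out: grid position, extent, and text."""
    row: int
    col: int
    text: str
    rowspan: int = 1
    colspan: int = 1


def to_logical(cells):
    """Map (row_header, column_header) -> value.

    Assumption: row 0 contains column headers and column 0 contains row
    headers. A cell that spans several grid positions contributes its
    text to each position it covers.
    """
    grid = {}
    for cell in cells:
        for r in range(cell.row, cell.row + cell.rowspan):
            for c in range(cell.col, cell.col + cell.colspan):
                grid[(r, c)] = cell.text
    logical = {}
    for (r, c), text in grid.items():
        if r > 0 and c > 0:
            logical[(grid[(r, 0)], grid[(0, c)])] = text
    return logical
```

In this sketch a single merged cell in the layout becomes several entries in the logical view, which is one reason to keep the two descriptions separate.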

Table-processing paradigms: a research survey

International Journal of Document Analysis and Recognition (IJDAR), 2006

While everyone seems to know what a table is, a precise, analytical definition of "tabularity" remains elusive because some bureaucratic forms, multicolumn text layouts, and schematic drawings share many characteristics of tables. There are significant differences between typeset tables, electronic files designed for display of tables, and tables in symbolic form intended for information retrieval. Most past research has addressed the extraction of low-level geometric information from raster images of tables scanned from printed documents, although there is growing interest in the processing of tables in electronic form as well. Recent research on table composition and table analysis has improved our understanding of the distinction between the logical and physical structures of tables, and has led to improved formalisms for modeling tables. This review, which is structured in terms of generalized paradigms for table processing, indicates that progress on half-a-dozen specific research issues would open the door to using existing paper and electronic tables for database update, tabular browsing, structured information retrieval through graphical and audio interfaces, multimedia table editing, and platform-independent display.

Table structure understanding and its performance evaluation

Pattern Recognition, 2004

This paper presents a table structure understanding algorithm designed using optimization methods. The algorithm is probability based, where the probabilities are estimated from geometric measurements made on the various entities in a large training set. The methodology includes a global parameter optimization scheme, a novel automatic table ground truth generation system and a table structure understanding performance evaluation protocol. On a document data set containing 518 table entities and 10,934 cell entities, it achieved an accuracy of 96.76% at the cell level and 98.32% at the table level.

Semantically Conceptualizing and Annotating Tables

Lecture Notes in Computer Science, 2008

Enabling a system to automatically conceptualize and annotate a human-readable table is one way to create interesting semantic-web content. But exactly "how?" is not clear. With conceptualization and annotation in mind, we investigate a semantic-enrichment procedure as a way to turn syntactically observed table layout into semantically coherent ontological concepts, relationships, and constraints. Our semantic-enrichment procedure shows how to make use of auxiliary world knowledge to construct rich ontological structures and to populate these ontological structures with instance data. The system uses auxiliary knowledge (1) to recognize concepts and which data values belong to which concepts, (2) to discover relationships among concepts and which data-value combinations represent relationship instances, and (3) to discover constraints over the concepts and relationships that the data values and data-value combinations should satisfy. Experimental evaluations indicate that the automatic conceptualization and annotation processes perform well, yielding F-measures of 90% for concept recognition, 77% for relationship discovery, and 90% for constraint discovery in web tables selected from the geopolitical domain.
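For readers unfamiliar with the F-measure reported above, the following sketch computes the balanced F-measure (the harmonic mean of precision and recall) over sets of predicted and reference items. Whether the paper used exactly this balanced variant is an assumption; the function name `f_measure` and the example labels are hypothetical.

```python
def f_measure(predicted, gold):
    """Balanced F-measure of a predicted item set against a reference set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: two of three predicted concepts are correct, and
# two of three reference concepts are found, so precision = recall = 2/3.
score = f_measure({"Country", "Capital", "Area"},
                  {"Country", "Capital", "Population"})
```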

A new table interpretation methodology with little knowledge base

Proceedings of the 2006 ACM symposium on Applied computing - SAC '06, 2006

In this paper, a new methodology for table-form interpretation with little previous knowledge is presented. The first module identifies line intersections in a table-form. The second module detects and corrects wrong intersections produced by faulty intersection segments or by table artefacts (smudges, overlapping handwritten data, and faulty segments). The third module performs the table-form cell extraction. The features used to interpret the table-form are extracted directly from the image itself by means of morphological tools. The method was evaluated on a total of 305 table-form images. Experiments showed significant and promising results: the proposed approach reached an average success rate of over 87%. The main advantage of the proposed methodology is that it requires little knowledge about the documents, so it can be applied to most table-forms.
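The first module, identifying line intersections with morphological tools, can be illustrated with a small sketch. This is not the authors' implementation. It assumes a binary image given as a list of lists of 0/1 values and approximates a morphological opening with a line-shaped structuring element by keeping only pixels that lie in a run of at least `min_len` foreground pixels. The names `long_runs`, `transpose`, and `line_intersections` are hypothetical.

```python
def long_runs(image, min_len):
    """Keep pixels that lie in a horizontal run of at least min_len ones.

    For a binary image this matches an opening with a 1 x min_len line.
    """
    out = [[0] * len(row) for row in image]
    for y, row in enumerate(image):
        start = None
        # A trailing 0 acts as a sentinel that closes a run at the row end.
        for x, value in enumerate(list(row) + [0]):
            if value and start is None:
                start = x
            elif not value and start is not None:
                if x - start >= min_len:
                    for i in range(start, x):
                        out[y][i] = 1
                start = None
    return out


def transpose(image):
    """Swap rows and columns so vertical runs become horizontal ones."""
    return [list(column) for column in zip(*image)]


def line_intersections(image, min_len):
    """Return (row, col) pixels that lie on both a long horizontal
    and a long vertical line."""
    horizontal = long_runs(image, min_len)
    vertical = transpose(long_runs(transpose(image), min_len))
    return [
        (y, x)
        for y, row in enumerate(image)
        for x in range(len(row))
        if horizontal[y][x] and vertical[y][x]
    ]
```

Short runs such as smudges or pen strokes fall below `min_len` and are discarded, which is a crude stand-in for the correction step described in the abstract.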