TableSeer: automatic table metadata extraction and searching in digital libraries (original) (raw)

TableSeer

Proceedings of the 2007 conference on Digital libraries - JCDL '07, 2007

Tables are ubiquitous in digital libraries. In scientific documents, tables are widely used to present experimental results or statistical data in a condensed fashion. However, current search engines do not support table search. The difficulty of automatic extracting tables from un-tagged documents, the lack of a universal table metadata specification, and the limitation of the existing ranking schemes make the table search problem challenging. In this paper, we describe Ta-bleSeer, a search engine for tables. TableSeer crawls digital libraries, detects tables from documents, extracts tables metadata, indexes and ranks tables, and provides a userfriendly search interface. We propose an extensive set of medium-independent metadata for tables that scientists and other users can adopt for representing table information. In addition, we devise a novel page box-cutting method to improve the performance of the table detection. Given a query, TableSeer ranks the matched tables using an innovative ranking algorithm -TableRank. TableRank rates each <query, table> pair with a tailored vector space model and a specific term weighting scheme. Overall, TableSeer eliminates the burden of manually extract table data from digital libraries and enables users to automatically examine tables. We demonstrate the value of TableSeer with empirical studies on scientific documents.

Tablerank: A ranking algorithm for table search and retrieval

2007

Tables are ubiquitous in web pages and scientific documents. With the explosive development of the web, tables have become a valuable information repository. Therefore, effectively and efficiently searching tables becomes a challenge. Existing search engines do not provide satisfactory search results largely because the current ranking schemes are inadequate for table search and automatic table understanding and extraction are rather difficult in general. In this work, we design and evaluate a novel table ranking algorithm -TableRank to improve the performance of our table search engine Table-Seer. Given a keyword based table query, TableRank facilities TableSeer to return the most relevant tables by tailoring the classic vector space model. TableRank adopts an innovative term weighting scheme by aggregating multiple weighting factors from three levels: term, table and document. The experimental results show that our table search engine outperforms existing search engines on table search. In addition, incorporating multiple weighting factors can significantly improve the ranking results.

Towards Semantic Exploration of Tables in Scientific Documents

Workshop on Semantic Technologies for Scientific, Technical and Legal Data, Extended Semantic Web Conference, 2023

Structured data artifacts such as tables are widely used in scientific literature to organize and concisely communicate important statistical information. Discovering relevant information in these tables remains a significant challenge owing to their structural heterogeneity, dense and often implicit semantics, and diffuse context. This paper describes how we leverage semantic technologies to enable technical experts to search and explore tabular data embedded within scientific documents. We present a system for the on-demand construction of knowledge graphs representing scientific tables (drawn from online scholarly articles hosted by PubMed Central), and for synthesizing tabular responses to semantic search requests against such graphs. We discuss key differentiators in our overall approach, including a two-stage semantic table interpretation that relies on an extensive structural and syntactic characterization of scientific tables, and a prototype knowledge discovery engine that uses automatically-inferred semantics of scientific tables to serve search requests by potentially fusing information from multiple tables on the fly. We evaluate our system on a real-world dataset of approximately 120,000 tables extracted from over 62,000 COVID-19-related scientific articles.

TAO: System for Table Detection and Extraction from PDF Documents

2016

Digital documents present knowledge in most areas of study, exchanging and communicating information in a portable way. To better use the knowledge embedded in an ever-growing information source, effective tools for automatic information extraction are needed. Tables are crucial information elements in documents of scientific nature. Most publications use tables to represent and report concrete findings of research. Current methods used to extract table data from PDF documents lack precision in detecting, extracting, and representing data from diverse layouts. We present the system TAble Organization (TAO) to automatically detect, extract and organize information from tables in PDF documents. TAO uses a processing, based on the k-nearest neighbor method and layout heuristics, to detect tables within a document and to extract table information. This system generates an enriched representation of the data extracted from tables in the PDF documents. TAO’s performance is comparable to o...

Disentangling the Structure of Tables in Scientific Literature

Lecture Notes in Computer Science, 2016

Within the scientific literature, tables are commonly used to present factual and statistical information in a compact way, which is easy to digest by readers. The ability to "understand" the structure of tables is key for information extraction in many domains. However, the complexity and variety of presentation layouts and value formats makes it difficult to automatically extract roles and relationships of table cells. In this paper, we present a model that structures tables in a machine readable way and a methodology to automatically disentangle and transform tables into the modelled data structure. The method was tested in the domain of clinical trials: it achieved an F-score of 94.26% for cell function identification and 94.84% for identification of inter-cell relationships.

Identifying Web Tables: Supporting a Neglected Type of Content on the Web

Knowledge Engineering and Semantic Web, 2015

The abundance of the data in the Internet facilitates the improvement of extraction and processing tools. The trend in the open data publishing encourages the adoption of structured formats like CSV and RDF. However, there is still a plethora of unstructured data on the Web which we assume contain semantics. For this reason, we propose an approach to derive semantics from web tables which are still the most popular publishing tool on the Web. The paper also discusses methods and services of unstructured data extraction and processing as well as machine learning techniques to enhance such a workflow. The eventual result is a framework to process, publish and visualize linked open data. The software enables tables extraction from various open data sources in the HTML format and an automatic export to the RDF format making the data linked. The paper also gives the evaluation of machine learning techniques in conjunction with string similarity functions to be applied in a tables recognition task.

pdf2table: A Method to Extract Table Information from PDF Files

2005

Tables are a common structuring element in many documents, such as PDF files. To reuse such tables, appropriate methods need to be develop, which capture the structure and the content information. We have developed several heuristics which together recognize and decompose tables in PDF files and store the extracted data in a structured data format (XML) for easier reuse. Additionally, we implemented a prototype, which gives the user the ability of making adjustments on the extracted data. Our work shows that purely heuristic-based approaches can achieve good results, especially for lucid tables.

TabLeX: A Benchmark Dataset for Structure and Content Information Extraction from Scientific Tables

Document Analysis and Recognition – ICDAR 2021, 2021

Information Extraction (IE) from the tables present in scientific articles is challenging due to complicated tabular representations and complex embedded text. This paper presents TabLeX, a large-scale benchmark dataset comprising table images generated from scientific articles. TabLeX consists of two subsets, one for table structure extraction and the other for table content extraction. Each table image is accompanied by its corresponding L A T E X source code. To facilitate the development of robust table IE tools, TabLeX contains images in different aspect ratios and in a variety of fonts. Our analysis sheds light on the shortcomings of current state-of-the-art table extraction models and shows that they fail on even simple table images. Towards the end, we experiment with a transformer-based existing baseline to report performance scores. In contrast to the static benchmarks, we plan to augment this dataset with more complex and diverse tables at regular intervals.

Challenges in Information Extraction from Tables in Biomedical Research Publications: a Dataset Analysis

We present a study of a dataset of tables from biomedical research publications. Our aim is to identify characteristics of biomedical tables that pose challenges for the task of extracting information from tables, and to determine which parts of research papers typically contain information that is useful for this task. Our results indicate that biomedical tables are hard to interpret without their source papers due to the brevity of the entries in the tables. In many cases, unstructured text segments, such as table titles, footnotes and non-table prose discussing a table, are required to interpret the table's entries.

Technique for searching tabular form documents using metadata harvested by table structure analysis

Artificial Intelligence Research, 2014

Conducting full-text searches on collections of tabular files, in which a single sheet corresponds to a single document and each file consists of multiple sheets, typically involves retrieving many candidate files that include the search terms. Opening each of these tabular files to determine whether it is the desired sheet is labor-intensive. Searching with high precision thus requires expert intuition born of operational experience. Therefore, it would be advantageous to enable the pinpointing of desired documents with greater accuracy regardless of the operator's level of experience. In the present study, we propose a method in which operational classifications are assigned as metadata on the basis of the table structure of a sheet. We obtain the table structure of the sheet and assign metadata based on a set of rules established individually for each pattern in the structure. We propose two methods for representing the table structures obtained: a method using node property matrix, and a method in which positional data regarding cells containing specific operation-description data are indexed. Comparing the results of searches that use assigned metadata to the results of traditional full-text searches reveals that our method has greater search accuracy.