Retrieval-Based Transformer for Table Augmentation

Table Pre-training: A Survey on Model Architectures, Pre-training Objectives, and Downstream Tasks

Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence

Following the success of pre-training techniques in the natural language domain, a flurry of table pre-training frameworks have been proposed and have achieved new state-of-the-art results on various downstream tasks such as table question answering, table type recognition, column relation classification, table search, and formula prediction. Various model architectures have been explored to best capture the characteristics of (semi-)structured tables, especially specially-designed attention mechanisms. Moreover, to fully leverage the supervision signals in unlabeled tables, diverse pre-training objectives have been designed and evaluated, for example, denoising cell values, predicting numerical relationships, and learning a neural SQL executor. This survey aims to provide a comprehensive review of model designs, pre-training objectives, and downstream tasks for table pre-training, and we further share our thoughts on existing challenges and future opportunities.

Structure-aware Pre-training for Table Understanding with Tree-based Transformers

ArXiv, 2020

Tables are widely used with various structures to organize and present data. Recent attempts at table understanding mainly focus on relational tables, yet overlook other common table structures. In this paper, we propose TUTA, a unified pre-training architecture for understanding generally structured tables. Since understanding a table requires leveraging spatial, hierarchical, and semantic information, we adapt the self-attention strategy with several key structure-aware mechanisms. First, we propose a novel tree-based structure, the bi-dimensional coordinate tree, to describe both the spatial and hierarchical information in tables. Upon this, we extend the pre-training architecture with two core mechanisms, namely tree-based attention and tree-based position embedding. Moreover, to capture table information in a progressive manner, we devise three pre-training objectives to enable representations at the token, cell, and table levels. TUTA pre-trains on a wide range of ...
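
The bi-dimensional coordinate tree can be made concrete with a small sketch. The snippet below is our own illustration (not the TUTA code): each cell carries an index path in the top (column-header) tree and the left (row-header) tree, and a simple tree distance between two cells can then bias or gate self-attention so that structurally close cells interact more strongly. All names and coordinates are hypothetical.

```python
def tree_distance(path_a, path_b):
    """Number of tree edges between two nodes identified by index paths."""
    common = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        common += 1
    return (len(path_a) - common) + (len(path_b) - common)

def cell_distance(cell_a, cell_b):
    """Bi-dimensional distance: sum of distances in the top tree and the left tree."""
    top_a, left_a = cell_a
    top_b, left_b = cell_b
    return tree_distance(top_a, top_b) + tree_distance(left_a, left_b)

# Two data cells under the same parent column header are closer than cells
# in unrelated sub-trees; the distance can drive an attention mask or bias.
c1 = ((0, 1), (2,))   # (top-tree path, left-tree path), hypothetical coordinates
c2 = ((0, 2), (2,))
c3 = ((3,), (5, 0))
print(cell_distance(c1, c2))  # 2  -> structurally close, attention allowed/boosted
print(cell_distance(c1, c3))  # 6  -> structurally distant, attention masked/penalized
```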

GetPt: Graph-enhanced General Table Pre-training with Alternate Attention Network

Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

Tables are widely used for data storage and presentation due to their high flexibility in layout. The importance of tables as information carriers and the complexity of tabular data understanding attract a great deal of research on large-scale pre-training for tabular data. However, most existing works design models for specific types of tables, such as relational tables and tables with well-structured headers, neglecting tables with complex layouts. In real-world scenarios, many such tables fall outside the target scope of previous research and are thus not well supported. In this paper, we propose GetPt, a unified pre-training architecture for general table representation, applicable even to tables with complex structures and layouts. First, we convert a table to a heterogeneous graph to represent the layout of the table. Based on the graph, a specially designed transformer is applied to jointly model the semantics and structure of the table. Second, we devise an Alternate Attention Network (AAN) to better model the contextual information across multiple granularities of a table, including the tokens, cells, and table. To better support a wide range of downstream tasks, we further ...
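
As a rough illustration of the first step described above, converting a table to a heterogeneous graph, the sketch below (our own simplification, not the GetPt implementation) builds cell nodes, a table-level node, and typed same-row / same-column edges from a plain grid; a structure-aware transformer could then consume these nodes and edges.

```python
def table_to_graph(grid):
    """grid: list of rows, each a list of cell strings (None for empty cells)."""
    nodes = [("table", None)]          # node 0 represents the whole table
    edges, cell_ids = [], {}
    for r, row in enumerate(grid):
        for c, text in enumerate(row):
            if text is None:
                continue
            cell_ids[(r, c)] = len(nodes)
            nodes.append(("cell", (r, c, text)))
            edges.append((0, cell_ids[(r, c)], "contains"))
    # typed layout edges between cells sharing a row or a column
    for (r, c), i in cell_ids.items():
        for (r2, c2), j in cell_ids.items():
            if i < j and (r == r2 or c == c2):
                edges.append((i, j, "same_row" if r == r2 else "same_col"))
    return nodes, edges

nodes, edges = table_to_graph([["Year", "Revenue"], ["2020", "1.2M"], ["2021", "1.5M"]])
print(len(nodes), len(edges))  # 7 nodes, 15 typed edges for this 3x2 grid
```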

CLTR: An End-to-End, Transformer-Based System for Cell Level Table Retrieval and Table Question Answering

ArXiv, 2021

We present the first end-to-end, transformer-based table question answering (QA) system that takes natural language questions and a massive table corpus as inputs to retrieve the most relevant tables and locate the correct table cells to answer the question. Our system, CLTR, extends the current state-of-the-art QA over tables model to build an end-to-end table QA architecture. This system has successfully tackled many real-world table QA problems with a simple, unified pipeline. Our proposed system can also generate a heatmap of candidate columns and rows over complex tables and allow users to quickly identify the correct cells to answer questions. In addition, we introduce two new open-domain benchmarks, E2E WTQ and E2E GNQ, consisting of 2,005 natural language questions over 76,242 tables. The benchmarks are designed to validate CLTR as well as accommodate future table retrieval and end-to-end table QA research and experiments. Our experiments demonstrate that our system is the cu...

Answering table augmentation queries from unstructured lists on the web

2009

We present the design of a system for assembling a table from a few example rows by harnessing the huge corpus of information-rich but unstructured lists on the web. We developed a fully unsupervised end-to-end approach which, given the sample query rows, (a) retrieves HTML lists relevant to the query from a pre-indexed crawl of web lists, (b) segments the list records and maps the segments to the query schema using a statistical model, (c) consolidates the results from multiple lists into a unified merged table, and (d) presents to the user the consolidated records ranked by their estimated membership in the target relation.

STable: Table Generation Framework for Encoder-Decoder Models

ArXiv, 2022

The output structure of database-like tables, consisting of values structured in horizontal rows and vertical columns identifiable by name, can cover a wide range of NLP tasks. Following this observation, we propose a framework for text-to-table neural models applicable to problems such as extraction of line items, joint entity and relation extraction, or knowledge base population. The permutation-based decoder of our proposal is a generalized sequential method that comprehends information from all cells in the table. The training maximizes the expected log-likelihood for a table's content across all random permutations of the factorization order. During content inference, we exploit the model's ability to generate cells in any order by searching over possible orderings to maximize the model's confidence and avoid the substantial error accumulation that other sequential models are prone to. Experiments demonstrate the high practical value of the framework, which establishes state-of-the-art results on several challenging datasets, outperforming previous solutions by up to 15%.
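
The permutation-based training objective can be sketched in a few lines. The following is a toy illustration under our own assumptions (hypothetical scorer, no real encoder-decoder): each step samples one random factorization order over the table's cells and sums the log-likelihood of each cell given the cells already placed, which is a single-sample estimate of the expected log-likelihood over orderings.

```python
import random

def permutation_log_likelihood(cell_log_prob, cells, rng=random):
    """cells: list of (position, value); cell_log_prob(position, value, context)
    returns the model's log-probability of `value` at `position` given the
    already-generated (position, value) pairs in `context`."""
    order = list(range(len(cells)))
    rng.shuffle(order)                      # one random factorization order
    context, total = [], 0.0
    for idx in order:
        pos, val = cells[idx]
        total += cell_log_prob(pos, val, context)
        context.append((pos, val))          # teacher forcing with the gold value
    return total

# Toy stand-in scorer so the sketch runs end to end.
toy_scorer = lambda pos, val, ctx: -1.0 - 0.1 * len(ctx)
cells = [((0, 0), "Soda"), ((0, 1), "3"), ((1, 0), "Water"), ((1, 1), "5")]
print(permutation_log_likelihood(toy_scorer, cells))
```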

Capturing Row and Column Semantics in Transformer Based Question Answering over Tables

Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Transformer-based architectures have recently been used for the task of answering questions over tables. In order to improve the accuracy on this task, specialized pre-training techniques have been developed and applied to millions of open-domain web tables. In this paper, we propose two novel approaches demonstrating that one can achieve superior performance on the table QA task without even using any of these specialized pre-training techniques. The first model, called RCI interaction, leverages a transformer-based architecture that independently classifies rows and columns to identify relevant cells. While this model yields extremely high accuracy at finding cell values on recent benchmarks, a second model we propose, called RCI representation, provides a significant efficiency advantage for online QA systems over tables by materializing embeddings for existing tables. Experiments on recent benchmarks show that the proposed methods can effectively locate cell values on tables (up to ∼98% Hit@1 accuracy on WikiSQL lookup questions). Also, the interaction model outperforms the state-of-the-art transformer-based approaches, pre-trained on very large table corpora (TAPAS and TABERT), achieving ∼3.4% and ∼18.86% additional precision improvement on the standard WikiSQL benchmark.
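
The row/column decomposition described above lends itself to a compact sketch. The snippet below is our own toy illustration (not the released RCI code): every row and every column is scored against the question independently, and cells are ranked by the sum of their row and column scores, which is also how a heatmap over the table could be produced.

```python
def rank_cells(question, rows, columns, score):
    """score(question, text) -> relevance of one serialized row or column."""
    row_scores = [score(question, r) for r in rows]
    col_scores = [score(question, c) for c in columns]
    cells = [((i, j), row_scores[i] + col_scores[j])
             for i in range(len(rows)) for j in range(len(columns))]
    return sorted(cells, key=lambda cell: cell[1], reverse=True)

# Toy word-overlap scorer so the example runs without a trained classifier.
toy_score = lambda q, t: len(set(q.lower().split()) & set(t.lower().split()))
rows = ["2018 Berlin 3.6M", "2019 Paris 2.1M"]
columns = ["Year", "City", "Population"]
best = rank_cells("What is the population of Paris", rows, columns, toy_score)[0]
print(best)  # ((1, 2), 2): the Paris row intersected with the Population column
```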

VIME: Extending the Success of Self- and Semi-supervised Learning to Tabular Domain

2020

Self- and semi-supervised learning frameworks have made significant progress in training machine learning models with limited labeled data in image and language domains. These methods heavily rely on the unique structure in the domain datasets (such as spatial relationships in images or semantic relationships in language). They are not adaptable to general tabular data, which does not have the same explicit structure as image and language data. In this paper, we fill this gap by proposing novel self- and semi-supervised learning frameworks for tabular data, which we refer to collectively as VIME (Value Imputation and Mask Estimation). We create a novel pretext task of estimating mask vectors from corrupted tabular data in addition to the reconstruction pretext task for self-supervised learning. We also introduce a novel tabular data augmentation method for self- and semi-supervised learning frameworks. In experiments, we evaluate the proposed framework on multiple tabular datasets from var...
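
The mask-estimation pretext task described above can be illustrated with a minimal corruption routine (our own sketch, not the VIME release): sample a binary mask per feature, replace masked entries with values drawn from other rows, and train an encoder with two heads, one predicting the mask and one reconstructing the original values.

```python
import random

def corrupt(row, dataset, p_mask=0.3, rng=random):
    """Return (corrupted_row, mask) targets for the two pretext heads."""
    corrupted, mask = [], []
    for j, value in enumerate(row):
        if rng.random() < p_mask:
            donor = rng.choice(dataset)   # swap in the same feature from another row
            corrupted.append(donor[j])
            mask.append(1)                # target for the mask-estimation head
        else:
            corrupted.append(value)
            mask.append(0)
    return corrupted, mask

data = [[5.1, 3.5, 1.4], [4.9, 3.0, 1.3], [6.2, 3.4, 5.4]]
x_tilde, m = corrupt(data[0], data)
print(x_tilde, m)  # encoder input and mask target; the original row is the imputation target
```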

Tablext: A Combined Neural Network And Heuristic Based Table Extractor

ArXiv, 2021

A significant portion of the data available today is found within tables. Therefore, it is necessary to use automated table extraction to obtain thorough results when data-mining. Today’s popular state-of-the-art methods for table extraction struggle to adequately extract tables with machine-readable text and structural data. To make matters worse, many tables do not have machine-readable data, such as tables saved as images, making most extraction methods completely ineffective. In order to address these issues, a novel, general-format table extractor tool, Tablext, is proposed. This tool uses a combination of computer vision techniques and machine learning methods to efficiently and effectively identify and extract data from tables. Tablext begins by using a custom Convolutional Neural Network (CNN) to identify and separate all potential tables. The identification process is optimized by combining the custom CNN with the YOLO object detection network. Then, the high-level structure...

Table Search Using a Deep Contextualized Language Model

Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval

Pretrained contextualized language models such as BERT have achieved impressive results on various natural language processing benchmarks. Benefiting from multiple pretraining tasks and large-scale training corpora, pretrained models can capture complex syntactic word relations. In this paper, we use the deep contextualized language model BERT for the task of ad hoc table retrieval. We investigate how to encode table content considering the table structure and the input length limit of BERT. We also propose an approach that incorporates features from prior literature on table retrieval and jointly trains them with BERT. In experiments on public datasets, we show that our best approach can outperform the previous state-of-the-art method and BERT baselines by a large margin under different evaluation metrics.
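
One simple way to handle the table structure and input-length constraint mentioned above is to linearize the caption, headers, and leading rows into a single text and truncate it to a token budget before pairing it with the query. The sketch below is our own simplification, not the paper's exact encoding; the whitespace tokenizer and [SEP] separators stand in for a real BERT tokenizer.

```python
def linearize_table(caption, headers, rows, max_tokens=256):
    """Flatten caption, header row, and data rows into one truncated string."""
    pieces = [caption, " | ".join(headers)]
    for row in rows:
        pieces.append(" | ".join(str(v) for v in row))
    text = " [SEP] ".join(pieces)
    tokens = text.split()                 # whitespace stand-in for WordPiece
    return " ".join(tokens[:max_tokens])

table_text = linearize_table(
    "List of tallest buildings",
    ["Rank", "Building", "Height"],
    [[1, "Burj Khalifa", "828 m"], [2, "Merdeka 118", "679 m"]],
)
print(table_text)  # paired with the query as a (query, table_text) input to BERT
```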