Paper page - LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding
Published on Dec 29, 2020
Abstract
LayoutLMv2, a two-stream multi-modal Transformer architecture, substantially improves performance on visually-rich document understanding tasks by introducing new pre-training tasks and a spatial-aware self-attention mechanism.
Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks due to its effective model architecture and the advantage of large-scale unlabeled scanned/digital-born documents. We propose the LayoutLMv2 architecture with new pre-training tasks to model the interaction among text, layout, and image in a single multi-modal framework. Specifically, with a two-stream multi-modal Transformer encoder, LayoutLMv2 uses not only the existing masked visual-language modeling task but also the new text-image alignment and text-image matching tasks, which help it better capture cross-modality interaction in the pre-training stage. Meanwhile, it also integrates a spatial-aware self-attention mechanism into the Transformer architecture so that the model can fully understand the relative positional relationship among different text blocks. Experiment results show that LayoutLMv2 outperforms LayoutLM by a large margin and achieves new state-of-the-art results on a wide variety of downstream visually-rich document understanding tasks, including FUNSD (0.7895 to 0.8420), CORD (0.9493 to 0.9601), SROIE (0.9524 to 0.9781), Kleister-NDA (0.8340 to 0.8520), RVL-CDIP (0.9443 to 0.9564), and DocVQA (0.7295 to 0.8672). We have made our model and code publicly available at https://aka.ms/layoutlmv2.
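The spatial-aware self-attention described in the abstract can be pictured as ordinary scaled dot-product attention plus learnable biases indexed by relative 1D token positions and relative 2D bounding-box coordinates. The snippet below is a minimal, single-head PyTorch sketch of that idea; the class and argument names are ours, and the simple clipping of relative distances stands in for the bucketing used in the paper. It is an illustration, not the released implementation.

import torch
import torch.nn as nn

class SpatialAwareSelfAttention(nn.Module):
    """Single-head sketch: scaled dot-product attention plus learnable
    biases for relative 1D token positions and relative 2D box coordinates."""

    def __init__(self, hidden_size, max_rel_1d=128, max_rel_2d=256):
        super().__init__()
        self.query = nn.Linear(hidden_size, hidden_size)
        self.key = nn.Linear(hidden_size, hidden_size)
        self.value = nn.Linear(hidden_size, hidden_size)
        self.scale = hidden_size ** -0.5
        # Learnable bias tables; clipping the relative distance below is a
        # simplification of the bucketed relative positions in the paper.
        self.bias_1d = nn.Embedding(2 * max_rel_1d + 1, 1)
        self.bias_x = nn.Embedding(2 * max_rel_2d + 1, 1)
        self.bias_y = nn.Embedding(2 * max_rel_2d + 1, 1)
        self.max_rel_1d = max_rel_1d
        self.max_rel_2d = max_rel_2d

    def _rel_index(self, pos, max_rel):
        # pos: (batch, seq) integer positions -> (batch, seq, seq) bias indices
        rel = pos[:, None, :] - pos[:, :, None]
        return rel.clamp(-max_rel, max_rel) + max_rel

    def forward(self, hidden, token_pos, box_x, box_y):
        # hidden: (batch, seq, hidden); token_pos, box_x, box_y: (batch, seq) long tensors
        q, k, v = self.query(hidden), self.key(hidden), self.value(hidden)
        scores = torch.matmul(q, k.transpose(-1, -2)) * self.scale
        scores = scores + self.bias_1d(self._rel_index(token_pos, self.max_rel_1d)).squeeze(-1)
        scores = scores + self.bias_x(self._rel_index(box_x, self.max_rel_2d)).squeeze(-1)
        scores = scores + self.bias_y(self._rel_index(box_y, self.max_rel_2d)).squeeze(-1)
        return torch.matmul(scores.softmax(dim=-1), v)

In the full model, visual tokens produced by an image backbone join the text tokens in the same sequence, so the biased attention operates over both modalities at once.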
Models citing this paper 3
microsoft/layoutlmv2-base-uncased — Updated Sep 16, 2022 • 611k downloads • 67 likes
microsoft/layoutlmv2-large-uncased — Updated Sep 16, 2022 • 6.94k downloads • 11 likes
aslessor/layoutlmv2-base-uncased Other • Updated Jan 5, 2024 • 6
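The checkpoints above can be loaded with the LayoutLMv2 classes in the Hugging Face Transformers library. The snippet below is a minimal document-classification sketch; it assumes pytesseract and detectron2 are installed (the processor runs OCR and the model uses a visual backbone), and the file name and label count are placeholders.

from PIL import Image
from transformers import LayoutLMv2Processor, LayoutLMv2ForSequenceClassification

# The processor bundles the tokenizer and image processor; by default it runs
# Tesseract OCR on the page image to obtain words and bounding boxes.
processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")
model = LayoutLMv2ForSequenceClassification.from_pretrained(
    "microsoft/layoutlmv2-base-uncased", num_labels=16  # e.g. the 16 RVL-CDIP classes
)

image = Image.open("document.png").convert("RGB")  # placeholder scanned page
encoding = processor(image, return_tensors="pt")
outputs = model(**encoding)
predicted_class = outputs.logits.argmax(-1).item()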
Datasets citing this paper 0
No datasets link to this paper yet.
Cite arxiv.org/abs/2012.14740 in a dataset README.md to link it from this page.