LayoutReader

LayoutReader captures text and layout information for reading order prediction using a seq2seq model. In our experiments, it significantly improves the ordering of text lines produced by both open-source and commercial OCR engines.

Our paper "LayoutReader: Pre-training of Text and Layout for Reading Order Detection" has been accepted by EMNLP 2021.

ReadingBank is a benchmark dataset for reading order detection built with weak supervision from Word documents. It contains 500K document images covering a wide range of document types, together with the corresponding reading order information. For more details, please refer to ReadingBank.
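As a rough illustration of the task: each page can be viewed as a set of words with bounding boxes, and the goal is to predict the order in which they should be read. The field names and the naive baseline below are illustrative only, not the dataset's actual schema:

```python
# Hypothetical ReadingBank-style sample: each word carries its text plus a
# bounding box normalized to a 0-1000 grid (the convention used by the
# LayoutLM family of models). Field names here are illustrative.
sample = [
    {"text": "Hello",  "bbox": (100, 50, 180, 70)},   # (x0, y0, x1, y1)
    {"text": "world",  "bbox": (200, 50, 280, 70)},
    {"text": "Footer", "bbox": (100, 900, 200, 920)},
]

def heuristic_order(words):
    """Naive top-to-bottom, left-to-right baseline that LayoutReader is
    designed to improve on for complex multi-column or tabular layouts."""
    return sorted(range(len(words)),
                  key=lambda i: (words[i]["bbox"][1], words[i]["bbox"][0]))

print(heuristic_order(sample))  # -> [0, 1, 2]
```

A heuristic like this breaks down on multi-column pages, which is where a learned seq2seq ordering helps.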

Installation

conda create -n LayoutReader python=3.7
conda activate LayoutReader
conda install pytorch==1.7.1 -c pytorch
pip install nltk
python -c "import nltk; nltk.download('punkt')"
git clone https://github.com/NVIDIA/apex.git && cd apex && python setup.py install --cuda_ext --cpp_ext
pip install transformers==2.10.0
git clone https://github.com/microsoft/unilm.git
cd unilm/layoutreader
pip install -e .

Run

  1. Download the pre-processed data (ReadingBank.zip). For more details of the dataset, please refer to ReadingBank.
  2. (Optional) Download our pre-trained model (layoutreader-base-readingbank.zip) and evaluate it following step 4.
  3. Training
export CUDA_VISIBLE_DEVICES=0,1,2,3  
export OMP_NUM_THREADS=4  
export MKL_NUM_THREADS=4  
python -m torch.distributed.launch --nproc_per_node=4 run_seq2seq.py \  
    --model_type layoutlm \  
    --model_name_or_path layoutlm-base-uncased \  
    --train_folder /path/to/ReadingBank/train \  
    --output_dir /path/to/output/LayoutReader/layoutlm \  
    --do_lower_case \  
    --fp16 \  
    --fp16_opt_level O2 \  
    --max_source_seq_length 513 \  
    --max_target_seq_length 511 \  
    --per_gpu_train_batch_size 2 \  
    --gradient_accumulation_steps 1 \  
    --learning_rate 7e-5 \  
    --num_warmup_steps 500 \  
    --num_training_steps 75000 \  
    --cache_dir /path/to/output/LayoutReader/cache \  
    --label_smoothing 0.1 \  
    --save_steps 5000 \  
    --cached_train_features_file /path/to/ReadingBank/features_train.pt  
  4. Decoding
export CUDA_VISIBLE_DEVICES=0  
export OMP_NUM_THREADS=4  
export MKL_NUM_THREADS=4  
python decode_seq2seq.py --fp16 \  
    --model_type layoutlm \  
    --tokenizer_name bert-base-uncased \  
    --input_folder /path/to/ReadingBank/test \  
    --cached_feature_file /path/to/ReadingBank/features_test.pt \  
    --output_file /path/to/output/LayoutReader/layoutlm/output.txt \  
    --split test \  
    --do_lower_case \  
    --model_path /path/to/output/LayoutReader/layoutlm/ckpt-75000 \  
    --cache_dir /path/to/output/LayoutReader/cache \  
    --max_seq_length 1024 \  
    --max_tgt_length 511 \  
    --batch_size 32 \  
    --beam_size 1 \  
    --length_penalty 0 \  
    --forbid_duplicate_ngrams \  
    --mode s2s \  
    --forbid_ignore_word "."  
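At inference time, the decoder emits a sequence of source-word indices, i.e. a predicted permutation of the input words. A minimal sketch of applying such a prediction to reorder OCR output (function and variable names are ours, not the repo's):

```python
def apply_reading_order(words, pred_indices):
    """Reorder OCR words by a predicted index sequence.
    pred_indices is assumed to be a permutation of range(len(words))."""
    return [words[i] for i in pred_indices]

# e.g. an OCR engine that emitted words out of reading order:
ocr_words = ["world", "Hello", "!"]
print(apply_reading_order(ocr_words, [1, 0, 2]))  # -> ['Hello', 'world', '!']
```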

Results

Our released pre-trained model achieves an average page-level BLEU score of 98.2%.
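Page-level BLEU compares the predicted index sequence against the ground-truth reading order for each page, then averages over pages. A self-contained sketch of the per-page score without smoothing (the repo's own evaluation script is authoritative):

```python
from collections import Counter
import math

def page_bleu(ref, hyp, max_n=4):
    """Unsmoothed BLEU on two token-index sequences: geometric mean of
    1..max_n n-gram precisions, times a brevity penalty."""
    if not hyp:
        return 0.0
    log_prec = 0.0
    for n in range(1, max_n + 1):
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        overlap = sum((hyp_ngrams & ref_ngrams).values())  # clipped counts
        total = max(sum(hyp_ngrams.values()), 1)
        if overlap == 0:
            return 0.0  # any zero n-gram precision zeroes the unsmoothed score
        log_prec += math.log(overlap / total) / max_n
    bp = min(1.0, math.exp(1 - len(ref) / len(hyp)))  # brevity penalty
    return bp * math.exp(log_prec)

# A perfectly ordered page scores 1.0; a reversed page scores lower:
print(page_bleu([0, 1, 2, 3, 4], [0, 1, 2, 3, 4]))  # -> 1.0
```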

Citation

If you find LayoutReader helpful, please cite us:

@misc{wang2021layoutreader,
      title={LayoutReader: Pre-training of Text and Layout for Reading Order Detection}, 
      author={Zilong Wang and Yiheng Xu and Lei Cui and Jingbo Shang and Furu Wei},
      year={2021},
      eprint={2108.11591},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree. Portions of the source code are based on the transformers and s2s-ft projects. This project follows the Microsoft Open Source Code of Conduct.

Contact

For help or issues using LayoutReader, please submit a GitHub issue.

For other communications related to LayoutLM, please contact Lei Cui (lecu@microsoft.com) and Furu Wei (fuwei@microsoft.com).