ByteDance/Dolphin-v2 · Hugging Face

Dolphin-v2: Universal Document Parsing via Scalable Anchor Prompting

Model Description

Dolphin-v2 is an enhanced universal document parsing model that improves on the original Dolphin. It handles both digital-born and photographed documents through a document-type-aware two-stage architecture with scalable anchor prompting.

📑 Key Improvements

Dolphin-v2 introduces several major enhancements over the original Dolphin:

🏗️ Model Architecture

Dolphin-v2 follows a document-type-aware two-stage paradigm:

Stage 1: Joint Classification and Layout Analysis

Stage 2: Hybrid Content Parsing
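
The two-stage flow above can be sketched as follows. This is a minimal illustration, not the model's actual inference code: the names `Element`, `LayoutResult`, `parse_document`, and the three callables are hypothetical, and the branching rule (whole-page parsing for photographed pages, per-element parsing for digital-born pages) is an assumption about what "hybrid" means in Stage 2.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Element:
    label: str                       # e.g. "para", "tab", "equ"
    bbox: Tuple[int, int, int, int]  # (x1, y1, x2, y2) in page pixels


@dataclass
class LayoutResult:
    doc_type: str            # assumed values: "digital" or "photographed"
    elements: List[Element]  # assumed to be in reading order


def parse_document(
    page: object,
    analyze_layout: Callable[[object], LayoutResult],
    parse_page: Callable[[object], str],
    parse_element: Callable[[object, Element], str],
) -> List[str]:
    """Run stage 1 once, then choose a stage 2 strategy from its output."""
    layout = analyze_layout(page)  # stage 1: document type + layout
    if layout.doc_type == "photographed":
        # Assumption: photographed pages are parsed as a whole.
        return [parse_page(page)]
    # Assumption: digital-born pages are parsed element by element.
    return [parse_element(page, el) for el in layout.elements]
```

Passing the stages in as callables keeps the sketch independent of any particular model-loading code; in practice each callable would wrap a prompt to the same vision-language model.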

Built on Qwen2.5-VL-3B backbone with:

📈 Performance

Dolphin-v2 achieves superior performance on comprehensive benchmarks:

OmniDocBench (v1.5):

🎯 Supported Element Types

Dolphin-v2 supports 21 document element categories:

| Element Type | Description |
|---|---|
| sec_0 - sec_5 | Hierarchical headings (title, level 1-5) |
| para | Regular paragraphs |
| half_para | Spanning paragraphs |
| equ | Mathematical formulas (LaTeX) |
| tab | Tables (HTML) |
| code | Code blocks (with indentation) |
| fig | Figures |
| cap | Captions |
| list | Lists |
| catalogue | Catalogs |
| reference | References |
| header / foot | Headers/Footers |
| fnote | Footnotes |
| watermark | Watermarks |
| anno | Annotations |
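
For downstream code that consumes the parser's output, the element table can be expanded into an explicit label list. The sketch below is illustrative only: the helper names `element_labels` and `output_format` are hypothetical, and mapping every label other than `equ` and `tab` to plain `"text"` is an assumption, since the table states an output format only for those two.

```python
from typing import Dict, List

# Output formats the table states explicitly. Any other label falls back
# to "text", which is an assumption rather than documented behavior.
STATED_FORMATS: Dict[str, str] = {"equ": "latex", "tab": "html"}


def element_labels() -> List[str]:
    """Expand the table rows into individual element labels."""
    headings = [f"sec_{level}" for level in range(6)]  # sec_0 .. sec_5
    others = [
        "para", "half_para", "equ", "tab", "code", "fig", "cap", "list",
        "catalogue", "reference", "header", "foot", "fnote", "watermark",
        "anno",
    ]
    return headings + others


def output_format(label: str) -> str:
    """Return the expected output format for a known element label."""
    if label not in element_labels():
        raise ValueError(f"unknown element label: {label}")
    return STATED_FORMATS.get(label, "text")
```

Expanding the `sec_0 - sec_5` and `header / foot` rows gives 21 labels in total, which matches the category count stated above.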

📚 Citation

@inproceedings{dolphin2025,
  title={Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting},
  author={Feng, Hao and Wei, Shu and Fei, Xiang and Shi, Wei and Han, Yingdong and Liao, Lei and Lu, Jinghui and Wu, Binghong and Liu, Qi and Lin, Chunhui and Tang, Jingqun and Liu, Hao and Huang, Can},
  booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL)},
  year={2025}
}

🙏 Acknowledgements

This model builds upon: