GitHub - opendatalab/MinerU-Diffusion: A diffusion-based framework for document OCR that replaces autoregressive decoding with block-level parallel diffusion decoding.

MinerU-Diffusion

MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding


📰 News

🎯 Roadmap

Our long-term goal is to build efficient and reliable 2.5B diffusion-based decoding for document OCR.


💡 TL;DR

MinerU-Diffusion reframes document OCR as an inverse rendering problem and replaces slow, error-prone autoregressive decoding with parallel diffusion decoding.

By introducing block-wise diffusion and uncertainty-driven curriculum learning, it achieves up to 3.2× faster decoding while improving robustness and reducing reliance on language priors.

Diffusion Decoding

Diffusion decoding progressively reconstructs structured text from masked tokens under visual conditioning: black tokens are confirmed, red tokens are being updated, and yellow tokens remain masked, enabling parallel generation with global consistency, in contrast to autoregressive left-to-right decoding.
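The confirm/update/mask cycle described above can be illustrated with a toy loop. This is a minimal sketch and not the repository's decoder: `toy_predict` is a hypothetical stand-in that returns the correct token with a random confidence, and the 0.95 default only mirrors the `dynamic_threshold` value used in the Transformers example. Every masked position is scored in each round, tokens above the threshold are confirmed, and the most confident token is always confirmed so the loop makes progress.

```python
import random

MASK = object()  # sentinel for a position that is still masked


def toy_predict(target, position, rng):
    """Hypothetical stand-in for the model: returns (token, confidence)."""
    return target[position], rng.random()


def decode_block(target, threshold=0.95, seed=0):
    """Fill one block in parallel rounds, confirming only confident tokens."""
    rng = random.Random(seed)
    block = [MASK] * len(target)
    rounds = 0
    while any(tok is MASK for tok in block):
        masked = [i for i, tok in enumerate(block) if tok is MASK]
        # Score every masked position in the same round (one forward pass in a real model).
        proposals = {i: toy_predict(target, i, rng) for i in masked}
        confident = [i for i in masked if proposals[i][1] >= threshold]
        if not confident:
            # Confirm the single most confident token so decoding always advances.
            confident = [max(masked, key=lambda i: proposals[i][1])]
        for i in confident:
            block[i] = proposals[i][0]
        rounds += 1
    return block, rounds
```

Lowering `threshold` confirms more tokens per round, which needs fewer rounds; with a real model that trades accuracy for speed.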

Overview

Training of MinerU-Diffusion. Left: the target token sequence is randomly masked to form a partially observed input, and the model predicts only the masked positions under visual and prompt conditioning. Right: the structured block-attention mask used during training, where tokens attend bidirectionally within each block and causally to all preceding blocks, enabling parallel diffusion refinement within blocks while preserving coarse autoregressive structure across blocks.
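The two training ingredients in this caption can be sketched in plain Python. This is an illustrative reconstruction from the description only: the function names are hypothetical, and it covers just the target tokens, ignoring the visual and prompt tokens that the real model also attends to.

```python
import random


def block_attention_mask(seq_len, block_len):
    """Return a seq_len x seq_len boolean grid; True means query q may attend to key k."""
    block_id = [i // block_len for i in range(seq_len)]
    # Same block -> bidirectional attention; earlier blocks -> causal across blocks.
    return [[block_id[q] >= block_id[k] for k in range(seq_len)] for q in range(seq_len)]


def mask_targets(tokens, mask_token, ratio, rng):
    """Replace each target token with mask_token with probability `ratio`."""
    flags = [rng.random() < ratio for _ in tokens]
    noisy = [mask_token if flag else tok for tok, flag in zip(tokens, flags)]
    return noisy, flags
```

For example, `block_attention_mask(6, 2)` lets position 0 see position 1 (same block) but not position 2 (a later block), while position 2 sees positions 0 through 3.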

📈 Performance

Performance Trade-off

MinerU-Diffusion provides a flexible accuracy-throughput trade-off through threshold control. Compared with MinerU2.5, it reaches up to 3.26× the throughput in tokens per second (TPS), and it offers practical operating points such as a 2.12× speedup at 99.9% relative accuracy and a 3.01× speedup at 98.8% relative accuracy.
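As a small illustration of how to read these numbers, the snippet below picks the fastest of the two quoted operating points that still meets an accuracy floor. The helper name is hypothetical, and only the figures quoted above are used; the threshold values that produce each point are not given here.

```python
# (speedup over MinerU2.5, relative accuracy in %) as quoted in this section.
OPERATING_POINTS = [(2.12, 99.9), (3.01, 98.8)]


def fastest_point(min_relative_accuracy):
    """Return the fastest quoted operating point meeting the accuracy floor, or None."""
    eligible = [p for p in OPERATING_POINTS if p[1] >= min_relative_accuracy]
    return max(eligible, key=lambda p: p[0]) if eligible else None
```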

🗂️ Repository Layout

MinerU-Diffusion/
├── .gitignore
├── LICENSE
├── README.md
├── requirements.txt
├── assets/
│   ├── banner.png
│   ├── decode.png
│   ├── homepage-demo.mp4
│   ├── image.png
│   ├── performance_tradeoff.jpeg
│   └── train.png
├── docs/
│   ├── MinerU-Diffusion-V1.pdf
│   ├── gradio/
│   │   ├── .gitignore
│   │   ├── app.py
│   │   ├── diffusion_hf.py
│   │   ├── mineru_hf.py
│   │   ├── runtime_paths.example.json
│   │   └── speed_compare/
│   └── sglang/
│       ├── README.md
│       ├── mineru_request.py
│       ├── run_infer.sh
│       └── run_server.sh
├── engines/
│   ├── __init__.py
│   ├── hf/
│   │   ├── __init__.py
│   │   └── runner.py
│   ├── nano_dvlm/
│   │   ├── .gitignore
│   │   ├── LICENSE
│   │   ├── __init__.py
│   │   ├── nanovllm/
│   │   ├── bench.py
│   │   ├── example.py
│   │   ├── llm_outputs/
│   │   └── pyproject.toml
│   └── sglang/
│       └── __init__.py
├── mineru_diffusion/
│   ├── __init__.py
│   ├── configuration_mineru_diffusion.py
│   ├── modeling_mineru_diffusion.py
│   ├── processing_mineru_diffusion.py
│   └── utils/
│       ├── __init__.py
│       └── bbox.py
├── scripts/
│   ├── run_end2end.py
│   ├── run_end2end.sh
│   ├── run_inference.py
│   ├── run_inference.sh
│   └── run_sglang_server.sh

🌐 Online Experience

Official online web application

The official web application provides a more complete product experience, including a polished interface and richer features. Login is required.

Gradio-based online demo

A lightweight Gradio WebUI for trying the core parsing workflow. No login is required.

🛠️ Environment Setup

For a first-time setup, we recommend creating a dedicated Conda environment named dmineru and installing the dependencies below.

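Recommended core versions: Python 3.12, PyTorch 2.8.0 with torchvision 0.23.0 and torchaudio 2.8.0 (CUDA 12.8 wheels), transformers 4.52.1 or newer, and FlashAttention 2.8.3.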

Create and install the environment:

conda create -n dmineru python=3.12 -y
conda activate dmineru

pip install --upgrade pip
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128
pip install "transformers>=4.52.1"
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
pip install flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
pip install -r requirements.txt

The root-level requirements.txt covers:

Notes:

📦 Model Weights

Download the model weights before running inference, then point MODEL_PATH to the local checkpoint directory.

Example:

MODEL_PATH=/path/to/MinerU-Diffusion-V1-0320-2.5B
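Before launching inference, it can help to confirm that MODEL_PATH really points at a checkpoint directory. The check below is a minimal sketch: `resolve_model_path` is a hypothetical helper, and the presence of `config.json` is an assumption based on the usual Hugging Face checkpoint layout rather than something this README states.

```python
import os
from pathlib import Path


def resolve_model_path(env=None):
    """Read MODEL_PATH and check that it looks like a local checkpoint directory."""
    env = os.environ if env is None else env
    raw = env.get("MODEL_PATH")
    if not raw:
        raise ValueError("MODEL_PATH is not set")
    path = Path(raw).expanduser()
    if not path.is_dir():
        raise FileNotFoundError(f"MODEL_PATH is not a directory: {path}")
    # Assumption: a Hugging Face style checkpoint ships a config.json.
    if not (path / "config.json").is_file():
        raise FileNotFoundError(f"no config.json found in {path}")
    return path
```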

🧩 Prompt Types

MinerU-Diffusion supports multiple prompt types for different document parsing targets. Each prompt is designed for a specific output structure rather than a single generic free-form response.

Prompt Type: Layout Detection
  Function: Page-level layout parsing with region coordinates, category tags, and rotation direction.
  Input Setting: Resized to 1036 x 1036.
  Output Format: Bounding boxes plus element labels and rotation tags.
  Example Output: <|box_start|>100 200 300 400<|box_end|><|ref_start|>title<

Prompt Type: Text Recognition
  Function: Plain OCR text extraction.
  Input Setting: Native resolution, 4 to 2048 image tokens.
  Output Format: Raw OCR text.
  Example Output: The results of the analyses of the uncertainty of the field data and related assumptions are shown in Figs 13 and 14.

Prompt Type: Formula Recognition
  Function: Formula extraction and conversion into LaTeX.
  Input Setting: Native resolution, 4 to 2048 image tokens.
  Output Format: LaTeX formula content.
  Example Output: \hat{F} = \operatorname{Concat}([F_1, F_2, \dots, F_n])

Prompt Type: Table Recognition
  Function: Structured table extraction for downstream processing.
  Input Setting: Native resolution, 4 to 2048 image tokens.
  Output Format: OTSL (Open Table Structure Language).
  Example Output: Site Cl NO3 SO4 Na ...
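Layout Detection output has to be parsed before any region can be cropped. The sketch below is one way to do that, assuming the special tokens are spelled `<|box_start|>`, `<|box_end|>`, and `<|ref_start|>`, and that each box holds four integers. `LayoutBlock` and `parse_layout` are hypothetical names, and rotation tags are ignored because their spelling is not shown in the example output.

```python
import re
from typing import List, NamedTuple


class LayoutBlock(NamedTuple):
    box: tuple   # four integers, in the order the model emitted them
    label: str   # element category, e.g. "title"


# Assumed token spelling: <|box_start|>a b c d<|box_end|><|ref_start|>label<...
_BLOCK_RE = re.compile(
    r"<\|box_start\|>\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*<\|box_end\|>"
    r"\s*<\|ref_start\|>([^<]+)<"
)


def parse_layout(text: str) -> List[LayoutBlock]:
    """Extract (box, label) pairs from a layout-detection string."""
    return [
        LayoutBlock(tuple(int(v) for v in m.groups()[:4]), m.group(5).strip())
        for m in _BLOCK_RE.finditer(text)
    ]
```

Whether the coordinates refer to the 1036 x 1036 resized input or to another normalized range is not stated here, so rescale them to the original page only after checking the repository's bbox utilities.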

🚀 Inference

Replace MODEL_PATH and IMAGE_PATH with your own paths before running.

There are two local entry scripts:

Transformers Example

import torch
from transformers import AutoModel, AutoProcessor, AutoTokenizer

model_id = "Niujunbo2002/MinerU-Diffusion-V1-0320-2.5B"
image_path = "path/to/page.png"

# Load the tokenizer, processor, and model; the checkpoint ships custom code.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(
    model_id,
    trust_remote_code=True,
    use_fast=False,
)
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
).eval().to("cuda")

# Build a chat-style request: one page image plus the prompt type.
messages = [
    {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": "\nText Recognition:"},
        ],
    },
]

prompt_text = processor.apply_chat_template(messages, add_generation_prompt=True)
if isinstance(prompt_text, tuple):
    prompt_text = prompt_text[0]

# Tokenize the prompt, preprocess the image, and move the tensors to the GPU.
inputs = processor(
    images=[image_path],
    text=prompt_text,
    truncation=True,
    max_length=4096,
    return_tensors="pt",
)
input_ids = inputs["input_ids"].to(torch.long).to("cuda")
pixel_values = inputs["pixel_values"].to(torch.bfloat16).to("cuda")
image_grid_thw = inputs.get("image_grid_thw")
if image_grid_thw is not None:
    image_grid_thw = image_grid_thw.to(torch.long).to("cuda")

# Run block-wise diffusion decoding.
with torch.no_grad():
    generate_outputs = model.generate(
        pixel_values=pixel_values,
        image_grid_thw=image_grid_thw,
        input_ids=input_ids,
        mask_token_id=tokenizer.convert_tokens_to_ids("<|MASK|>"),
        denoising_steps=32,
        gen_length=1024,
        block_length=32,
        temperature=1.0,
        remasking_strategy="low_confidence_dynamic",
        dynamic_threshold=0.95,
        tokenizer=tokenizer,
        stopping_criteria=["<|endoftext|>", "<|im_end|>"],
    )

# Decode, then cut the text at the first stop token.
output_ids = generate_outputs[0] if isinstance(generate_outputs, tuple) else generate_outputs
text = tokenizer.decode(output_ids[0], skip_special_tokens=False)
for stop in ("<|endoftext|>", "<|im_end|>"):
    text = text.split(stop, 1)[0]

print(text.strip())

HF Engine

cd /path/to/MinerU-Diffusion
ENGINE=hf \
MODEL_PATH=/path/to/MinerU-Diffusion-model \
IMAGE_PATH=/path/to/input-image.png \
bash scripts/run_inference.sh

Nano-DVLM Engine

cd /path/to/MinerU-Diffusion
ENGINE=nano_dvlm \
MODEL_PATH=/path/to/MinerU-Diffusion-model \
IMAGE_PATH=/path/to/input-image.png \
bash scripts/run_inference.sh

SGLang Engine

Start the SGLang server first:

cd /path/to/MinerU-Diffusion
MODEL_PATH=/path/to/MinerU-Diffusion-model \
bash scripts/run_sglang_server.sh

Then send the request through the unified inference entry:

cd /path/to/MinerU-Diffusion
ENGINE=sglang \
MODEL_PATH=/path/to/MinerU-Diffusion-model \
IMAGE_PATH=/path/to/input-image.png \
SGLANG_SERVER_URL=http://127.0.0.1:31002/v1/chat/completions \
bash scripts/run_inference.sh

For a more detailed SGLang guide, including environment setup, tokenizer requirements, server launch options, and request examples, see docs/sglang/README.md.
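If you prefer to call the server directly, the sketch below builds an OpenAI-style chat-completions body with one inline image and posts it to the URL shown above. It is an assumption-laden illustration, not a copy of docs/sglang/mineru_request.py: the `model` value, the PNG MIME type, and the absence of diffusion-specific sampling fields are all guesses, so check the SGLang README for the fields the server actually expects.

```python
import base64
import json
import urllib.request


def build_chat_payload(image_bytes, prompt="\nText Recognition:", model="default"):
    """Build an OpenAI-style chat-completions body with one inline image."""
    data_url = "data:image/png;base64," + base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,  # hypothetical name; use whatever the server registers
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": data_url}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
    }


def send(payload, url="http://127.0.0.1:31002/v1/chat/completions", timeout=120):
    """POST the payload as JSON and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

A typical call would read the page image with `open(path, "rb").read()`, pass the bytes to `build_chat_payload`, and hand the result to `send` once the server is running.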

📄 End-to-End Parsing

scripts/run_end2end.py runs the full two-stage document parsing pipeline (layout detection, then per-block recognition) on a single page image and merges the results:

  1. Detect page layout regions.
  2. Crop each detected block and run the matching prompt for text, table, or formula extraction.
  3. Merge retained blocks into a markdown result.
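The three steps above can be sketched as a small driver function. This is a simplified outline under stated assumptions, not the logic of scripts/run_end2end.py: the model calls are injected as callables, the label names in `PROMPT_FOR_LABEL` are hypothetical, and "retained" is reduced to "has non-empty content".

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]

# Hypothetical mapping from layout label to the prompt type used for the crop.
PROMPT_FOR_LABEL = {
    "table": "Table Recognition",
    "equation": "Formula Recognition",
}


@dataclass
class Block:
    box: Box
    label: str
    content: str = ""


def parse_page(
    page,
    detect_layout: Callable[[object], List[Block]],
    crop: Callable[[object, Box], object],
    recognize: Callable[[object, str], str],
) -> str:
    """Run layout detection, per-block recognition, then merge into markdown."""
    blocks = detect_layout(page)  # step 1: layout regions
    for block in blocks:          # step 2: crop and run the matching prompt
        prompt = PROMPT_FOR_LABEL.get(block.label, "Text Recognition")
        block.content = recognize(crop(page, block.box), prompt).strip()
    kept = [b.content for b in blocks if b.content]  # step 3: keep non-empty blocks
    return "\n\n".join(kept)
```

Passing the model calls in as functions keeps the control flow easy to test with stubs before wiring in a real engine.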

Use the wrapper script below for local execution:

cd /path/to/MinerU-Diffusion
MODEL_PATH=/path/to/MinerU-Diffusion-model \
IMAGE_PATH=/path/to/input-page.png \
OUTPUT_PATH=/path/to/output.md \
BLOCKS_JSON_PATH=/path/to/output-blocks.json \
SAVE_LAYOUT_IMAGE=1 \
LAYOUT_IMAGE_PATH=/path/to/output-layout.png \
bash scripts/run_end2end.sh

Common environment variables:

Advanced generation controls are also exposed as environment variables in scripts/run_end2end.sh, including DTYPE, MAX_LENGTH, LAYOUT_GEN_LENGTH, CONTENT_GEN_LENGTH, TABLE_GEN_LENGTH, FORMULA_GEN_LENGTH, BLOCK_SIZE, TEMPERATURE, REMASK_STRATEGY, and DYNAMIC_THRESHOLD.
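When the wrapper is driven from Python, for example in a batch job, the same variables can be passed through the process environment. The helpers below are a hypothetical sketch: only the variable names come from this README, and the value formats the script accepts (for instance how DTYPE is spelled) are not documented here, so treat any values you pass as assumptions to verify against scripts/run_end2end.sh.

```python
import os
import subprocess


def end2end_env(overrides, base=None):
    """Merge overrides into a copy of the environment, stringifying every value."""
    env = dict(os.environ if base is None else base)
    env.update({key: str(value) for key, value in overrides.items()})
    return env


def run_end2end(repo_dir, **overrides):
    """Invoke scripts/run_end2end.sh from the repository root with the given variables."""
    return subprocess.run(
        ["bash", "scripts/run_end2end.sh"],
        cwd=repo_dir,
        env=end2end_env(overrides),
        check=True,
    )
```

A call such as `run_end2end("/path/to/MinerU-Diffusion", MODEL_PATH="/path/to/MinerU-Diffusion-model", IMAGE_PATH="/path/to/input-page.png", OUTPUT_PATH="/path/to/output.md")` is then equivalent to the shell invocation above.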

🤝 Acknowledgement

This work builds heavily on the following open-source models: MinerU, Qwen2-VL, SDAR, and LLaDA.

It also relies on these acceleration engines: SGLang, jetengine, and Nano-vLLM, which is the upstream basis for our nano_dvlm adaptation.

Its theoretical foundations come from MDLM, DiffuLLaMA, and Block Diffusion.

For the training code, we also reference dLLM-RL.

📚 Citation

If you find our paper and code useful in your research, please consider starring the repository and citing our work.

@article{dong2026minerudiffusion,
  title={MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding},
  author={Dong, Hejun and Niu, Junbo and Wang, Bin and Zeng, Weijun and Zhang, Wentao and He, Conghui},
  journal={arXiv preprint arXiv:2603.22458},
  year={2026}
}

@article{niu2025mineru2,
  title={MinerU2.5: A decoupled vision-language model for efficient high-resolution document parsing},
  author={Niu, Junbo and Liu, Zheng and Gu, Zhuangcheng and Wang, Bin and Ouyang, Linke and Zhao, Zhiyuan and Chu, Tao and He, Tianyao and Wu, Fan and Zhang, Qintong and others},
  journal={arXiv preprint arXiv:2509.22186},
  year={2025}
}

@article{wang2024mineru,
  title={MinerU: An open-source solution for precise document content extraction},
  author={Wang, Bin and Xu, Chao and Zhao, Xiaomeng and Ouyang, Linke and Wu, Fan and Zhao, Zhiyuan and Xu, Rui and Liu, Kaiwen and Qu, Yuan and Shang, Fukai and others},
  journal={arXiv preprint arXiv:2409.18839},
  year={2024}
}

@article{he2024opendatalab,
  title={OpenDataLab: Empowering general artificial intelligence with open datasets},
  author={He, Conghui and Li, Wei and Jin, Zhenjiang and Xu, Chao and Wang, Bin and Lin, Dahua},
  journal={arXiv preprint arXiv:2407.13773},
  year={2024}
}

📄 License

This project is licensed under the MIT License. See the LICENSE file for details.

For related upstream projects and ecosystem tools, see the links below.