GitHub - opendatalab/MinerU-Diffusion: A diffusion-based framework for document OCR that replaces autoregressive decoding with block-level parallel diffusion decoding.

MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding

📰 News

[2026/3/24] 🔥 We release MinerU-Diffusion-V1 — a 2.5B diffusion-based framework for document OCR that replaces autoregressive decoding with block-level parallel diffusion decoding.

🎯 Roadmap

Our long-term goal is to build efficient and reliable 2.5B diffusion-based decoding for document OCR.

✅ Release MinerU-Diffusion-V1: A 2.5B diffusion-based framework for document OCR that replaces autoregressive decoding with block-level parallel diffusion decoding.
✅ Support SGLang to accommodate diffusion computation.
✅ Complete the Nano-vLLM adaptation used by our nano_dvlm engine for single-GPU inference.
✅ Complete the Gradio-based interactive demo implementation.
⬜ Release MinerU-Diffusion-V2: Smaller, Faster, More Elegant, More Powerful!
⬜ Release Training Code

💡 TL;DR

MinerU-Diffusion reframes document OCR as an inverse rendering problem and replaces slow, error-prone autoregressive decoding with parallel diffusion decoding.

By introducing block-wise diffusion and uncertainty-driven curriculum learning, it achieves up to 3.2× faster decoding while improving robustness and reducing reliance on language priors.

Diffusion decoding progressively reconstructs structured text from masked tokens under visual conditioning: black tokens are confirmed, red tokens are being updated, and yellow tokens remain masked, enabling parallel generation with global consistency, in contrast to autoregressive left-to-right decoding.

Training of MinerU-Diffusion. Left: the target token sequence is randomly masked to form a partially observed input, and the model predicts only the masked positions under visual and prompt conditioning. Right: the structured block-attention mask used during training, where tokens attend bidirectionally within each block and causally to all preceding blocks, enabling parallel diffusion refinement within blocks while preserving coarse autoregressive structure across blocks.
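The block-attention pattern described above can be illustrated with a small sketch. It is not taken from the repository: the function names are hypothetical, and the visual and prompt prefix is left out for brevity. It builds a boolean mask in which each position attends bidirectionally within its own block and causally to all preceding blocks.

```python
def block_attention_mask(seq_len: int, block_length: int) -> list[list[bool]]:
    """Return mask[i][j] == True when query position i may attend to key position j."""
    if seq_len < 0 or block_length <= 0:
        raise ValueError("seq_len must be >= 0 and block_length must be > 0")
    mask = []
    for i in range(seq_len):
        block_i = i // block_length
        # A key is visible when its block is the query's own block (bidirectional)
        # or any earlier block (causal across blocks).
        mask.append([(j // block_length) <= block_i for j in range(seq_len)])
    return mask


def render(mask: list[list[bool]]) -> str:
    """Draw the mask with 'x' for visible and '.' for hidden positions."""
    return "\n".join("".join("x" if v else "." for v in row) for row in mask)


mask = block_attention_mask(seq_len=6, block_length=2)
print(render(mask))
# xx....
# xx....
# xxxx..
# xxxx..
# xxxxxx
# xxxxxx
```

Each pair of rows shares the same visibility pattern, which is what allows all tokens inside a block to be refined in parallel while later blocks still depend on earlier ones.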

📈 Performance

MinerU-Diffusion provides a flexible accuracy-throughput trade-off through threshold control. Compared with MinerU2.5, it achieves up to 3.26x TPS, while also offering practical operating points such as 2.12x speedup with 99.9% relative accuracy and 3.01x speedup with 98.8% relative accuracy.
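To give an intuition for how a confidence threshold can trade decoding steps against caution, here is a toy sketch. It is an assumption-laden illustration, not the repository's remasking implementation: `toy_confidences` is a hypothetical stand-in for the model, and the rule "accept every masked token at or above the threshold, otherwise accept the single most confident one" is one common way to guarantee progress.

```python
def toy_confidences(masked, num_confirmed):
    """Hypothetical model stand-in: confidence per masked position in a 4-token block.

    Confidence rises slightly as more tokens in the block are confirmed.
    """
    base = [0.99, 0.92, 0.85, 0.60]
    return {pos: min(1.0, base[pos] + 0.04 * num_confirmed) for pos in masked}


def decode_block(block_length, threshold, score_fn=toy_confidences):
    """Count the denoising steps needed to unmask one block."""
    masked = set(range(block_length))
    steps = 0
    while masked:
        conf = score_fn(masked, block_length - len(masked))
        # Accept all positions whose confidence reaches the threshold in parallel.
        accepted = {pos for pos, c in conf.items() if c >= threshold}
        if not accepted:
            # Fall back to the single most confident position so the loop advances.
            accepted = {max(conf, key=conf.get)}
        masked -= accepted
        steps += 1
    return steps


for threshold in (0.95, 0.80, 0.50):
    print(threshold, decode_block(4, threshold))
# 0.95 4
# 0.8 2
# 0.5 1
```

A lower threshold confirms more tokens per step and therefore finishes in fewer steps. The toy does not model accuracy; in practice, accepting low-confidence tokens earlier is where the accuracy cost would come from.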

🗂️ Repository Layout

MinerU-Diffusion/
├── .gitignore
├── LICENSE
├── README.md
├── requirements.txt
├── assets/
│   ├── banner.png
│   ├── decode.png
│   ├── homepage-demo.mp4
│   ├── image.png
│   ├── performance_tradeoff.jpeg
│   └── train.png
├── docs/
│   ├── MinerU-Diffusion-V1.pdf
│   ├── gradio/
│   │   ├── .gitignore
│   │   ├── app.py
│   │   ├── diffusion_hf.py
│   │   ├── mineru_hf.py
│   │   ├── runtime_paths.example.json
│   │   └── speed_compare/
│   └── sglang/
│       ├── README.md
│       ├── mineru_request.py
│       ├── run_infer.sh
│       └── run_server.sh
├── engines/
│   ├── __init__.py
│   ├── hf/
│   │   ├── __init__.py
│   │   └── runner.py
│   ├── nano_dvlm/
│   │   ├── .gitignore
│   │   ├── LICENSE
│   │   ├── __init__.py
│   │   ├── nanovllm/
│   │   ├── bench.py
│   │   ├── example.py
│   │   ├── llm_outputs/
│   │   └── pyproject.toml
│   └── sglang/
│       └── __init__.py
├── mineru_diffusion/
│   ├── __init__.py
│   ├── configuration_mineru_diffusion.py
│   ├── modeling_mineru_diffusion.py
│   ├── processing_mineru_diffusion.py
│   └── utils/
│       ├── __init__.py
│       └── bbox.py
└── scripts/
    ├── run_end2end.py
    ├── run_end2end.sh
    ├── run_inference.py
    ├── run_inference.sh
    └── run_sglang_server.sh

🌐 Online Experience

Official online web application

The official web application provides a more complete product experience, including a polished interface and richer features. Login is required.

Gradio-based online demo

A lightweight Gradio WebUI for trying the core parsing workflow. No login is required.

🛠️ Environment Setup

For a first-time setup, we recommend creating a dedicated Conda environment named dmineru and installing the dependencies below.

Recommended core versions:

Python 3.12.12
torch 2.8.0+cu128
torchvision 0.23.0+cu128
torchaudio 2.8.0+cu128
transformers 4.52.1
triton 3.4.0
flash-attn 2.8.3
liger-kernel 0.6.4
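A quick way to compare an existing environment against the recommended versions is sketched below. It uses only the standard library. The distribution names passed to `importlib.metadata` are assumptions based on the package names listed above, and local build suffixes such as `+cu128` are ignored in the comparison.

```python
from importlib import metadata

# Recommended versions from the list above (build suffixes omitted).
RECOMMENDED = {
    "torch": "2.8.0",
    "torchvision": "0.23.0",
    "torchaudio": "2.8.0",
    "transformers": "4.52.1",
    "triton": "3.4.0",
    "flash-attn": "2.8.3",
    "liger-kernel": "0.6.4",
}


def check_versions(recommended=RECOMMENDED):
    """Return {name: (installed_or_None, recommended, matches)} for each package."""
    report = {}
    for name, wanted in recommended.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Strip a local version suffix such as "+cu128" before comparing.
        matches = installed is not None and installed.split("+")[0] == wanted
        report[name] = (installed, wanted, matches)
    return report


for name, (installed, wanted, ok) in check_versions().items():
    status = "ok" if ok else "differs"
    print(f"{name:14s} installed={installed} recommended={wanted} [{status}]")
```

A "differs" line is not necessarily an error; it only flags packages worth a second look if inference fails.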

Create and install the environment:

conda create -n dmineru python=3.12 -y
conda activate dmineru

pip install --upgrade pip
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128
pip install "transformers>=4.52.1"
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
pip install flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
pip install -r requirements.txt

The root-level requirements.txt covers:

the Hugging Face inference path (ENGINE=hf)
the built-in Nano-DVLM path (ENGINE=nano_dvlm)
the client-side request path for the OpenAI-compatible SGLang endpoint (ENGINE=sglang)

Notes:

The requirements file uses the CUDA 12.8 PyTorch wheel index and pins a tested set of core package versions for first-time setup.
flash-attn==2.8.3 must match your local CUDA, compiler, and PyTorch stack. If a prebuilt wheel is not available for your machine, install a compatible wheel manually or build it from source before retrying pip install -r requirements.txt.
The sglang server binary itself is not installed by the root requirements.txt. If you want to run scripts/run_sglang_server.sh, install sglang in a dedicated environment or SGLang checkout first, then follow docs/sglang/README.md.

📦 Model Weights

Download the model weights before running inference, then point MODEL_PATH to the local checkpoint directory.

Hugging Face: opendatalab/MinerU-Diffusion-V1-0320-2.5B
ModelScope: OpenDataLab/MinerU-Diffusion-V1-0320-2.5B

Example:

MODEL_PATH=/path/to/MinerU-Diffusion-V1-0320-2.5B
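One possible way to fetch the Hugging Face checkpoint into a local directory is sketched below. It assumes the `huggingface_hub` package is installed; the `checkpoints` root directory and the helper names are hypothetical choices for illustration.

```python
from pathlib import Path

REPO_ID = "opendatalab/MinerU-Diffusion-V1-0320-2.5B"


def local_dir_for(repo_id, root="checkpoints"):
    """Map a repo id to a local directory, e.g. checkpoints/MinerU-Diffusion-V1-0320-2.5B."""
    return Path(root) / repo_id.split("/")[-1]


def download_weights(repo_id=REPO_ID, root="checkpoints"):
    """Download the full model snapshot and return the local checkpoint directory."""
    # Imported lazily so the helpers above work without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = local_dir_for(repo_id, root)
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target


# Usage (requires network access):
#   model_path = download_weights()
#   print(f"MODEL_PATH={model_path}")
```

The returned path can then be used as `MODEL_PATH` in the commands that follow.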

🧩 Prompt Types

MinerU-Diffusion supports multiple prompt types for different document parsing targets. Each prompt is designed for a specific output structure rather than a single generic free-form response.

Prompt Type	Function	Input Setting	Output Format	Example Output
Layout Detection	Page-level layout parsing with region coordinates, category tags, and rotation direction.	Resized to 1036 x 1036.	Bounding boxes plus element labels and rotation tags.	<|box_start|>100 200 300 400<|box_end|> <|ref_start|>title<|...
Text Recognition	Plain OCR text extraction.	Native resolution, 4 to 2048 image tokens.	Raw OCR text.	The results of the analyses of the uncertainty of the field data and related assumptions are shown in Figs 13 and 14.
Formula Recognition	Formula extraction and conversion into LaTeX.	Native resolution, 4 to 2048 image tokens.	LaTeX formula content.	\hat{F} = \operatorname{Concat}([F_1, F_2, \dots, F_n])
Table Recognition	Structured table extraction for downstream processing.	Native resolution, 4 to 2048 image tokens.	OTSL (Open Table Structure Language).	Site Cl NO3 SO4 Na ...
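For downstream use, the layout output usually needs to be turned into structured records. The sketch below is a minimal parser under stated assumptions: it relies only on the `<|box_start|>`, `<|box_end|>`, and `<|ref_start|>` tokens from the layout example in the table, and the closing `<|ref_end|>` token in the sample string is hypothetical. Rotation tags are not handled, and this is not the repository's own parsing code.

```python
import re

# Four integers between the box tokens, followed by a label after <|ref_start|>.
BOX_PATTERN = re.compile(
    r"<\|box_start\|>\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*<\|box_end\|>"
    r"\s*<\|ref_start\|>([^<]+)"
)


def parse_layout(raw):
    """Extract bounding boxes and labels from a raw layout-detection string."""
    blocks = []
    for match in BOX_PATTERN.finditer(raw):
        x1, y1, x2, y2 = (int(v) for v in match.groups()[:4])
        blocks.append({"bbox": (x1, y1, x2, y2), "label": match.group(5).strip()})
    return blocks


# The trailing <|ref_end|> token is an assumption made for this sample only.
sample = "<|box_start|>100 200 300 400<|box_end|> <|ref_start|>title<|ref_end|>"
print(parse_layout(sample))
# [{'bbox': (100, 200, 300, 400), 'label': 'title'}]
```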

🚀 Inference

Replace MODEL_PATH and IMAGE_PATH with your own paths before running.

There are two local entry scripts:

scripts/run_inference.sh: single prompt inference for one engine (hf, nano_dvlm, or sglang)
scripts/run_end2end.sh: two-stage page parsing with layout detection plus per-block content extraction, producing merged markdown and optional structured artifacts

Transformers Example

import torch
from transformers import AutoModel, AutoProcessor, AutoTokenizer

model_id = "Niujunbo2002/MinerU-Diffusion-V1-0320-2.5B"
image_path = "path/to/page.png"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(
    model_id,
    trust_remote_code=True,
    use_fast=False,
)
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
).eval().to("cuda")

messages = [
    {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": "\nText Recognition:"},
        ],
    },
]

prompt_text = processor.apply_chat_template(messages, add_generation_prompt=True)
if isinstance(prompt_text, tuple):
    prompt_text = prompt_text[0]

inputs = processor(
    images=[image_path],
    text=prompt_text,
    truncation=True,
    max_length=4096,
    return_tensors="pt",
)
input_ids = inputs["input_ids"].to(torch.long).to("cuda")
pixel_values = inputs["pixel_values"].to(torch.bfloat16).to("cuda")
image_grid_thw = inputs.get("image_grid_thw")
if image_grid_thw is not None:
    image_grid_thw = image_grid_thw.to(torch.long).to("cuda")

with torch.no_grad():
    generate_outputs = model.generate(
        pixel_values=pixel_values,
        image_grid_thw=image_grid_thw,
        input_ids=input_ids,
        mask_token_id=tokenizer.convert_tokens_to_ids("<|MASK|>"),
        denoising_steps=32,
        gen_length=1024,
        block_length=32,
        temperature=1.0,
        remasking_strategy="low_confidence_dynamic",
        dynamic_threshold=0.95,
        tokenizer=tokenizer,
        stopping_criteria=["<|endoftext|>", "<|im_end|>"],
    )

output_ids = generate_outputs[0] if isinstance(generate_outputs, tuple) else generate_outputs
text = tokenizer.decode(output_ids[0], skip_special_tokens=False)
for stop in ("<|endoftext|>", "<|im_end|>"):
    text = text.split(stop, 1)[0]

print(text.strip())

HF Engine

cd /path/to/MinerU-Diffusion
ENGINE=hf \
MODEL_PATH=/path/to/MinerU-Diffusion-model \
IMAGE_PATH=/path/to/input-image.png \
bash scripts/run_inference.sh

Nano-DVLM Engine

cd /path/to/MinerU-Diffusion
ENGINE=nano_dvlm \
MODEL_PATH=/path/to/MinerU-Diffusion-model \
IMAGE_PATH=/path/to/input-image.png \
bash scripts/run_inference.sh

SGLang Engine

Start the SGLang server first:

cd /path/to/MinerU-Diffusion
MODEL_PATH=/path/to/MinerU-Diffusion-model \
bash scripts/run_sglang_server.sh

Then send the request through the unified inference entry:

cd /path/to/MinerU-Diffusion
ENGINE=sglang \
MODEL_PATH=/path/to/MinerU-Diffusion-model \
IMAGE_PATH=/path/to/input-image.png \
SGLANG_SERVER_URL=http://127.0.0.1:31002/v1/chat/completions \
bash scripts/run_inference.sh

For a more detailed SGLang guide, including environment setup, tokenizer requirements, server launch options, and request examples, see docs/sglang/README.md.
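Because the server exposes an OpenAI-compatible chat completions endpoint, a request can also be assembled by hand. The following is a rough sketch using only the standard library, not a copy of docs/sglang/mineru_request.py. The `model` value is a placeholder assumption, and the exact fields the server expects may differ, so treat the SGLang guide as authoritative.

```python
import base64
import json
import urllib.request

SERVER_URL = "http://127.0.0.1:31002/v1/chat/completions"


def build_payload(image_bytes, prompt="\nText Recognition:",
                  model="MinerU-Diffusion", mime="image/png"):
    """Build an OpenAI-style chat payload with one inline image and one text prompt.

    The default model name is a placeholder, not a value confirmed by the repository.
    """
    data_url = f"data:{mime};base64," + base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": data_url}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
    }


def send(payload, url=SERVER_URL, timeout=300):
    """POST the payload and return the text of the first choice."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage (requires a running server):
#   with open("/path/to/input-image.png", "rb") as f:
#       print(send(build_payload(f.read())))
```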

📄 End-to-End Parsing

scripts/run_end2end.py runs the full two-step document parsing pipeline on a single page image:

Detect page layout regions.
Crop each detected block and run the matching prompt for text, table, or formula extraction.
Merge retained blocks into a markdown result.
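The three steps above can be sketched as a small driver function. This is an illustrative outline rather than the logic of scripts/run_end2end.py: the label names, the label-to-prompt mapping, and the injected `detect_layout`, `crop`, and `recognize` callables are all hypothetical.

```python
# Hypothetical mapping from layout labels to prompt types; other labels use text OCR.
PROMPT_FOR_LABEL = {"table": "Table Recognition", "equation": "Formula Recognition"}
# Hypothetical labels for paratext blocks that are dropped by default.
PARATEXT = {"header", "footer", "page_number"}


def parse_page(image, detect_layout, crop, recognize, keep_paratext=False):
    """Run layout detection, per-block recognition, and markdown merging."""
    parts = []
    for block in detect_layout(image):  # step 1: layout regions
        label = block["label"]
        if label in PARATEXT and not keep_paratext:
            continue
        prompt = PROMPT_FOR_LABEL.get(label, "Text Recognition")
        text = recognize(crop(image, block["bbox"]), prompt)  # step 2: per-block content
        parts.append(text.strip())
    return "\n\n".join(p for p in parts if p)  # step 3: merge retained blocks


# Stand-ins for the model so the control flow can be run without weights.
def fake_layout(image):
    return [
        {"bbox": (0, 0, 10, 2), "label": "header"},
        {"bbox": (0, 2, 10, 5), "label": "title"},
        {"bbox": (0, 5, 10, 9), "label": "table"},
    ]


def fake_crop(image, bbox):
    return bbox


def fake_recognize(region, prompt):
    return f"[{prompt} on {region}]"


print(parse_page("page.png", fake_layout, fake_crop, fake_recognize))
```

With the stand-ins above, the header block is skipped and the remaining two blocks are joined with a blank line between them.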

Use the wrapper script below for local execution:

cd /path/to/MinerU-Diffusion
MODEL_PATH=/path/to/MinerU-Diffusion-model \
IMAGE_PATH=/path/to/input-page.png \
OUTPUT_PATH=/path/to/output.md \
BLOCKS_JSON_PATH=/path/to/output-blocks.json \
SAVE_LAYOUT_IMAGE=1 \
LAYOUT_IMAGE_PATH=/path/to/output-layout.png \
bash scripts/run_end2end.sh

Common environment variables:

MODEL_PATH: local MinerU-Diffusion model directory
IMAGE_PATH: input page image
OUTPUT_PATH: optional markdown output file; if empty, markdown is printed to stdout
BLOCKS_JSON_PATH: optional JSON file with metrics and parsed blocks
SAVE_LAYOUT_IMAGE=1: save a layout visualization with bounding boxes
LAYOUT_IMAGE_PATH: optional explicit path for the layout visualization
KEEP_PARATEXT=1: keep header, footer, page number, and other paratext blocks
VERBOSE=1: print per-block progress to stderr

Advanced generation controls are also exposed as environment variables in scripts/run_end2end.sh, including DTYPE, MAX_LENGTH, LAYOUT_GEN_LENGTH, CONTENT_GEN_LENGTH, TABLE_GEN_LENGTH, FORMULA_GEN_LENGTH, BLOCK_SIZE, TEMPERATURE, REMASK_STRATEGY, and DYNAMIC_THRESHOLD.
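When the pipeline is driven from Python, for example to loop over many pages, the same environment variables can be set programmatically. The sketch below assumes the script is launched from the repository root with `bash`; the helper names are hypothetical, and only variables named in this section are used.

```python
import os
import subprocess


def build_env(model_path, image_path, output_path=None, blocks_json_path=None,
              save_layout_image=False, keep_paratext=False, verbose=False, extra=None):
    """Return a copy of the current environment with the pipeline variables set."""
    env = dict(os.environ)
    env["MODEL_PATH"] = str(model_path)
    env["IMAGE_PATH"] = str(image_path)
    if output_path:
        env["OUTPUT_PATH"] = str(output_path)
    if blocks_json_path:
        env["BLOCKS_JSON_PATH"] = str(blocks_json_path)
    if save_layout_image:
        env["SAVE_LAYOUT_IMAGE"] = "1"
    if keep_paratext:
        env["KEEP_PARATEXT"] = "1"
    if verbose:
        env["VERBOSE"] = "1"
    # Advanced controls such as DYNAMIC_THRESHOLD or BLOCK_SIZE go through `extra`.
    env.update({key: str(value) for key, value in (extra or {}).items()})
    return env


def run_end2end(repo_dir, **kwargs):
    """Invoke scripts/run_end2end.sh from the repository root."""
    subprocess.run(
        ["bash", "scripts/run_end2end.sh"],
        cwd=repo_dir,
        env=build_env(**kwargs),
        check=True,
    )


# Usage:
#   run_end2end("/path/to/MinerU-Diffusion",
#               model_path="/path/to/MinerU-Diffusion-model",
#               image_path="/path/to/input-page.png",
#               output_path="/path/to/output.md",
#               extra={"DYNAMIC_THRESHOLD": 0.95})
```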

🤝 Acknowledgement

This work builds heavily on the following open-source models:

MinerU, Qwen2-VL, SDAR, and LLaDA.

It also builds on these acceleration engines:

SGLang, Nano-vLLM (the upstream basis for our nano_dvlm adaptation), and jetengine.

Its theoretical foundations come from:

MDLM, DiffuLLaMA, and Block Diffusion.

For the training code, we also reference dLLM-RL.

📚 Citation

If you find our paper and code useful in your research, please consider starring the repository and citing our work.

@article{dong2026minerudiffusion,
  title={MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding},
  author={Dong, Hejun and Niu, Junbo and Wang, Bin and Zeng, Weijun and Zhang, Wentao and He, Conghui},
  journal={arXiv preprint arXiv:2603.22458},
  year={2026}
}

@article{niu2025mineru2,
  title={MinerU2.5: A decoupled vision-language model for efficient high-resolution document parsing},
  author={Niu, Junbo and Liu, Zheng and Gu, Zhuangcheng and Wang, Bin and Ouyang, Linke and Zhao, Zhiyuan and Chu, Tao and He, Tianyao and Wu, Fan and Zhang, Qintong and others},
  journal={arXiv preprint arXiv:2509.22186},
  year={2025}
}

@article{wang2024mineru,
  title={MinerU: An open-source solution for precise document content extraction},
  author={Wang, Bin and Xu, Chao and Zhao, Xiaomeng and Ouyang, Linke and Wu, Fan and Zhao, Zhiyuan and Xu, Rui and Liu, Kaiwen and Qu, Yuan and Shang, Fukai and others},
  journal={arXiv preprint arXiv:2409.18839},
  year={2024}
}

@article{he2024opendatalab,
  title={OpenDataLab: Empowering general artificial intelligence with open datasets},
  author={He, Conghui and Li, Wei and Jin, Zhenjiang and Xu, Chao and Wang, Bin and Lin, Dahua},
  journal={arXiv preprint arXiv:2407.13773},
  year={2024}
}

📄 License

This project is licensed under the MIT License. See the LICENSE file for details.

For related upstream projects and ecosystem tools, see the links below.