FireRed-OCR


🤗 HuggingFace | 🤖 ModelScope | 🖥️ Demo | 📄 Technical Report | 🐈 GitHub


Figure 1: Performance comparison on the OmniDocBench v1.5 benchmark. FireRed-OCR achieves state-of-the-art performance among end-to-end solutions, ranking first with an Overall score of 92.94%.

FireRed-OCR is a systematic framework designed to specialize general Large Vision-Language Models (LVLMs) into high-performance, pixel-precise structural document parsing experts.

General VLMs frequently suffer from "Structural Hallucination" (e.g., disordered rows, invented formulas) when processing complex documents. FireRed-OCR addresses this by shifting the paradigm from "impressionist" text generation to "structural engineering," achieving State-of-the-Art (SOTA) results among end-to-end models on authoritative benchmarks such as OmniDocBench v1.5.

✨ Key Features

📰 News

🗂️ Model Zoo

| Models | Base | Description | Download Link |
| --- | --- | --- | --- |
| FireRed-OCR-2B | Qwen3-VL-2B-Instruct | Lightweight version achieving 92.94% Overall on OmniDocBench v1.5. | 🤗 HuggingFace |

πŸ—οΈ Model Architecture

The FireRed-OCR framework transforms a general VLM into a structural expert through a three-stage progressive training strategy:

  1. Stage 1: Multi-task Pre-alignment: Trains the model on detection, region recognition, and layout-to-markdown tasks to ground visual perception.
  2. Stage 2: Specialized SFT: Fine-tunes on a high-quality, standardized Markdown dataset to ensure logical consistency and hierarchical expression.
  3. Stage 3: Format-Constrained GRPO: Applies Reinforcement Learning with specific rewards for Formula Syntax, Table Integrity, Hierarchical Closure, and Text Accuracy.
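The README does not spell out how the Stage 3 rewards are computed. Purely as an illustration, the sketch below shows what rule-based checks for formula syntax, table integrity, hierarchical closure, and text accuracy could look like, together with the group-relative advantage normalization that GRPO is generally described as using. Every function name, the specific checks, and the equal weighting are assumptions made for this example, not the authors' implementation.

```python
import re
import statistics
from difflib import SequenceMatcher

VOID_TAGS = {"br", "hr", "img"}  # HTML tags that never take a closing tag


def formula_syntax_ok(md: str) -> bool:
    """Hypothetical check: '$' delimiters pair up and braces balance."""
    stripped = re.sub(r"\\.", "", md)  # drop escaped characters such as \$ or \{
    if stripped.count("$") % 2:
        return False
    depth = 0
    for ch in stripped:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def table_integrity_ok(md: str) -> bool:
    """Hypothetical check: rows of each pipe table share one cell count."""
    widths = []
    for line in md.splitlines() + [""]:  # trailing sentinel flushes the last table
        line = line.strip()
        if len(line) > 1 and line.startswith("|") and line.endswith("|"):
            widths.append(len(line[1:-1].split("|")))
        else:
            if len(set(widths)) > 1:
                return False
            widths = []
    return True


def hierarchy_closed(md: str) -> bool:
    """Hypothetical check: HTML-style tags are properly nested and closed."""
    stack = []
    for slash, name in re.findall(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>", md):
        name = name.lower()
        if name in VOID_TAGS:
            continue
        if not slash:
            stack.append(name)
        elif not stack or stack.pop() != name:
            return False
    return not stack


def text_accuracy(pred: str, ref: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for a real text metric."""
    return SequenceMatcher(None, pred, ref).ratio()


def reward(pred: str, ref: str) -> float:
    """Equal-weight mix of the four signals (the weighting is an assumption)."""
    checks = [formula_syntax_ok(pred), table_integrity_ok(pred), hierarchy_closed(pred)]
    return (sum(checks) + text_accuracy(pred, ref)) / 4.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and std of its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

In a setup like this, several candidate outputs would be sampled for the same page, each scored with `reward`, and the normalized advantages would weight the policy update so that well-formed outputs are reinforced relative to their siblings.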

⚡️ Quick Start

FireRed-OCR is based on the Qwen3-VL architecture. You can use the following code snippets to generate structured Markdown from document images.

1. Install Dependencies

    pip install transformers
    pip install qwen-vl-utils
    git clone https://github.com/FireRedTeam/FireRed-OCR.git
    cd FireRed-OCR

2. Inference

    import torch
    from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
    from conv_for_infer import generate_conv  # local module; run from the cloned repo root

    # Load the model
    model = Qwen3VLForConditionalGeneration.from_pretrained(
        "FireRedTeam/FireRed-OCR",
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )

    # We recommend enabling flash_attention_2 for better acceleration and memory saving,
    # especially in multi-image and video scenarios.
    # model = Qwen3VLForConditionalGeneration.from_pretrained(
    #     "FireRedTeam/FireRed-OCR",
    #     dtype=torch.bfloat16,
    #     attn_implementation="flash_attention_2",
    #     device_map="auto",
    # )

    processor = AutoProcessor.from_pretrained("FireRedTeam/FireRed-OCR")

    # Prepare input
    image_path = "./examples/complex_table.png"
    messages = generate_conv(image_path)

    # Preparation for inference
    inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
    )
    inputs = inputs.to(model.device)

    # Inference: generation of the output
    generated_ids = model.generate(**inputs, max_new_tokens=8192)
    # Drop the prompt tokens so that only newly generated tokens are decoded
    generated_ids_trimmed = [
        out_ids[len(in_ids):]
        for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False,
    )
    print(output_text)
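Note that `processor.batch_decode` returns a list with one string per input, so `output_text` is a list and the Markdown for a single image is `output_text[0]`. A small helper along the following lines could write that result next to the source image. The function name `save_markdown` and the choice of output path are illustrative assumptions, not part of the FireRed-OCR repository.

```python
from pathlib import Path


def save_markdown(output_text, image_path):
    """Write the first decoded string to <image stem>.md and return that path.

    Hypothetical helper: `output_text` is the list returned by batch_decode.
    """
    if not output_text:
        raise ValueError("output_text is empty; nothing to save")
    out_path = Path(image_path).with_suffix(".md")
    out_path.write_text(output_text[0].strip() + "\n", encoding="utf-8")
    return out_path
```

With the variables from the inference code, `save_markdown(output_text, image_path)` would write the result to `./examples/complex_table.md`.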

📊 Benchmark

We evaluate FireRed-OCR on OmniDocBench v1.5 and FireRedBench.

OmniDocBench v1.5

| Model | Overall ↑ | Text Edit ↓ | Formula CDM ↑ | Table TEDS ↑ | Table TEDS_s ↑ | Reading-order Edit ↓ |
| --- | --- | --- | --- | --- | --- | --- |
| **Pipeline** | | | | | | |
| Dolphin | 74.67 | 0.125 | 67.85 | 68.70 | 77.77 | 0.124 |
| Dolphin-1.5 | 83.21 | 0.092 | 80.78 | 78.06 | 84.10 | 0.080 |
| PP-StructureV3 | 86.73 | 0.073 | 85.79 | 81.68 | 89.48 | 0.073 |
| MonkeyOCR-pro-1.2B | 86.96 | 0.084 | 85.02 | 84.24 | 89.02 | 0.130 |
| MonkeyOCR-3B | 87.13 | 0.075 | 87.45 | 81.39 | 85.92 | 0.129 |
| MonkeyOCR-pro-3B | 88.85 | 0.075 | 87.25 | 86.78 | 90.63 | 0.128 |
| MinerU2.5 | 90.67 | 0.047 | 88.46 | 88.22 | 92.38 | 0.044 |
| PaddleOCR-VL | 92.86 | 0.035 | 91.22 | 90.89 | 94.76 | 0.043 |
| PaddleOCR-VL-1.5 | 94.50 | 0.035 | 94.21 | 92.76 | 95.79 | 0.042 |
| GLM-OCR | 94.60 | - | - | - | - | - |
| **End-to-end** | | | | | | |
| OCRFlux-3B | 74.82 | 0.193 | 68.03 | 75.75 | 80.23 | 0.202 |
| Mistral OCR | 78.83 | 0.164 | 82.84 | 70.03 | 78.04 | 0.144 |
| InternVL3-76B | 80.33 | 0.131 | 83.42 | 70.64 | 77.74 | 0.113 |
| POINTS-Reader | 80.98 | 0.134 | 79.20 | 77.13 | 81.66 | 0.145 |
| olmOCR-7B | 81.79 | 0.096 | 86.04 | 68.92 | 74.77 | 0.121 |
| Qwen3-VL-2B | 81.87 | 0.100 | 85.87 | 69.77 | 74.37 | 0.115 |
| InternVL3.5-241B | 82.67 | 0.142 | 87.23 | 75.00 | 81.28 | 0.125 |
| GPT-5.2 | 85.50 | 0.123 | 86.11 | 82.66 | 87.35 | 0.099 |
| MinerU2-VLM | 85.56 | 0.078 | 80.95 | 83.54 | 87.66 | 0.086 |
| Nanonets-OCR-s | 85.59 | 0.093 | 85.90 | 80.14 | 85.57 | 0.108 |
| Qwen2.5-VL-72B | 87.02 | 0.094 | 88.27 | 82.15 | 86.22 | 0.102 |
| DeepSeek-OCR | 87.36 | 0.073 | 84.14 | 85.25 | 89.01 | 0.085 |
| dots.ocr | 88.41 | 0.048 | 83.22 | 86.78 | 90.62 | 0.053 |
| OCRVerse | 88.56 | 0.058 | 86.91 | 84.55 | 88.45 | 0.071 |
| Qwen3-VL-235B-A22B | 89.15 | 0.069 | 88.14 | 86.21 | 90.55 | 0.068 |
| Gemini-3.0 Pro | 90.33 | 0.065 | 89.18 | 88.28 | 90.29 | 0.071 |
| Qwen3.5-397B-A17B | 90.80 | - | - | - | - | - |
| DeepSeek-OCR 2 | 91.09 | 0.048 | 90.31 | 87.75 | 92.06 | 0.057 |
| FireRed-OCR-2B | 92.94 | 0.032 | 91.71 | 90.31 | 93.81 | 0.041 |
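The Overall column is not defined in this README. The rows above are, however, numerically consistent with averaging three components: (1 - Text Edit) × 100, Formula CDM, and Table TEDS. The sketch below checks that reading against a few rows. The formula is an inference from the table rather than a stated definition, and differences of about 0.01 are expected because Text Edit is rounded to three decimals.

```python
def overall_score(text_edit: float, formula_cdm: float, table_teds: float) -> float:
    """Assumed definition: mean of (1 - Text Edit) * 100, Formula CDM, Table TEDS."""
    return ((1.0 - text_edit) * 100.0 + formula_cdm + table_teds) / 3.0


# (Text Edit, Formula CDM, Table TEDS, reported Overall), copied from the table.
ROWS = {
    "FireRed-OCR-2B": (0.032, 91.71, 90.31, 92.94),
    "DeepSeek-OCR 2": (0.048, 90.31, 87.75, 91.09),
    "MinerU2.5": (0.047, 88.46, 88.22, 90.67),
}

for name, (edit, cdm, teds, reported) in ROWS.items():
    computed = overall_score(edit, cdm, teds)
    print(f"{name}: computed {computed:.2f}, reported {reported:.2f}")
```

Under this reading, a lower Text Edit score raises Overall directly, while Table TEDS_s and the reading-order metric are reported separately and do not enter the average.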

FireRedBench

| Model | Overall ↑ | Text Edit ↓ | Formula CDM ↑ | Table TEDS ↑ | Table TEDS_s ↑ | Reading-order Edit ↓ |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-5.2 🔒 | 68.09 | 0.238 | 66.33 | 61.74 | 68.00 | 0.380 |
| Gemini-3.0 Pro 🔒 | 79.68 | 0.169 | 80.11 | 75.82 | 82.73 | 0.353 |
| **Pipeline** | | | | | | |
| GLM-OCR | 74.33 | 0.309 | 82.53 | 71.35 | 79.93 | 0.456 |
| PaddleOCR-VL-1.5 | 76.47 | 0.291 | 92.37 | 66.15 | 74.39 | 0.453 |
| **End-to-end** | | | | | | |
| DeepSeek-OCR 2 | 61.61 | 0.290 | 58.78 | 55.06 | 59.42 | 0.437 |
| dots.ocr | 72.93 | 0.240 | 82.53 | 60.25 | 64.08 | 0.419 |
| Qwen3-VL-2B-Instruct | 65.58 | 0.283 | 75.19 | 49.85 | 55.66 | 0.388 |
| FireRed-OCR-2B | 74.62 | 0.248 | 83.02 | 65.63 | 72.30 | 0.430 |
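The Text Edit and reading-order Edit columns are edit-distance metrics, which is why lower values are better. As a rough illustration of the idea, the sketch below computes a Levenshtein distance normalized by the length of the longer string. The benchmarks' exact normalization and block-matching rules are not given here, so this is an assumption-laden sketch rather than the official scoring code.

```python
def normalized_edit_distance(pred: str, ref: str) -> float:
    """Levenshtein distance divided by the longer length; 0.0 means identical."""
    if not pred and not ref:
        return 0.0
    # prev[j] holds the distance between the processed prefix of pred and ref[:j].
    prev = list(range(len(ref) + 1))
    for i, p in enumerate(pred, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if p == r else 1
            curr.append(min(
                prev[j] + 1,         # delete a character from pred
                curr[j - 1] + 1,     # insert a character into pred
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1] / max(len(pred), len(ref))
```

For example, `normalized_edit_distance("kitten", "sitting")` needs three edits over a maximum length of seven, so a value near 0.25 in the tables corresponds to roughly one character in four differing from the reference under this simplified definition.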

Additional Benchmarks

| Model | OmniDocBench v1.5 | FireRedBench | OCRBench (TextRec) | TEDS_TEST | PubTabNet |
| --- | --- | --- | --- | --- | --- |
| GPT-5.2 🔒 | 85.50 | 68.09 | 93.0 | 67.6 | 84.4 |
| Gemini-3.0 Pro 🔒 | 90.33 | 79.68 | 91.9 | 81.8 | 91.4 |
| **Pipeline** | | | | | |
| MinerU2.5 | 90.67 | - | - | 85.4 | 88.4 |
| PaddleOCR-VL-1.5 | 94.50 | 76.47 | 53.5 / 87.0 | 83.3 | 84.6 |
| GLM-OCR | 94.60 | 74.33 | 61.0 / 95.0 | 86.0 | 85.2 |
| **End-to-end** | | | | | |
| dots.ocr | 88.41 | 72.93 | 92.1 | 62.4 | 71.0 |
| DeepSeek-OCR 2 | 91.09 | 61.61 | 48.5 | - | - |
| FireRed-OCR-2B | 92.94 | 74.62 | 93.5 | 80.6 | 77.0 |

For PaddleOCR-VL-1.5 and GLM-OCR on OCRBench, scores are reported as API / pure VLM.

📜 License Agreement

The code and the weights of FireRed-OCR are licensed under Apache 2.0.

πŸ–ŠοΈ Citation

We kindly encourage citation of our work if you find it useful.

    @article{fireredocr,
      title={FireRed-OCR Technical Report},
      author={Super Intelligence Team, Xiaohongshu Inc.},
      year={2026},
      eprint={2603.01840},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.01840}
    }

⚠️ Ethics Statement

FireRed-OCR is a technical tool designed for document digitization and structural parsing.

🤝 Acknowledgements

We would like to thank the developers of the amazing open-source projects, including Qwen-VL, PaddleOCR, olmOCR and the broader OCR community.
