Fix quantized Falcon-H1 model loading issues by shengliangxu · Pull Request #32728 · vllm-project/vllm

This change fixes the loading of quantized Falcon-H1 models. It originally targeted models quantized with Nvidia Model Optimizer (ModelOpt), but the issues apply equally to quantized models produced by other quantization libraries.

Specifically:

- Pass the quant_config to submodules.
- Allow remapping of the KV-cache scaling factors. ModelOpt stores k_scale and v_scale under k_proj and v_proj, but vLLM looks for them under attention.
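The remapping described in the second point can be sketched as follows. This is a minimal illustration, not the code from this PR: the function name `remap_kv_scale_name`, the `known_params` argument, and the example parameter paths (`self_attn.k_proj.k_scale`, `self_attn.attn.k_scale`) are assumptions chosen for the example, and the actual vLLM loader logic may differ.

```python
import re
from typing import Optional, Set

# Assumed checkpoint layout: KV-cache scales stored under the k/v projections,
# e.g. "...self_attn.k_proj.k_scale" and "...self_attn.v_proj.v_scale".
_KV_SCALE_RE = re.compile(r"\.(k|v)_proj\.(k|v)_scale$")


def remap_kv_scale_name(name: str, known_params: Set[str]) -> Optional[str]:
    """Map a projection-level KV scale name to an attention-level name.

    Returns the name unchanged if it is not a KV-cache scale, the remapped
    name if the model has a matching parameter, and None if the scale has
    no destination and should be skipped by the caller.
    """
    match = _KV_SCALE_RE.search(name)
    # Only remap k_proj.k_scale / v_proj.v_scale, not mixed combinations.
    if match is None or match.group(1) != match.group(2):
        return name
    remapped = name[: match.start()] + ".attn." + match.group(2) + "_scale"
    return remapped if remapped in known_params else None


# Hypothetical model parameter names, for illustration only.
params = {
    "model.layers.0.self_attn.attn.k_scale",
    "model.layers.0.self_attn.attn.v_scale",
}
print(remap_kv_scale_name("model.layers.0.self_attn.k_proj.k_scale", params))
```

Returning None for a scale with no destination lets a weight-loading loop skip that tensor instead of failing on an unknown parameter name.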

Tested using lm_eval on a B200 system:

Without quantization:

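|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7483|±  |0.0120|
|     |       |strict-match    |     5|exact_match|↑  |0.7839|±  |0.0113|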

With fp8 quantization using Nvidia ModelOpt:

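|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7369|±  |0.0121|
|     |       |strict-match    |     5|exact_match|↑  |0.7976|±  |0.0111|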
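One way to read the two tables side by side is to compare each accuracy difference against the combined standard error. The sketch below does that with the reported values. It treats the two runs as independent samples, which is a simplifying assumption (both runs use the same gsm8k test set), and the helper name `z_score` is made up for this example.

```python
import math


def z_score(value_a: float, stderr_a: float, value_b: float, stderr_b: float) -> float:
    """Difference between two scores in units of their combined standard error.

    Assumes the two measurements are independent, so the standard errors
    add in quadrature.
    """
    combined = math.sqrt(stderr_a ** 2 + stderr_b ** 2)
    return (value_b - value_a) / combined


# (baseline value, baseline stderr, fp8 value, fp8 stderr), from the tables above.
results = {
    "flexible-extract": (0.7483, 0.0120, 0.7369, 0.0121),
    "strict-match": (0.7839, 0.0113, 0.7976, 0.0111),
}

for name, (base, base_se, fp8, fp8_se) in results.items():
    print(f"{name}: delta={fp8 - base:+.4f}, z={z_score(base, base_se, fp8, fp8_se):+.2f}")
```

Under that assumption, both differences fall within one combined standard error, and they point in opposite directions for the two filters.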

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

gemini-code-assist[bot] reviewed Jan 20, 2026

PiratePai pushed a commit to PiratePai/epd_shm that referenced this pull request

Feb 3, 2026

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Signed-off-by: Pai <416932041@qq.com>
