Fix quantized Falcon-H1 model loading issues by shengliangxu · Pull Request #32728 · vllm-project/vllm
This change fixes the loading of quantized Falcon-H1 models. It originally targeted models quantized with Nvidia Model Optimizer (ModelOpt), but the same issues affect quantized models produced by other quantization libraries.
Specifically:
- Pass the quant_config to submodules.
- Allow remapping of the kv-cache scaling factors. ModelOpt puts k_scale and v_scale under k_proj and v_proj, but vLLM looks for them under the attention module.
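The two fixes can be illustrated with a minimal, self-contained sketch. It does not reproduce the actual vLLM code: `QuantConfig`, `Linear`, `Attention`, and `remap_kv_scale_name` are hypothetical names invented for illustration, and the target path `.attn.k_scale` is an assumption about where the attention module expects the scales.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QuantConfig:
    """Hypothetical stand-in for a quantization config object."""
    method: str = "fp8"


@dataclass
class Linear:
    """Toy layer that records which quantization config it was built with."""
    prefix: str
    quant_config: Optional[QuantConfig] = None


class Attention:
    """Toy attention block that forwards quant_config to its projections."""

    def __init__(self, prefix: str, quant_config: Optional[QuantConfig] = None):
        # If quant_config were not forwarded here, the projections would be
        # built as unquantized layers and the quantized weights would not fit.
        self.k_proj = Linear(f"{prefix}.k_proj", quant_config)
        self.v_proj = Linear(f"{prefix}.v_proj", quant_config)


def remap_kv_scale_name(name: str, expected: set[str]) -> Optional[str]:
    """Map a checkpoint kv-scale name to the name the model expects.

    Returns the name unchanged if it is not a kv-cache scale, the remapped
    name if the model has a matching parameter, and None otherwise.
    """
    for suffix in (".k_proj.k_scale", ".v_proj.v_scale"):
        if name.endswith(suffix):
            scale = suffix.rsplit(".", 1)[1]  # "k_scale" or "v_scale"
            # Assumed target location: directly under the attention module.
            remapped = f"{name[:-len(suffix)]}.attn.{scale}"
            return remapped if remapped in expected else None
    return name


if __name__ == "__main__":
    model_params = {
        "model.layers.0.self_attn.attn.k_scale",
        "model.layers.0.self_attn.attn.v_scale",
    }
    print(remap_kv_scale_name("model.layers.0.self_attn.k_proj.k_scale", model_params))
```

A weight loader would call such a helper on every checkpoint tensor name and skip tensors for which it returns `None`.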
Tested using lm_eval on a B200 system:
Without quantization:
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.7483 | ± | 0.0120 |
| | | strict-match | 5 | exact_match | ↑ | 0.7839 | ± | 0.0113 |
With fp8 quantization using Nvidia ModelOpt:
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.7369 | ± | 0.0121 |
| | | strict-match | 5 | exact_match | ↑ | 0.7976 | ± | 0.0111 |
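As a rough sanity check on the two lm_eval tables above, the sketch below compares each fp8 score with its unquantized counterpart against a combined standard error. Treating the two runs as independent is an assumption made only for illustration (both runs use the same gsm8k test set), and the `compare` helper is a hypothetical name, not part of lm_eval.

```python
import math

# (value, stderr) pairs copied from the two lm_eval tables above.
baseline = {"flexible-extract": (0.7483, 0.0120), "strict-match": (0.7839, 0.0113)}
fp8 = {"flexible-extract": (0.7369, 0.0121), "strict-match": (0.7976, 0.0111)}


def compare(base, quant):
    """Return (difference, combined stderr), assuming independent runs."""
    diff = quant[0] - base[0]
    combined = math.hypot(base[1], quant[1])  # sqrt(se_base^2 + se_quant^2)
    return diff, combined


if __name__ == "__main__":
    for name in baseline:
        diff, se = compare(baseline[name], fp8[name])
        print(f"{name}: diff={diff:+.4f}, combined stderr={se:.4f}, "
              f"ratio={abs(diff) / se:.2f}")
```

Under that assumption, both differences are smaller than one combined standard error, so the fp8 results are statistically indistinguishable from the unquantized ones on this task.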
Signed-off-by: Shengliang Xu shengliangx@nvidia.com
gemini-code-assist (bot) reviewed Jan 20, 2026
PiratePai pushed a commit to PiratePai/epd_shm that referenced this pull request
Signed-off-by: Shengliang Xu shengliangx@nvidia.com Co-authored-by: Cyrus Leung tlleungac@connect.ust.hk Signed-off-by: Pai 416932041@qq.com
gameofdimension pushed a commit to gameofdimension/vllm that referenced this pull request
Signed-off-by: Shengliang Xu shengliangx@nvidia.com Co-authored-by: Cyrus Leung tlleungac@connect.ust.hk Signed-off-by: felix01.yu felix01.yu@vipshop.com
mystous pushed a commit to mystous/vllm_hybrid that referenced this pull request
Signed-off-by: Shengliang Xu shengliangx@nvidia.com Co-authored-by: Cyrus Leung tlleungac@connect.ust.hk
my-other-github-account pushed a commit to my-other-github-account/vllm that referenced this pull request
Signed-off-by: Shengliang Xu shengliangx@nvidia.com Co-authored-by: Cyrus Leung tlleungac@connect.ust.hk
0826joyce pushed a commit to 0826joyce/vllm-serving-optimization that referenced this pull request
Signed-off-by: Shengliang Xu shengliangx@nvidia.com Co-authored-by: Cyrus Leung tlleungac@connect.ust.hk