[v1] Re-init input batch for multiple kv cache groups by heheda12345 · Pull Request #18654 · vllm-project/vllm
Signed-off-by: Chen Zhang zhangch99@outlook.com
…_batch
Signed-off-by: Chen Zhang zhangch99@outlook.com
WoosukKwon added the ready label
wangxiyuan pushed a commit to vllm-project/vllm-ascend that referenced this pull request
What this PR does / why we need it?
- Re-enable sleep mode test
- Fix nightly performance benchmark workflow
- Fix model-runner-v1 bug for upstream change
Signed-off-by: wangli wangli858794774@gmail.com
weijinqian0 pushed a commit to weijinqian0/vllm-ascend that referenced this pull request
What this PR does / why we need it?
- Re-enable sleep mode test
- Fix nightly performance benchmark workflow
- Fix model-runner-v1 bug for upstream change
Signed-off-by: wangli wangli858794774@gmail.com
momo609 pushed a commit to momo609/vllm-ascend that referenced this pull request
What this PR does / why we need it?
- Re-enable sleep mode test
- Fix nightly performance benchmark workflow
- Fix model-runner-v1 bug for upstream change
Signed-off-by: wangli wangli858794774@gmail.com
Signed-off-by: wangxiaoxin (A) w00664509@china.huawei.com
chopper0126 pushed a commit to chopper0126/vllm-ascend that referenced this pull request
What this PR does / why we need it?
- Re-enable sleep mode test
- Fix nightly performance benchmark workflow
- Fix model-runner-v1 bug for upstream change
Signed-off-by: wangli wangli858794774@gmail.com
Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request
What this PR does / why we need it?
- Re-enable sleep mode test
- Fix nightly performance benchmark workflow
- Fix model-runner-v1 bug for upstream change
Signed-off-by: wangli wangli858794774@gmail.com
yangzhe-2026 pushed a commit to yangzhe-2026/vllm-ascend that referenced this pull request
What this PR does / why we need it?
- Re-enable sleep mode test
- Fix nightly performance benchmark workflow
- Fix model-runner-v1 bug for upstream change
Signed-off-by: wangli wangli858794774@gmail.com
nanxingMy pushed a commit to nanxingMy/vllm-ascend that referenced this pull request
What this PR does / why we need it?
- Re-enable sleep mode test
- Fix nightly performance benchmark workflow
- Fix model-runner-v1 bug for upstream change
Signed-off-by: wangli wangli858794774@gmail.com
Signed-off-by: nanxing 1014662416@qq.com
0826joyce pushed a commit to 0826joyce/vllm-serving-optimization that referenced this pull request
Signed-off-by: Chen Zhang zhangch99@outlook.com
wenyili added a commit to wenyili/vllm that referenced this pull request
Move InputBatch creation from GPUModelRunner.__init__ to initialize_kv_cache (via a new initialize_input_batch method), so it is built with the final block sizes from kv_cache_config rather than a placeholder.
The original early initialization was a workaround for a UVA pinned-memory reuse bug (see vllm-project#18298): GPTQ's process_weights_after_loading replaced parameter objects, causing the old PackedvLLMParameter (which held the only Python reference to cpu_data) to be GC'd and its pinned memory returned to CachingHostAllocator. InputBatch, if created after load_model, would then reuse that memory for block_table_cpu, aliasing live GPTQ weight CUDA views.
This is now safe because the C++ lambda in csrc/cuda_view.cu captures cpu_tensor by value (base = cpu_tensor{}), keeping it alive for the lifetime of the UVA CUDA view regardless of Python-side GC. PR vllm-project#36461 confirmed this by removing the offload+quantization reinit guard added in vllm-project#18654.
The may_reinitialize_input_batch method is renamed to initialize_input_batch and the conditional block-size comparison is dropped — InputBatch is always created fresh in initialize_kv_cache. This also fixes a latent bug where cp_kv_cache_interleave_size was omitted from the reinit path.
Co-authored-by: Claude
Signed-off-by: liwenyi liwenyi199111@gmail.com
Signed-off-by: liwenyi lwy.lwy@163.com
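The control flow described in the commit message above can be illustrated with a minimal Python sketch. It is not the vLLM implementation: ModelRunner, InputBatch, KVCacheConfig and KVCacheGroup here are simplified, hypothetical stand-ins that only model block sizes. The method names initialize_kv_cache and initialize_input_batch follow the commit message, and everything else is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KVCacheGroup:
    # Hypothetical stand-in for one KV cache group; only the block size is modeled.
    block_size: int


@dataclass
class KVCacheConfig:
    # Hypothetical stand-in for kv_cache_config.
    groups: list[KVCacheGroup]


@dataclass
class InputBatch:
    # Hypothetical stand-in for InputBatch; records the block sizes it was built with.
    max_num_reqs: int
    block_sizes: list[int]


class ModelRunner:
    def __init__(self, max_num_reqs: int) -> None:
        self.max_num_reqs = max_num_reqs
        # No InputBatch is built here: the final block sizes are not known yet.
        self.input_batch: Optional[InputBatch] = None

    def initialize_input_batch(self, kv_cache_config: KVCacheConfig) -> None:
        # Always build a fresh batch from the final per-group block sizes,
        # with no comparison against a previously assumed placeholder.
        block_sizes = [group.block_size for group in kv_cache_config.groups]
        self.input_batch = InputBatch(self.max_num_reqs, block_sizes)

    def initialize_kv_cache(self, kv_cache_config: KVCacheConfig) -> None:
        # The input batch is created as part of KV cache initialization.
        self.initialize_input_batch(kv_cache_config)


runner = ModelRunner(max_num_reqs=8)
runner.initialize_kv_cache(KVCacheConfig([KVCacheGroup(16), KVCacheGroup(32)]))
print(runner.input_batch.block_sizes)
```

In this sketch the batch cannot be constructed with a stale block size, because it does not exist until initialize_kv_cache runs. The trade-off is that any code touching input_batch before KV cache initialization has to handle the None case.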