and others added 3 commits

June 25, 2024 20:31

Robert Shaw added 2 commits

June 25, 2024 20:43

Co-authored-by: Michael Goin michael@neuralmagic.com

robertgshaw2-redhat changed the title ~~[ CI ] Added E2E Test For Compressed Tensors~~ [ CI/Build ] Added E2E Test For Compressed Tensors

Jun 25, 2024

Robert Shaw added 4 commits

June 26, 2024 01:14

auto-merge was automatically disabled

June 28, 2024 18:34

Pull Request is not mergeable

robertgshaw2-redhat added a commit to neuralmagic/nm-vllm that referenced this pull request

Jul 1, 2024

Co-authored-by: Michael Goin michael@neuralmagic.com Co-authored-by: Robert Shaw <rshaw@neuralmagic>

prashantgupta24 pushed a commit to opendatahub-io/vllm that referenced this pull request

Jul 1, 2024

Co-authored-by: Michael Goin michael@neuralmagic.com Co-authored-by: Robert Shaw <rshaw@neuralmagic>

prashantgupta24 pushed a commit to opendatahub-io/vllm that referenced this pull request

Jul 1, 2024

Co-authored-by: Michael Goin michael@neuralmagic.com Co-authored-by: Robert Shaw <rshaw@neuralmagic>

kzawora-intel added a commit to HabanaAI/vllm-fork that referenced this pull request

Jul 2, 2024

[Hardware][Intel] Optimize CPU backend and add more performance tips (vllm-project#4971)

Co-authored-by: Jianan Gu jianan.gu@intel.com

[Docs] Add 4th meetup slides (vllm-project#5509)
[Misc] Add vLLM version getter to utils (vllm-project#5098)
[CI/Build] Simplify OpenAI server setup in tests (vllm-project#5100)
[Doc] Update LLaVA docs (vllm-project#5437)

Co-authored-by: Roger Wang ywang@roblox.com

[Kernel] Factor out epilogues from cutlass kernels (vllm-project#5391)

Co-authored-by: Michael Goin michael@neuralmagic.com Co-authored-by: youkaichao youkaichao@gmail.com Co-authored-by: zifeitong zifei.tong@parasail.io Co-authored-by: Robert Shaw 114415538+robertgshaw2-neuralmagic@users.noreply.github.com

[MISC] Remove FP8 warning (vllm-project#5472)

Co-authored-by: Philipp Moritz pcmoritz@gmail.com

Seperate dev requirements into lint and test (vllm-project#5474)
Revert "[Core] Remove unnecessary copies in flash attn backend" (vllm-project#5478)
[misc] fix format.sh (vllm-project#5511)
[CI/Build] Disable test_fp8.py (vllm-project#5508)
[Kernel] Disable CUTLASS kernels for fp8 (vllm-project#5505)
Add cuda_device_count_stateless (vllm-project#5473)
[Hardware][Intel] Support CPU inference with AVX2 ISA (vllm-project#5452)
[Misc] Fix arg names in quantizer script (vllm-project#5507)
bump version to v0.5.0.post1 (vllm-project#5522)
[CI/Build][Misc] Add CI that benchmarks vllm performance on those PRs with perf-benchmarks label (vllm-project#5073)

Co-authored-by: simon-mo simon.mo@hey.com

[CI/Build] Disable LLaVA-NeXT CPU test (vllm-project#5529)
[Kernel] Fix CUTLASS 3.x custom broadcast load epilogue (vllm-project#5516)
[Misc] Fix arg names (vllm-project#5524)
[ Misc ] Rs/compressed tensors cleanup (vllm-project#5432)

Co-authored-by: mgoin michael@neuralmagic.com Co-authored-by: Dipika Sikka dipikasikka1@gmail.com

[Kernel] Suppress mma.sp warning on CUDA 12.5 and later (vllm-project#5401)
[mis] fix flaky test of test_cuda_device_count_stateless (vllm-project#5546)
[Core] Remove duplicate processing in async engine (vllm-project#5525)
[misc][distributed] fix benign error in is_in_the_same_node (vllm-project#5512)
[Docs] Add ZhenFund as a Sponsor (vllm-project#5548)
[Doc] Update documentation on Tensorizer (vllm-project#5471)
[Bugfix] Enable loading FP8 checkpoints for gpt_bigcode models (vllm-project#5460)

Signed-off-by: Thomas Parnell tpa@zurich.ibm.com

[Bugfix] Fix typo in Pallas backend (vllm-project#5558)
[Core][Distributed] improve p2p cache generation (vllm-project#5528)
Add ccache to amd (vllm-project#5555)
[Core][Bugfix]: fix prefix caching for blockv2 (vllm-project#5364)

Signed-off-by: Lei Wen wenlei03@qiyi.com Co-authored-by: Lei Wen wenlei03@qiyi.com

[mypy] Enable type checking for test directory (vllm-project#5017)
[CI/Build] Test both text and token IDs in batched OpenAI Completions API (vllm-project#5568)
[misc] Do not allow to use lora with chunked prefill. (vllm-project#5538)

Co-authored-by: Cyrus Leung tlleungac@connect.ust.hk

add gptq_marlin test for bug report vllm-project#5088 (vllm-project#5145)
[BugFix] Don't start a Ray cluster when not using Ray (vllm-project#5570)
[Fix] Correct OpenAI batch response format (vllm-project#5554)
Add basic correctness 2 GPU tests to 4 GPU pipeline (vllm-project#5518)
[CI][BugFix] Flip is_quant_method_supported condition (vllm-project#5577)
[build][misc] limit numpy version (vllm-project#5582)
[Doc] add debugging tips for crash and multi-node debugging (vllm-project#5581)
Fix w8a8 benchmark and add Llama-3-8B (vllm-project#5562)
[Model] Rename Phi3 rope scaling type (vllm-project#5595)
Correct alignment in the seq_len diagram. (vllm-project#5592)

Co-authored-by: Liqian Chen liqian.chen@deeplang.ai

[Kernel] compressed-tensors marlin 24 support (vllm-project#5435)
[Misc] use AutoTokenizer for benchmark serving when vLLM not installed (vllm-project#5588)
[Hardware][Intel GPU] Add Intel GPU(XPU) inference backend (vllm-project#3814)

Co-authored-by: Jiang Li jiang1.li@intel.com Co-authored-by: Abhilash Majumder abhilash.majumder@intel.com Co-authored-by: Abhilash Majumder 30946547+abhilash1910@users.noreply.github.com

[CI/BUILD] Support non-AVX512 vLLM building and testing (vllm-project#5574)
[CI] the readability of benchmarking and prepare for dashboard (vllm-project#5571)

[CI] Improve the readability of performance benchmarking results and prepare for upcoming performance dashboard (vllm-project#5571)

[bugfix][distributed] fix 16 gpus local rank arrangement (vllm-project#5604)
[Optimization] use a pool to reuse LogicalTokenBlock.token_ids (vllm-project#5584)
[Bugfix] Fix KV head calculation for MPT models when using GQA (vllm-project#5142)
[Fix] Use utf-8 encoding in entrypoints/openai/run_batch.py (vllm-project#5606)
[Speculative Decoding 1/2 ] Add typical acceptance sampling as one of the sampling techniques in the verifier (vllm-project#5131)
[Model] Initialize Phi-3-vision support (vllm-project#4986)
[Kernel] Add punica dimensions for Granite 13b (vllm-project#5559)

Signed-off-by: Joe Runde Joseph.Runde@ibm.com

[misc][typo] fix typo (vllm-project#5620)
[Misc] Fix typo (vllm-project#5618)
[CI] Avoid naming different metrics with the same name in performance benchmark (vllm-project#5615)
[bugfix][distributed] improve p2p capability test (vllm-project#5612)

[bugfix][distributed] do not error if two processes do not agree on p2p capability (vllm-project#5612)

[Misc] Remove import from transformers logging (vllm-project#5625)
[CI/Build][Misc] Update Pytest Marker for VLMs (vllm-project#5623)
[ci] Deprecate original CI template (vllm-project#5624)

Signed-off-by: kevin kevin@anyscale.com

[Misc] Add OpenTelemetry support (vllm-project#4687)

This PR adds basic support for OpenTelemetry distributed tracing. It includes changes to enable tracing functionality and improve monitoring capabilities.

I've also added a Markdown guide with screenshots that shows users how to use this feature.

[Misc] Add channel-wise quantization support for w8a8 dynamic per token activation quantization (vllm-project#5542)
[ci] Setup Release pipeline and build release wheels with cache (vllm-project#5610)

Signed-off-by: kevin kevin@anyscale.com

[Model] LoRA support added for command-r (vllm-project#5178)
[Bugfix] Fix for inconsistent behaviour related to sampling and repetition penalties (vllm-project#5639)

Signed-off-by: Thomas Parnell tpa@zurich.ibm.com

[Doc] Added cerebrium as Integration option (vllm-project#5553)
[Bugfix] Fix CUDA version check for mma warning suppression (vllm-project#5642)
[Bugfix] Fix w8a8 benchmarks for int8 case (vllm-project#5643)
[Bugfix] Fix Phi-3 Long RoPE scaling implementation (vllm-project#5628)
[Bugfix] Added test for sampling repetition penalty bug. (vllm-project#5659)

Signed-off-by: Thomas Parnell tpa@zurich.ibm.com

[Bugfix][CI/Build][AMD][ROCm]Fixed the cmake build bug which generate garbage on certain devices (vllm-project#5641)
[misc][distributed] use 127.0.0.1 for single-node (vllm-project#5619)
[Model] Add FP8 kv cache for Qwen2 (vllm-project#5656)
[Bugfix] Fix sampling_params passed incorrectly in Phi3v example (vllm-project#5684)
[Misc]Add param max-model-len in benchmark_latency.py (vllm-project#5629)
[CI/Build] Add tqdm to dependencies (vllm-project#5680)
[ci] Add A100 queue into AWS CI template (vllm-project#5648)

Signed-off-by: kevin kevin@anyscale.com

[Frontend][Bugfix] Fix preemption_mode -> preemption-mode for CLI arg in arg_utils.py (vllm-project#5688)
[ci][distributed] add tests for custom allreduce (vllm-project#5689)
[Bugfix] AsyncLLMEngine hangs with asyncio.run (vllm-project#5654)
[Doc] Update docker references (vllm-project#5614)

Signed-off-by: Rafael Vasquez rafvasq21@gmail.com

[Misc] Add per channel support for static activation quantization; update w8a8 schemes to share base classes (vllm-project#5650)
[ci] Limit num gpus if specified for A100 (vllm-project#5694)

Signed-off-by: kevin kevin@anyscale.com

[Misc] Improve conftest (vllm-project#5681)
[Bugfix][Doc] FIx Duplicate Explicit Target Name Errors (vllm-project#5703)
[Kernel] Update Cutlass int8 kernel configs for SM90 (vllm-project#5514)

Co-authored-by: Varun Sundar Rabindranath varun@neuralmagic.com

[Model] Port over CLIPVisionModel for VLMs (vllm-project#5591)
[Kernel] Update Cutlass int8 kernel configs for SM80 (vllm-project#5275)

Co-authored-by: Varun Sundar Rabindranath varun@neuralmagic.com

[Bugfix] Fix the CUDA version check for FP8 support in the CUTLASS kernels (vllm-project#5715)
[Frontend] Add FlexibleArgumentParser to support both underscore and dash in names (vllm-project#5718)
[distributed][misc] use fork by default for mp (vllm-project#5669)
[Model] MLPSpeculator speculative decoding support (vllm-project#4947)

Signed-off-by: Thomas Parnell tpa@zurich.ibm.com

Co-authored-by: Thomas Parnell tpa@zurich.ibm.com Co-authored-by: Nick Hill nickhill@us.ibm.com Co-authored-by: Davis Wertheimer Davis.Wertheimer@ibm.com

[Kernel] Add punica dimension for Qwen2 LoRA (vllm-project#5441)
[BugFix] Fix test_phi3v.py (vllm-project#5725)
[Bugfix] Add fully sharded layer for QKVParallelLinearWithLora (vllm-project#5665)

Co-authored-by: Antoni Baum antoni.baum@protonmail.com

[Core][Distributed] add shm broadcast (vllm-project#5399)

Co-authored-by: Cody Yu hao.yu.cody@gmail.com

[Kernel][CPU] Add Quick gelu to CPU (vllm-project#5717)
[Doc] Documentation on supported hardware for quantization methods (vllm-project#5745)
[BugFix] exclude version 1.15.0 for modelscope (vllm-project#5668)
[ci][test] fix ca test in main (vllm-project#5746)
[LoRA] Add support for pinning lora adapters in the LRU cache (vllm-project#5603)
[CI][Hardware][Intel GPU] add Intel GPU(XPU) ci pipeline (vllm-project#5616)
[Model] Support Qwen-VL and Qwen-VL-Chat models with text-only inputs (vllm-project#5710)

Co-authored-by: Roger Wang ywang@roblox.com

[Misc] Remove vllm-project#4789 workaround left in vllm/entrypoints/openai/run_batch.py (vllm-project#5756)
[Bugfix] Fix pin_lora error in TPU executor (vllm-project#5760)
[Docs][TPU] Add installation tip for TPU (vllm-project#5761)
[core][distributed] improve shared memory broadcast (vllm-project#5754)
[BugFix] [Kernel] Add Cutlass2x fallback kernels (vllm-project#5744)

Co-authored-by: Varun Sundar Rabindranath varun@neuralmagic.com

[Distributed] Add send and recv helpers (vllm-project#5719)
[Bugfix] Add phi3v resize for dynamic shape and fix torchvision requirement (vllm-project#5772)
[doc][faq] add warning to download models for every nodes (vllm-project#5783)
post-rebase api adjustments
[Doc] Add "Suggest edit" button to doc pages (vllm-project#5789)
[Doc] Add Phi-3-medium to list of supported models (vllm-project#5788)
[Bugfix] Fix FlexibleArgumentParser replaces _ with - for actual args (vllm-project#5795)
[ci] Remove aws template (vllm-project#5757)

Signed-off-by: kevin kevin@anyscale.com

[Doc] Add notice about breaking changes to VLMs (vllm-project#5818)
[Speculative Decoding] Support draft model on different tensor-parallel size than target model (vllm-project#5414)
add pin_lora to habana components
add WA for model loader
fix api mismatches with ray
tensor parallel fixes
workers cpu alignment fix
[Misc] Remove useless code in cpu_worker (vllm-project#5824)
prefill/decode metadata fixes
[Core] Add fault tolerance for RayTokenizerGroupPool (vllm-project#5748)
re-enable attn metadata trimming
worker_use_ray fix
[doc][distributed] add both gloo and nccl tests (vllm-project#5834)
[CI/Build] Add unit testing for FlexibleArgumentParser (vllm-project#5798)
[Misc] Update w4a16 compressed-tensors support to include w8a16 (vllm-project#5794)
[Hardware][TPU] Refactor TPU backend (vllm-project#5831)
[Hardware][AMD][CI/Build][Doc] Upgrade to ROCm 6.1, Dockerfile improvements, test fixes (vllm-project#5422)
[Hardware][TPU] Raise errors for unsupported sampling params (vllm-project#5850)
[CI/Build] Add E2E tests for MLPSpeculator (vllm-project#5791)

Signed-off-by: Thomas Parnell tpa@zurich.ibm.com

[Bugfix] Fix assertion in NeuronExecutor (vllm-project#5841)
[Core] Refactor Worker and ModelRunner to consolidate control plane communication (vllm-project#5408)

Signed-off-by: Stephanie Wang swang@cs.berkeley.edu Signed-off-by: Stephanie swang@anyscale.com Co-authored-by: Stephanie swang@anyscale.com

[Misc][Doc] Add Example of using OpenAI Server with VLM (vllm-project#5832)
[bugfix][distributed] fix shm broadcast when the queue size is full (vllm-project#5801)
[Bugfix] Fix embedding to support 2D inputs (vllm-project#5829)
[Bugfix][TPU] Fix KV cache size calculation (vllm-project#5860)
[CI/Build] Refactor image test assets (vllm-project#5821)
[Kernel] Adding bias epilogue support for cutlass_scaled_mm (vllm-project#5560)

Co-authored-by: Chih-Chieh-Yang 7364402+cyang49@users.noreply.github.com Co-authored-by: Lucas Wilkinson lwilkinson@neuralmagic.com

[Frontend] Add tokenize/detokenize endpoints (vllm-project#5054)
[Hardware][TPU] Support parallel sampling & Swapping (vllm-project#5855)
[Bugfix][TPU] Fix CPU cache allocation (vllm-project#5869)
Support CPU inference with VSX PowerPC ISA (vllm-project#5652)
[doc] update usage of env var to avoid conflict (vllm-project#5873)
[Misc] Add example for LLaVA-NeXT (vllm-project#5879)
[BugFix] Fix cuda graph for MLPSpeculator (vllm-project#5875)

Co-authored-by: Abhinav Goyal abhinav.goyal@flipkart.com

[Doc] Add note about context length in Phi-3-Vision example (vllm-project#5887)
[VLM][Bugfix] Make sure that multi_modal_kwargs is broadcasted properly (vllm-project#5880)

Signed-off-by: Xiaowei Jiang xwjiang2010@gmail.com

[Model] Add base class for LoRA-supported models (vllm-project#5018)
[Bugfix] Fix img_sizes Parsing in Phi3-Vision (vllm-project#5888)
[CI/Build] [1/3] Reorganize entrypoints tests (vllm-project#5526)
add collective crash WA
add comment to the weird mark_step
[Model][Bugfix] Implicit model flags and reenable Phi-3-Vision (vllm-project#5896)
[doc][misc] add note for Kubernetes users (vllm-project#5916)
[BugFix] Fix MLPSpeculator handling of num_speculative_tokens (vllm-project#5876)
[BugFix] Fix min_tokens behaviour for multiple eos tokens (vllm-project#5849)
[CI/Build] Fix Args for _get_logits_warper in Sampler Test (vllm-project#5922)
[Model] Add Gemma 2 (vllm-project#5908)
[core][misc] remove logical block (vllm-project#5882)
[Kernel][ROCm][AMD] fused_moe Triton configs v2 for mi300X (vllm-project#5932)
[Hardware][TPU] Optimize KV cache swapping (vllm-project#5878)
[VLM][BugFix] Make sure that multi_modal_kwargs can broadcast properly with ring buffer. (vllm-project#5905)

Signed-off-by: Xiaowei Jiang xwjiang2010@gmail.com Co-authored-by: Roger Wang ywang@roblox.com

[Bugfix][Hardware][Intel CPU] Fix unpassed multi_modal_kwargs for CPU runner (vllm-project#5956)
[Core] Registry for processing model inputs (vllm-project#5214)

Co-authored-by: ywang96 ywang@roblox.com

Unmark fused_moe config json file as executable (vllm-project#5960)
[Hardware][Intel] OpenVINO vLLM backend (vllm-project#5379)
[Bugfix] Better error message for MLPSpeculator when num_speculative_tokens is set too high (vllm-project#5894)

Signed-off-by: Thomas Parnell tpa@zurich.ibm.com

[CI/Build] [2/3] Reorganize entrypoints tests (vllm-project#5904)
[Distributed] Make it clear that % should not be in tensor dict keys. (vllm-project#5927)

Signed-off-by: Xiaowei Jiang xwjiang2010@gmail.com

[Spec Decode] Introduce DraftModelRunner (vllm-project#5799)
[Bugfix] Fix compute datatype for cutlass 3.x epilogues (vllm-project#5931)
[ Misc ] Remove fp8_shard_indexer from Col/Row Parallel Linear (Simplify Weight Loading) (vllm-project#5928)

Co-authored-by: Robert Shaw <rshaw@neuralmagic>

[ Bugfix ] Enabling Loading Models With Fused QKV/MLP on Disk with FP8 (vllm-project#5921)

Co-authored-by: Robert Shaw <rshaw@neuralmagic>

Support Deepseek-V2 (vllm-project#4650)

Co-authored-by: Philipp Moritz pcmoritz@gmail.com

[Bugfix] Only add Attention.kv_scale if kv cache quantization is enabled (vllm-project#5936)
Unmark more files as executable (vllm-project#5962)
[Bugfix] Fix Engine Failing After Invalid Request - AsyncEngineDeadError (vllm-project#5963)

Co-authored-by: Robert Shaw <rshaw@neuralmagic>

[Kernel] Flashinfer for prefill & decode, with Cudagraph support for decode (vllm-project#4628)

Co-authored-by: LiuXiaoxuanPKU llilyliupku@gmail.com, bong-furiosa bongwon.jang@furiosa.ai

[Bugfix][TPU] Fix TPU sampler output (vllm-project#5978)
[Bugfix][TPU] Fix pad slot id (vllm-project#5977)
[Bugfix] fix missing last itl in openai completions benchmark (vllm-project#5926)
[Misc] Extend vLLM Metrics logging API (vllm-project#5925)

Co-authored-by: Antoni Baum antoni.baum@protonmail.com

[Kernel] Add punica dimensions for Granite 3b and 8b (vllm-project#5930)

Signed-off-by: Joe Runde joe@joerun.de

[Bugfix] Fix precisions in Gemma 1 (vllm-project#5913)
[Misc] Update Phi-3-Vision Example (vllm-project#5981)

Co-authored-by: Cyrus Leung cyrus.tl.leung@gmail.com

[Bugfix] Support eos_token_id from config.json (vllm-project#5954)
[Core] Optimize SequenceStatus.is_finished by switching to IntEnum (vllm-project#5974)
[Kernel] Raise an exception in MoE kernel if the batch size is larger then 65k (vllm-project#5939)
[ CI/Build ] Added E2E Test For Compressed Tensors (vllm-project#5839)

Co-authored-by: Michael Goin michael@neuralmagic.com Co-authored-by: Robert Shaw <rshaw@neuralmagic>

[CI/Build] Add TP test for vision models (vllm-project#5892)
[ CI/Build ] LM Eval Harness Based CI Testing (vllm-project#5838)

Co-authored-by: Robert Shaw <rshaw@neuralmagic>

[Bugfix][CI/Build][Hardware][AMD] Install matching torchvision to fix AMD tests (vllm-project#5949)
[CI/Build] Temporarily Remove Phi3-Vision from TP Test (vllm-project#5989)
[CI/Build] Reuse code for checking output consistency (vllm-project#5988)
[CI/Build] [3/3] Reorganize entrypoints tests (vllm-project#5966)
[ci][distributed] fix device count call

[ci][distributed] fix some cuda init that makes it necessary to use spawn (vllm-project#5991)

[Frontend]: Support base64 embedding (vllm-project#5935)

Co-authored-by: Cyrus Leung cyrus.tl.leung@gmail.com

[Lora] Use safetensor keys instead of adapter_config.json to find unexpected modules. (vllm-project#5909)

Co-authored-by: sang sangcho@anyscale.com

[ CI ] Temporarily Disable Large LM-Eval Tests (vllm-project#6005)

Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic>

[Misc] Fix get_min_capability (vllm-project#5971)
[ Misc ] Refactor w8a8 to use process_weights_after_load (Simplify Weight Loading) (vllm-project#5940)

Co-authored-by: Robert Shaw <rshaw@neuralmagic>

[misc][cuda] use nvml to avoid accidentally cuda initialization (vllm-project#6007)
[Speculative Decoding 2/2 ] Integrate typical acceptance sampler into Spec Decode Worker (vllm-project#5348)
Revert test changes
cleanup
llm engine cleanup
utils.py cleanup
custom ops refactor
move xops to ops
remove vllm/hpu/attn_bias.py
whitespace fix
revert accidental changes in rmsnorm
Fix hpugraph hashing
add trim_attn_metadata comment
fix prompt bucketing:
[ CI ] Re-enable Large Model LM Eval (vllm-project#6031)
[doc][misc] remove deprecated api server in doc (vllm-project#6037)
[Misc] update benchmark backend for scalellm (vllm-project#6018)
[doc][misc] further lower visibility of simple api server (vllm-project#6041)

Co-authored-by: Simon Mo simon.mo@hey.com

[Bugfix] Use RayActorError for older versions of Ray in RayTokenizerGroupPool (vllm-project#6039)
[Bugfix] adding chunking mechanism to fused_moe to handle large inputs (vllm-project#6029)
add FAQ doc under 'serving' (vllm-project#5946)
[Bugfix][Doc] Fix Doc Formatting (vllm-project#6048)
[Bugfix] Add explicit end_forward calls to flashinfer (vllm-project#6044)
[BugFix] Ensure worker model loop is always stopped at the right time (vllm-project#5987)
[Frontend] Relax api url assertion for openai benchmarking (vllm-project#6046)
[Model] Changes to MLPSpeculator to support tie_weights and input_scale (vllm-project#5965)

Signed-off-by: Thomas Parnell tpa@zurich.ibm.com Co-authored-by: Joshua Rosenkranz jmrosenk@us.ibm.com

[Core] Optimize block_manager_v2 vs block_manager_v1 (to make V2 default) (vllm-project#5602)
[Frontend] Add template related params to request (vllm-project#5709)
[VLM] Remove image_input_type from VLM config (vllm-project#5852)

Signed-off-by: Xiaowei Jiang xwjiang2010@gmail.com Co-authored-by: Cyrus Leung cyrus.tl.leung@gmail.com Co-authored-by: Roger Wang ywang@roblox.com

[Doc] Reinstate doc dependencies (vllm-project#6061)
guard model loader wa for hpu

Signed-off-by: Thomas Parnell tpa@zurich.ibm.com Signed-off-by: Lei Wen wenlei03@qiyi.com Signed-off-by: Joe Runde Joseph.Runde@ibm.com Signed-off-by: kevin kevin@anyscale.com Signed-off-by: Rafael Vasquez rafvasq21@gmail.com Signed-off-by: Stephanie Wang swang@cs.berkeley.edu Signed-off-by: Stephanie swang@anyscale.com Signed-off-by: Xiaowei Jiang xwjiang2010@gmail.com Signed-off-by: Joe Runde joe@joerun.de Co-authored-by: Li, Jiang jiang1.li@intel.com Co-authored-by: Jianan Gu jianan.gu@intel.com Co-authored-by: Woosuk Kwon woosuk.kwon@berkeley.edu Co-authored-by: Cyrus Leung tlleungac@connect.ust.hk Co-authored-by: Roger Wang ywang@roblox.com Co-authored-by: Tyler Michael Smith tyler@neuralmagic.com Co-authored-by: Michael Goin michael@neuralmagic.com Co-authored-by: youkaichao youkaichao@gmail.com Co-authored-by: zifeitong zifei.tong@parasail.io Co-authored-by: Robert Shaw 114415538+robertgshaw2-neuralmagic@users.noreply.github.com Co-authored-by: Cody Yu hao.yu.cody@gmail.com Co-authored-by: Philipp Moritz pcmoritz@gmail.com Co-authored-by: Antoni Baum antoni.baum@protonmail.com Co-authored-by: Jie Fu (傅杰) jiefu@tencent.com Co-authored-by: Allen.Dou allen.dou@hotmail.com Co-authored-by: Simon Mo simon.mo@hey.com Co-authored-by: Kuntai Du kuntai@uchicago.edu Co-authored-by: Dipika Sikka dipikasikka1@gmail.com Co-authored-by: Sanger Steel sangersteel@gmail.com Co-authored-by: Thomas Parnell tpa@zurich.ibm.com Co-authored-by: leiwen83 leiwen83@users.noreply.github.com Co-authored-by: Lei Wen wenlei03@qiyi.com Co-authored-by: SangBin Cho rkooo567@gmail.com Co-authored-by: Alexander Matveev 59768536+alexm-neuralmagic@users.noreply.github.com Co-authored-by: Nick Hill nickhill@us.ibm.com Co-authored-by: Amit Garg gargamit@microsoft.com Co-authored-by: Charles Riggins liqianchen123@foxmail.com Co-authored-by: Liqian Chen liqian.chen@deeplang.ai Co-authored-by: zhyncs me@zhyncs.com Co-authored-by: Kunshang Ji kunshang.ji@intel.com Co-authored-by: Abhilash Majumder 
abhilash.majumder@intel.com Co-authored-by: Abhilash Majumder 30946547+abhilash1910@users.noreply.github.com Co-authored-by: Bruce Fontaine bruce@2.7182.net Co-authored-by: zifeitong zifeitong@gmail.com Co-authored-by: sroy745 142070531+sroy745@users.noreply.github.com Co-authored-by: Isotr0py 2037008807@qq.com Co-authored-by: Joe Runde joe@joerun.de Co-authored-by: Chang Su chang.s.su@oracle.com Co-authored-by: Roger Wang 136131678+ywang96@users.noreply.github.com Co-authored-by: Kevin H. Luu kevin@anyscale.com Co-authored-by: Ronen Schaffer ronen.schaffer@ibm.com Co-authored-by: sergey-tinkoff 167607910+sergey-tinkoff@users.noreply.github.com Co-authored-by: milo157 43028253+milo157@users.noreply.github.com Co-authored-by: Shukant Pal SukantK2002@outlook.com Co-authored-by: Hongxia Yang 62075498+hongxiayang@users.noreply.github.com Co-authored-by: DearPlanet junsong.zhang2021.work@outlook.com Co-authored-by: Rafael Vasquez rafvasq21@gmail.com Co-authored-by: Varun Sundar Rabindranath varunsundar08@gmail.com Co-authored-by: Varun Sundar Rabindranath varun@neuralmagic.com Co-authored-by: Joshua Rosenkranz joshua.rosenkranz@gmail.com Co-authored-by: Davis Wertheimer Davis.Wertheimer@ibm.com Co-authored-by: Jinzhen Lin linjinzhen@hotmail.com Co-authored-by: Jee Li pandaleefree@163.com Co-authored-by: rohithkrn rohith.nallamaddi@gmail.com Co-authored-by: Murali Andoorveedu 37849411+andoorve@users.noreply.github.com Co-authored-by: Woo-Yeon Lee wooyeonlee0@gmail.com Co-authored-by: Matt Wong 156021403+mawong-amd@users.noreply.github.com Co-authored-by: aws-patlange 90803007+aws-patlange@users.noreply.github.com Co-authored-by: Stephanie Wang swang@cs.berkeley.edu Co-authored-by: Stephanie swang@anyscale.com Co-authored-by: Luka Govedič ProExpertProg@users.noreply.github.com Co-authored-by: Chih-Chieh-Yang 7364402+cyang49@users.noreply.github.com Co-authored-by: Lucas Wilkinson lwilkinson@neuralmagic.com Co-authored-by: sasha0552 admin@sasha0552.org Co-authored-by: Chip 
Kerchner 49959681+ChipKerchner@users.noreply.github.com Co-authored-by: Abhinav Goyal abhinav.goyal@flipkart.com Co-authored-by: xwjiang2010 87673679+xwjiang2010@users.noreply.github.com Co-authored-by: Divakar Verma 137818590+divakar-amd@users.noreply.github.com Co-authored-by: Ilya Lavrenov ilya.lavrenov@intel.com Co-authored-by: Robert Shaw <rshaw@neuralmagic> Co-authored-by: wangding zeng 155410488+zwd003@users.noreply.github.com Co-authored-by: Lily Liu lilyliupku@gmail.com Co-authored-by: LiuXiaoxuanPKU llilyliupku@gmail.com, bong-furiosa bongwon.jang@furiosa.ai Co-authored-by: mcalman 68564154+mcalman@users.noreply.github.com Co-authored-by: William Lin SolitaryThinker@users.noreply.github.com Co-authored-by: Cyrus Leung cyrus.tl.leung@gmail.com Co-authored-by: llmpros 10524065+llmpros@users.noreply.github.com Co-authored-by: sang sangcho@anyscale.com Co-authored-by: Avshalom Manevich 12231371+avshalomman@users.noreply.github.com Co-authored-by: James Whedbee jamesw@telnyx.com Co-authored-by: Joshua Rosenkranz jmrosenk@us.ibm.com Co-authored-by: danieljannai21 100521221+danieljannai21@users.noreply.github.com

xjpang pushed a commit to xjpang/vllm that referenced this pull request

Jul 8, 2024

Co-authored-by: Michael Goin michael@neuralmagic.com Co-authored-by: Robert Shaw <rshaw@neuralmagic>

xjpang pushed a commit to xjpang/vllm that referenced this pull request

Jul 24, 2024

Co-authored-by: Michael Goin michael@neuralmagic.com Co-authored-by: Robert Shaw <rshaw@neuralmagic>

Alvant pushed a commit to compressa-ai/vllm that referenced this pull request

Oct 26, 2024

Co-authored-by: Michael Goin michael@neuralmagic.com Co-authored-by: Robert Shaw <rshaw@neuralmagic> Signed-off-by: Alvant alvasian@yandex.ru
