[Misc] Update `w4a16` `compressed-tensors` support to include `w8a16` by dsikka · Pull Request #5794 · vllm-project/vllm

kzawora-intel added a commit to HabanaAI/vllm-fork that referenced this pull request

[Hardware][Intel] Optimize CPU backend and add more performance tips (vllm-project#4971)

Co-authored-by: Jianan Gu jianan.gu@intel.com

[Docs] Add 4th meetup slides (vllm-project#5509)
[Misc] Add vLLM version getter to utils (vllm-project#5098)
[CI/Build] Simplify OpenAI server setup in tests (vllm-project#5100)
[Doc] Update LLaVA docs (vllm-project#5437)

Co-authored-by: Roger Wang ywang@roblox.com

[Kernel] Factor out epilogues from cutlass kernels (vllm-project#5391)

Co-authored-by: Michael Goin michael@neuralmagic.com Co-authored-by: youkaichao youkaichao@gmail.com Co-authored-by: zifeitong zifei.tong@parasail.io Co-authored-by: Robert Shaw 114415538+robertgshaw2-neuralmagic@users.noreply.github.com

[MISC] Remove FP8 warning (vllm-project#5472)

Co-authored-by: Philipp Moritz pcmoritz@gmail.com

Separate dev requirements into lint and test (vllm-project#5474)
Revert "[Core] Remove unnecessary copies in flash attn backend" (vllm-project#5478)
[misc] fix format.sh (vllm-project#5511)
[CI/Build] Disable test_fp8.py (vllm-project#5508)
[Kernel] Disable CUTLASS kernels for fp8 (vllm-project#5505)
Add cuda_device_count_stateless (vllm-project#5473)
[Hardware][Intel] Support CPU inference with AVX2 ISA (vllm-project#5452)
[Misc] Fix arg names in quantizer script (vllm-project#5507)
bump version to v0.5.0.post1 (vllm-project#5522)
[CI/Build][Misc] Add CI that benchmarks vllm performance on those PRs with perf-benchmarks label (vllm-project#5073)

Co-authored-by: simon-mo simon.mo@hey.com

[CI/Build] Disable LLaVA-NeXT CPU test (vllm-project#5529)
[Kernel] Fix CUTLASS 3.x custom broadcast load epilogue (vllm-project#5516)
[Misc] Fix arg names (vllm-project#5524)
[ Misc ] Rs/compressed tensors cleanup (vllm-project#5432)

Co-authored-by: mgoin michael@neuralmagic.com Co-authored-by: Dipika Sikka dipikasikka1@gmail.com

[Kernel] Suppress mma.sp warning on CUDA 12.5 and later (vllm-project#5401)
[mis] fix flaky test of test_cuda_device_count_stateless (vllm-project#5546)
[Core] Remove duplicate processing in async engine (vllm-project#5525)
[misc][distributed] fix benign error in is_in_the_same_node (vllm-project#5512)
[Docs] Add ZhenFund as a Sponsor (vllm-project#5548)
[Doc] Update documentation on Tensorizer (vllm-project#5471)
[Bugfix] Enable loading FP8 checkpoints for gpt_bigcode models (vllm-project#5460)

Signed-off-by: Thomas Parnell tpa@zurich.ibm.com

[Bugfix] Fix typo in Pallas backend (vllm-project#5558)
[Core][Distributed] improve p2p cache generation (vllm-project#5528)
Add ccache to amd (vllm-project#5555)
[Core][Bugfix]: fix prefix caching for blockv2 (vllm-project#5364)

Signed-off-by: Lei Wen wenlei03@qiyi.com Co-authored-by: Lei Wen wenlei03@qiyi.com

[mypy] Enable type checking for test directory (vllm-project#5017)
[CI/Build] Test both text and token IDs in batched OpenAI Completions API (vllm-project#5568)
[misc] Do not allow to use lora with chunked prefill. (vllm-project#5538)

Co-authored-by: Cyrus Leung tlleungac@connect.ust.hk

add gptq_marlin test for bug report vllm-project#5088 (vllm-project#5145)
[BugFix] Don't start a Ray cluster when not using Ray (vllm-project#5570)
[Fix] Correct OpenAI batch response format (vllm-project#5554)
Add basic correctness 2 GPU tests to 4 GPU pipeline (vllm-project#5518)
[CI][BugFix] Flip is_quant_method_supported condition (vllm-project#5577)
[build][misc] limit numpy version (vllm-project#5582)
[Doc] add debugging tips for crash and multi-node debugging (vllm-project#5581)
Fix w8a8 benchmark and add Llama-3-8B (vllm-project#5562)
[Model] Rename Phi3 rope scaling type (vllm-project#5595)
Correct alignment in the seq_len diagram. (vllm-project#5592)

Co-authored-by: Liqian Chen liqian.chen@deeplang.ai

[Kernel] compressed-tensors marlin 24 support (vllm-project#5435)
[Misc] use AutoTokenizer for benchmark serving when vLLM not installed (vllm-project#5588)
[Hardware][Intel GPU] Add Intel GPU(XPU) inference backend (vllm-project#3814)

Co-authored-by: Jiang Li jiang1.li@intel.com Co-authored-by: Abhilash Majumder abhilash.majumder@intel.com Co-authored-by: Abhilash Majumder 30946547+abhilash1910@users.noreply.github.com

[CI/BUILD] Support non-AVX512 vLLM building and testing (vllm-project#5574)
[CI] the readability of benchmarking and prepare for dashboard (vllm-project#5571)

[CI] Improve the readability of performance benchmarking results and prepare for upcoming performance dashboard (vllm-project#5571)

[bugfix][distributed] fix 16 gpus local rank arrangement (vllm-project#5604)
[Optimization] use a pool to reuse LogicalTokenBlock.token_ids (vllm-project#5584)
[Bugfix] Fix KV head calculation for MPT models when using GQA (vllm-project#5142)
[Fix] Use utf-8 encoding in entrypoints/openai/run_batch.py (vllm-project#5606)
[Speculative Decoding 1/2 ] Add typical acceptance sampling as one of the sampling techniques in the verifier (vllm-project#5131)
[Model] Initialize Phi-3-vision support (vllm-project#4986)
[Kernel] Add punica dimensions for Granite 13b (vllm-project#5559)

Signed-off-by: Joe Runde Joseph.Runde@ibm.com

[misc][typo] fix typo (vllm-project#5620)
[Misc] Fix typo (vllm-project#5618)
[CI] Avoid naming different metrics with the same name in performance benchmark (vllm-project#5615)
[bugfix][distributed] improve p2p capability test (vllm-project#5612)

[bugfix][distributed] do not error if two processes do not agree on p2p capability (vllm-project#5612)

[Misc] Remove import from transformers logging (vllm-project#5625)
[CI/Build][Misc] Update Pytest Marker for VLMs (vllm-project#5623)
[ci] Deprecate original CI template (vllm-project#5624)

Signed-off-by: kevin kevin@anyscale.com

[Misc] Add OpenTelemetry support (vllm-project#4687)

This PR adds basic support for OpenTelemetry distributed tracing. It includes changes to enable tracing functionality and improve monitoring capabilities.

I've also added a Markdown guide with screenshots that shows users how to use this feature.

[Misc] Add channel-wise quantization support for w8a8 dynamic per token activation quantization (vllm-project#5542)
[ci] Setup Release pipeline and build release wheels with cache (vllm-project#5610)

Signed-off-by: kevin kevin@anyscale.com

[Model] LoRA support added for command-r (vllm-project#5178)
[Bugfix] Fix for inconsistent behaviour related to sampling and repetition penalties (vllm-project#5639)

Signed-off-by: Thomas Parnell tpa@zurich.ibm.com

[Doc] Added cerebrium as Integration option (vllm-project#5553)
[Bugfix] Fix CUDA version check for mma warning suppression (vllm-project#5642)
[Bugfix] Fix w8a8 benchmarks for int8 case (vllm-project#5643)
[Bugfix] Fix Phi-3 Long RoPE scaling implementation (vllm-project#5628)
[Bugfix] Added test for sampling repetition penalty bug. (vllm-project#5659)

Signed-off-by: Thomas Parnell tpa@zurich.ibm.com

[Bugfix][CI/Build][AMD][ROCm]Fixed the cmake build bug which generate garbage on certain devices (vllm-project#5641)
[misc][distributed] use 127.0.0.1 for single-node (vllm-project#5619)
[Model] Add FP8 kv cache for Qwen2 (vllm-project#5656)
[Bugfix] Fix sampling_params passed incorrectly in Phi3v example (vllm-project#5684)
[Misc]Add param max-model-len in benchmark_latency.py (vllm-project#5629)
[CI/Build] Add tqdm to dependencies (vllm-project#5680)
[ci] Add A100 queue into AWS CI template (vllm-project#5648)

Signed-off-by: kevin kevin@anyscale.com

[Frontend][Bugfix] Fix preemption_mode -> preemption-mode for CLI arg in arg_utils.py (vllm-project#5688)
[ci][distributed] add tests for custom allreduce (vllm-project#5689)
[Bugfix] AsyncLLMEngine hangs with asyncio.run (vllm-project#5654)
[Doc] Update docker references (vllm-project#5614)

Signed-off-by: Rafael Vasquez rafvasq21@gmail.com

[Misc] Add per channel support for static activation quantization; update w8a8 schemes to share base classes (vllm-project#5650)
[ci] Limit num gpus if specified for A100 (vllm-project#5694)

Signed-off-by: kevin kevin@anyscale.com

[Misc] Improve conftest (vllm-project#5681)
[Bugfix][Doc] Fix Duplicate Explicit Target Name Errors (vllm-project#5703)
[Kernel] Update Cutlass int8 kernel configs for SM90 (vllm-project#5514)

Co-authored-by: Varun Sundar Rabindranath varun@neuralmagic.com

[Model] Port over CLIPVisionModel for VLMs (vllm-project#5591)
[Kernel] Update Cutlass int8 kernel configs for SM80 (vllm-project#5275)

Co-authored-by: Varun Sundar Rabindranath varun@neuralmagic.com

[Bugfix] Fix the CUDA version check for FP8 support in the CUTLASS kernels (vllm-project#5715)
[Frontend] Add FlexibleArgumentParser to support both underscore and dash in names (vllm-project#5718)
[distributed][misc] use fork by default for mp (vllm-project#5669)
[Model] MLPSpeculator speculative decoding support (vllm-project#4947)

Signed-off-by: Thomas Parnell tpa@zurich.ibm.com

Co-authored-by: Thomas Parnell tpa@zurich.ibm.com Co-authored-by: Nick Hill nickhill@us.ibm.com Co-authored-by: Davis Wertheimer Davis.Wertheimer@ibm.com

[Kernel] Add punica dimension for Qwen2 LoRA (vllm-project#5441)
[BugFix] Fix test_phi3v.py (vllm-project#5725)
[Bugfix] Add fully sharded layer for QKVParallelLinearWithLora (vllm-project#5665)

Co-authored-by: Antoni Baum antoni.baum@protonmail.com

[Core][Distributed] add shm broadcast (vllm-project#5399)

Co-authored-by: Cody Yu hao.yu.cody@gmail.com

[Kernel][CPU] Add Quick gelu to CPU (vllm-project#5717)
[Doc] Documentation on supported hardware for quantization methods (vllm-project#5745)
[BugFix] exclude version 1.15.0 for modelscope (vllm-project#5668)
[ci][test] fix ca test in main (vllm-project#5746)
[LoRA] Add support for pinning lora adapters in the LRU cache (vllm-project#5603)
[CI][Hardware][Intel GPU] add Intel GPU(XPU) ci pipeline (vllm-project#5616)
[Model] Support Qwen-VL and Qwen-VL-Chat models with text-only inputs (vllm-project#5710)

Co-authored-by: Roger Wang ywang@roblox.com

[Misc] Remove vllm-project#4789 workaround left in vllm/entrypoints/openai/run_batch.py (vllm-project#5756)
[Bugfix] Fix pin_lora error in TPU executor (vllm-project#5760)
[Docs][TPU] Add installation tip for TPU (vllm-project#5761)
[core][distributed] improve shared memory broadcast (vllm-project#5754)
[BugFix] [Kernel] Add Cutlass2x fallback kernels (vllm-project#5744)

Co-authored-by: Varun Sundar Rabindranath varun@neuralmagic.com

[Distributed] Add send and recv helpers (vllm-project#5719)
[Bugfix] Add phi3v resize for dynamic shape and fix torchvision requirement (vllm-project#5772)
[doc][faq] add warning to download models for every nodes (vllm-project#5783)
post-rebase api adjustments
[Doc] Add "Suggest edit" button to doc pages (vllm-project#5789)
[Doc] Add Phi-3-medium to list of supported models (vllm-project#5788)
[Bugfix] Fix FlexibleArgumentParser replaces _ with - for actual args (vllm-project#5795)
[ci] Remove aws template (vllm-project#5757)

Signed-off-by: kevin kevin@anyscale.com

[Doc] Add notice about breaking changes to VLMs (vllm-project#5818)
[Speculative Decoding] Support draft model on different tensor-parallel size than target model (vllm-project#5414)
add pin_lora to habana components
add WA for model loader
fix api mismatches with ray
tensor parallel fixes
workers cpu alignment fix
[Misc] Remove useless code in cpu_worker (vllm-project#5824)
prefill/decode metadata fixes
[Core] Add fault tolerance for RayTokenizerGroupPool (vllm-project#5748)
re-enable attn metadata trimming
worker_use_ray fix
[doc][distributed] add both gloo and nccl tests (vllm-project#5834)
[CI/Build] Add unit testing for FlexibleArgumentParser (vllm-project#5798)
[Misc] Update w4a16 compressed-tensors support to include w8a16 (vllm-project#5794)
[Hardware][TPU] Refactor TPU backend (vllm-project#5831)
[Hardware][AMD][CI/Build][Doc] Upgrade to ROCm 6.1, Dockerfile improvements, test fixes (vllm-project#5422)
[Hardware][TPU] Raise errors for unsupported sampling params (vllm-project#5850)
[CI/Build] Add E2E tests for MLPSpeculator (vllm-project#5791)

Signed-off-by: Thomas Parnell tpa@zurich.ibm.com

[Bugfix] Fix assertion in NeuronExecutor (vllm-project#5841)
[Core] Refactor Worker and ModelRunner to consolidate control plane communication (vllm-project#5408)

Signed-off-by: Stephanie Wang swang@cs.berkeley.edu Signed-off-by: Stephanie swang@anyscale.com Co-authored-by: Stephanie swang@anyscale.com

[Misc][Doc] Add Example of using OpenAI Server with VLM (vllm-project#5832)
[bugfix][distributed] fix shm broadcast when the queue size is full (vllm-project#5801)
[Bugfix] Fix embedding to support 2D inputs (vllm-project#5829)
[Bugfix][TPU] Fix KV cache size calculation (vllm-project#5860)
[CI/Build] Refactor image test assets (vllm-project#5821)
[Kernel] Adding bias epilogue support for cutlass_scaled_mm (vllm-project#5560)

Co-authored-by: Chih-Chieh-Yang 7364402+cyang49@users.noreply.github.com Co-authored-by: Lucas Wilkinson lwilkinson@neuralmagic.com

[Frontend] Add tokenize/detokenize endpoints (vllm-project#5054)
[Hardware][TPU] Support parallel sampling & Swapping (vllm-project#5855)
[Bugfix][TPU] Fix CPU cache allocation (vllm-project#5869)
Support CPU inference with VSX PowerPC ISA (vllm-project#5652)
[doc] update usage of env var to avoid conflict (vllm-project#5873)
[Misc] Add example for LLaVA-NeXT (vllm-project#5879)
[BugFix] Fix cuda graph for MLPSpeculator (vllm-project#5875)

Co-authored-by: Abhinav Goyal abhinav.goyal@flipkart.com

[Doc] Add note about context length in Phi-3-Vision example (vllm-project#5887)
[VLM][Bugfix] Make sure that multi_modal_kwargs is broadcasted properly (vllm-project#5880)

Signed-off-by: Xiaowei Jiang xwjiang2010@gmail.com

[Model] Add base class for LoRA-supported models (vllm-project#5018)
[Bugfix] Fix img_sizes Parsing in Phi3-Vision (vllm-project#5888)
[CI/Build] [1/3] Reorganize entrypoints tests (vllm-project#5526)
add collective crash WA
add comment to the weird mark_step
[Model][Bugfix] Implicit model flags and reenable Phi-3-Vision (vllm-project#5896)
[doc][misc] add note for Kubernetes users (vllm-project#5916)
[BugFix] Fix MLPSpeculator handling of num_speculative_tokens (vllm-project#5876)
[BugFix] Fix min_tokens behaviour for multiple eos tokens (vllm-project#5849)
[CI/Build] Fix Args for _get_logits_warper in Sampler Test (vllm-project#5922)
[Model] Add Gemma 2 (vllm-project#5908)
[core][misc] remove logical block (vllm-project#5882)
[Kernel][ROCm][AMD] fused_moe Triton configs v2 for mi300X (vllm-project#5932)
[Hardware][TPU] Optimize KV cache swapping (vllm-project#5878)
[VLM][BugFix] Make sure that multi_modal_kwargs can broadcast properly with ring buffer. (vllm-project#5905)

Signed-off-by: Xiaowei Jiang xwjiang2010@gmail.com Co-authored-by: Roger Wang ywang@roblox.com

[Bugfix][Hardware][Intel CPU] Fix unpassed multi_modal_kwargs for CPU runner (vllm-project#5956)
[Core] Registry for processing model inputs (vllm-project#5214)

Co-authored-by: ywang96 ywang@roblox.com

Unmark fused_moe config json file as executable (vllm-project#5960)
[Hardware][Intel] OpenVINO vLLM backend (vllm-project#5379)
[Bugfix] Better error message for MLPSpeculator when num_speculative_tokens is set too high (vllm-project#5894)

Signed-off-by: Thomas Parnell tpa@zurich.ibm.com

[CI/Build] [2/3] Reorganize entrypoints tests (vllm-project#5904)
[Distributed] Make it clear that % should not be in tensor dict keys. (vllm-project#5927)

Signed-off-by: Xiaowei Jiang xwjiang2010@gmail.com

[Spec Decode] Introduce DraftModelRunner (vllm-project#5799)
[Bugfix] Fix compute datatype for cutlass 3.x epilogues (vllm-project#5931)
[ Misc ] Remove fp8_shard_indexer from Col/Row Parallel Linear (Simplify Weight Loading) (vllm-project#5928)

Co-authored-by: Robert Shaw <rshaw@neuralmagic>

[ Bugfix ] Enabling Loading Models With Fused QKV/MLP on Disk with FP8 (vllm-project#5921)

Co-authored-by: Robert Shaw <rshaw@neuralmagic>

Support Deepseek-V2 (vllm-project#4650)

Co-authored-by: Philipp Moritz pcmoritz@gmail.com

[Bugfix] Only add Attention.kv_scale if kv cache quantization is enabled (vllm-project#5936)
Unmark more files as executable (vllm-project#5962)
[Bugfix] Fix Engine Failing After Invalid Request - AsyncEngineDeadError (vllm-project#5963)

Co-authored-by: Robert Shaw <rshaw@neuralmagic>

[Kernel] Flashinfer for prefill & decode, with Cudagraph support for decode (vllm-project#4628)

Co-authored-by: LiuXiaoxuanPKU llilyliupku@gmail.com, bong-furiosa bongwon.jang@furiosa.ai

[Bugfix][TPU] Fix TPU sampler output (vllm-project#5978)
[Bugfix][TPU] Fix pad slot id (vllm-project#5977)
[Bugfix] fix missing last itl in openai completions benchmark (vllm-project#5926)
[Misc] Extend vLLM Metrics logging API (vllm-project#5925)

Co-authored-by: Antoni Baum antoni.baum@protonmail.com

[Kernel] Add punica dimensions for Granite 3b and 8b (vllm-project#5930)

Signed-off-by: Joe Runde joe@joerun.de

[Bugfix] Fix precisions in Gemma 1 (vllm-project#5913)
[Misc] Update Phi-3-Vision Example (vllm-project#5981)

Co-authored-by: Cyrus Leung cyrus.tl.leung@gmail.com

[Bugfix] Support eos_token_id from config.json (vllm-project#5954)
[Core] Optimize SequenceStatus.is_finished by switching to IntEnum (vllm-project#5974)
[Kernel] Raise an exception in MoE kernel if the batch size is larger than 65k (vllm-project#5939)
[ CI/Build ] Added E2E Test For Compressed Tensors (vllm-project#5839)

Co-authored-by: Michael Goin michael@neuralmagic.com Co-authored-by: Robert Shaw <rshaw@neuralmagic>

[CI/Build] Add TP test for vision models (vllm-project#5892)
[ CI/Build ] LM Eval Harness Based CI Testing (vllm-project#5838)

Co-authored-by: Robert Shaw <rshaw@neuralmagic>

[Bugfix][CI/Build][Hardware][AMD] Install matching torchvision to fix AMD tests (vllm-project#5949)
[CI/Build] Temporarily Remove Phi3-Vision from TP Test (vllm-project#5989)
[CI/Build] Reuse code for checking output consistency (vllm-project#5988)
[CI/Build] [3/3] Reorganize entrypoints tests (vllm-project#5966)
[ci][distributed] fix device count call

[ci][distributed] fix some cuda init that makes it necessary to use spawn (vllm-project#5991)

[Frontend]: Support base64 embedding (vllm-project#5935)

Co-authored-by: Cyrus Leung cyrus.tl.leung@gmail.com

[Lora] Use safetensor keys instead of adapter_config.json to find unexpected modules. (vllm-project#5909)

Co-authored-by: sang sangcho@anyscale.com

[ CI ] Temporarily Disable Large LM-Eval Tests (vllm-project#6005)

Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic>

[Misc] Fix get_min_capability (vllm-project#5971)
[ Misc ] Refactor w8a8 to use process_weights_after_load (Simplify Weight Loading) (vllm-project#5940)

Co-authored-by: Robert Shaw <rshaw@neuralmagic>

[misc][cuda] use nvml to avoid accidentally cuda initialization (vllm-project#6007)
[Speculative Decoding 2/2 ] Integrate typical acceptance sampler into Spec Decode Worker (vllm-project#5348)
Revert test changes
cleanup
llm engine cleanup
utils.py cleanup
custom ops refactor
move xops to ops
remove vllm/hpu/attn_bias.py
whitespace fix
revert accidental changes in rmsnorm
Fix hpugraph hashing
add trim_attn_metadata comment
fix prompt bucketing:
[ CI ] Re-enable Large Model LM Eval (vllm-project#6031)
[doc][misc] remove deprecated api server in doc (vllm-project#6037)
[Misc] update benchmark backend for scalellm (vllm-project#6018)
[doc][misc] further lower visibility of simple api server (vllm-project#6041)

Co-authored-by: Simon Mo simon.mo@hey.com

[Bugfix] Use RayActorError for older versions of Ray in RayTokenizerGroupPool (vllm-project#6039)
[Bugfix] adding chunking mechanism to fused_moe to handle large inputs (vllm-project#6029)
add FAQ doc under 'serving' (vllm-project#5946)
[Bugfix][Doc] Fix Doc Formatting (vllm-project#6048)
[Bugfix] Add explicit end_forward calls to flashinfer (vllm-project#6044)
[BugFix] Ensure worker model loop is always stopped at the right time (vllm-project#5987)
[Frontend] Relax api url assertion for openai benchmarking (vllm-project#6046)
[Model] Changes to MLPSpeculator to support tie_weights and input_scale (vllm-project#5965)

Signed-off-by: Thomas Parnell tpa@zurich.ibm.com Co-authored-by: Joshua Rosenkranz jmrosenk@us.ibm.com

[Core] Optimize block_manager_v2 vs block_manager_v1 (to make V2 default) (vllm-project#5602)
[Frontend] Add template related params to request (vllm-project#5709)
[VLM] Remove image_input_type from VLM config (vllm-project#5852)

Signed-off-by: Xiaowei Jiang xwjiang2010@gmail.com Co-authored-by: Cyrus Leung cyrus.tl.leung@gmail.com Co-authored-by: Roger Wang ywang@roblox.com

[Doc] Reinstate doc dependencies (vllm-project#6061)
guard model loader wa for hpu

Signed-off-by: Thomas Parnell tpa@zurich.ibm.com
Signed-off-by: Lei Wen wenlei03@qiyi.com
Signed-off-by: Joe Runde Joseph.Runde@ibm.com
Signed-off-by: kevin kevin@anyscale.com
Signed-off-by: Rafael Vasquez rafvasq21@gmail.com
Signed-off-by: Stephanie Wang swang@cs.berkeley.edu
Signed-off-by: Stephanie swang@anyscale.com
Signed-off-by: Xiaowei Jiang xwjiang2010@gmail.com
Signed-off-by: Joe Runde joe@joerun.de
Co-authored-by: Li, Jiang jiang1.li@intel.com
Co-authored-by: Jianan Gu jianan.gu@intel.com
Co-authored-by: Woosuk Kwon woosuk.kwon@berkeley.edu
Co-authored-by: Cyrus Leung tlleungac@connect.ust.hk
Co-authored-by: Roger Wang ywang@roblox.com
Co-authored-by: Tyler Michael Smith tyler@neuralmagic.com
Co-authored-by: Michael Goin michael@neuralmagic.com
Co-authored-by: youkaichao youkaichao@gmail.com
Co-authored-by: zifeitong zifei.tong@parasail.io
Co-authored-by: Robert Shaw 114415538+robertgshaw2-neuralmagic@users.noreply.github.com
Co-authored-by: Cody Yu hao.yu.cody@gmail.com
Co-authored-by: Philipp Moritz pcmoritz@gmail.com
Co-authored-by: Antoni Baum antoni.baum@protonmail.com
Co-authored-by: Jie Fu (傅杰) jiefu@tencent.com
Co-authored-by: Allen.Dou allen.dou@hotmail.com
Co-authored-by: Simon Mo simon.mo@hey.com
Co-authored-by: Kuntai Du kuntai@uchicago.edu
Co-authored-by: Dipika Sikka dipikasikka1@gmail.com
Co-authored-by: Sanger Steel sangersteel@gmail.com
Co-authored-by: Thomas Parnell tpa@zurich.ibm.com
Co-authored-by: leiwen83 leiwen83@users.noreply.github.com
Co-authored-by: Lei Wen wenlei03@qiyi.com
Co-authored-by: SangBin Cho rkooo567@gmail.com
Co-authored-by: Alexander Matveev 59768536+alexm-neuralmagic@users.noreply.github.com
Co-authored-by: Nick Hill nickhill@us.ibm.com
Co-authored-by: Amit Garg gargamit@microsoft.com
Co-authored-by: Charles Riggins liqianchen123@foxmail.com
Co-authored-by: Liqian Chen liqian.chen@deeplang.ai
Co-authored-by: zhyncs me@zhyncs.com
Co-authored-by: Kunshang Ji kunshang.ji@intel.com
Co-authored-by: Abhilash Majumder abhilash.majumder@intel.com
Co-authored-by: Abhilash Majumder 30946547+abhilash1910@users.noreply.github.com
Co-authored-by: Bruce Fontaine bruce@2.7182.net
Co-authored-by: zifeitong zifeitong@gmail.com
Co-authored-by: sroy745 142070531+sroy745@users.noreply.github.com
Co-authored-by: Isotr0py 2037008807@qq.com
Co-authored-by: Joe Runde joe@joerun.de
Co-authored-by: Chang Su chang.s.su@oracle.com
Co-authored-by: Roger Wang 136131678+ywang96@users.noreply.github.com
Co-authored-by: Kevin H. Luu kevin@anyscale.com
Co-authored-by: Ronen Schaffer ronen.schaffer@ibm.com
Co-authored-by: sergey-tinkoff 167607910+sergey-tinkoff@users.noreply.github.com
Co-authored-by: milo157 43028253+milo157@users.noreply.github.com
Co-authored-by: Shukant Pal SukantK2002@outlook.com
Co-authored-by: Hongxia Yang 62075498+hongxiayang@users.noreply.github.com
Co-authored-by: DearPlanet junsong.zhang2021.work@outlook.com
Co-authored-by: Rafael Vasquez rafvasq21@gmail.com
Co-authored-by: Varun Sundar Rabindranath varunsundar08@gmail.com
Co-authored-by: Varun Sundar Rabindranath varun@neuralmagic.com
Co-authored-by: Joshua Rosenkranz joshua.rosenkranz@gmail.com
Co-authored-by: Davis Wertheimer Davis.Wertheimer@ibm.com
Co-authored-by: Jinzhen Lin linjinzhen@hotmail.com
Co-authored-by: Jee Li pandaleefree@163.com
Co-authored-by: rohithkrn rohith.nallamaddi@gmail.com
Co-authored-by: Murali Andoorveedu 37849411+andoorve@users.noreply.github.com
Co-authored-by: Woo-Yeon Lee wooyeonlee0@gmail.com
Co-authored-by: Matt Wong 156021403+mawong-amd@users.noreply.github.com
Co-authored-by: aws-patlange 90803007+aws-patlange@users.noreply.github.com
Co-authored-by: Stephanie Wang swang@cs.berkeley.edu
Co-authored-by: Stephanie swang@anyscale.com
Co-authored-by: Luka Govedič ProExpertProg@users.noreply.github.com
Co-authored-by: Chih-Chieh-Yang 7364402+cyang49@users.noreply.github.com
Co-authored-by: Lucas Wilkinson lwilkinson@neuralmagic.com
Co-authored-by: sasha0552 admin@sasha0552.org
Co-authored-by: Chip Kerchner 49959681+ChipKerchner@users.noreply.github.com
Co-authored-by: Abhinav Goyal abhinav.goyal@flipkart.com
Co-authored-by: xwjiang2010 87673679+xwjiang2010@users.noreply.github.com
Co-authored-by: Divakar Verma 137818590+divakar-amd@users.noreply.github.com
Co-authored-by: Ilya Lavrenov ilya.lavrenov@intel.com
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
Co-authored-by: wangding zeng 155410488+zwd003@users.noreply.github.com
Co-authored-by: Lily Liu lilyliupku@gmail.com
Co-authored-by: LiuXiaoxuanPKU llilyliupku@gmail.com, bong-furiosa bongwon.jang@furiosa.ai
Co-authored-by: mcalman 68564154+mcalman@users.noreply.github.com
Co-authored-by: William Lin SolitaryThinker@users.noreply.github.com
Co-authored-by: Cyrus Leung cyrus.tl.leung@gmail.com
Co-authored-by: llmpros 10524065+llmpros@users.noreply.github.com
Co-authored-by: sang sangcho@anyscale.com
Co-authored-by: Avshalom Manevich 12231371+avshalomman@users.noreply.github.com
Co-authored-by: James Whedbee jamesw@telnyx.com
Co-authored-by: Joshua Rosenkranz jmrosenk@us.ibm.com
Co-authored-by: danieljannai21 100521221+danieljannai21@users.noreply.github.com
