IST Austria Distributed Algorithms and Systems Lab

  1. Code for the ICLR 2023 paper "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers".
    Python · 2.1k stars · 167 forks
  2. FP16xINT4 LLM inference kernel that achieves near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
    Python · 817 stars · 66 forks
  3. Code for the ICML 2023 paper "SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot".
    Python · 791 stars · 102 forks
  4. Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models".
    Python · 274 stars · 21 forks
  5. Repository for the QUIK project, which enables 4-bit kernels for generative inference (EMNLP 2024).
    C++ · 180 stars · 13 forks
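Several of these projects (GPTQ, QMoE, QUIK) revolve around low-bit weight quantization. As background only, here is a minimal sketch of symmetric round-to-nearest 4-bit quantization, the simple baseline that methods like GPTQ improve on; the function names and structure are illustrative and do not come from any of these repositories:

```python
def quantize_rtn_4bit(weights):
    """Symmetric round-to-nearest 4-bit quantization of a list of floats.

    Maps each weight to an integer in [-7, 7] using a single per-tensor
    scale, then returns the integer codes and the scale. (Illustrative
    sketch; real implementations use per-group scales and tensor ops.)
    """
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 7.0  # symmetric signed 4-bit range: -7..7
    q = [max(-7, min(7, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Reconstruct approximate float weights from codes and scale."""
    return [x * scale for x in q]


weights = [0.12, -0.5, 0.33, 0.9, -0.07]
q, scale = quantize_rtn_4bit(weights)
approx = dequantize(q, scale)
# Round-to-nearest guarantees each reconstructed weight lies within
# scale/2 of the original; GPTQ-style methods instead minimize layer
# output error, which matters at these very low bit-widths.
```

The appeal of 4-bit storage is that each weight fits in half a byte, which is what makes FP16xINT4 kernels (item 2) and sub-1-bit MoE compression (item 4) interesting targets.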