[Guide] Quantize LLM CoreML to int8 on Mac ARM (TinyLlama, May 2025, tested workflow & script)
Hi all,
Like many devs here, I spent a lot of time looking for a working, recent, and clear pipeline to quantize CoreML LLM models (.mlpackage) to int8 on Mac ARM (Apple Silicon).
Most existing guides are out of date, broken, or don't cover the current coremltools (8.x) + Python 3.11+ + Apple Silicon environment.
What works (May 2025, tested!)
- Platform: Mac ARM (M1/M2/M3), Python 3.11+, coremltools 8.3.0+ (see the environment check right after this list)
- Model: TinyLlama-1.1B-Chat-v0.3-CoreML (but same for many CoreML .mlpackage LLMs)
- Result: Quantized int8 model for on-device iOS/macOS inference, file size drop, less RAM
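Before running the script, a quick environment check can save time chasing install issues. This is just a minimal sketch, assuming coremltools is already installed in your Python 3.11+ virtualenv:

```python
import platform
import sys

import coremltools as ct

# The optimize.coreml quantization API used below needs coremltools 8.x;
# this workflow was tested on Apple Silicon (arm64) with Python 3.11+.
print(f"Python:      {sys.version.split()[0]}")
print(f"Machine:     {platform.machine()}")   # expect 'arm64' on M1/M2/M3
print(f"coremltools: {ct.__version__}")

assert platform.machine() == "arm64", "tested on Apple Silicon only"
assert int(ct.__version__.split(".")[0]) >= 8, "upgrade: pip install -U coremltools"
```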
Script (put this in `quantize_coreml.py`)
```python
import coremltools as ct
from coremltools.optimize.coreml import (
    OpLinearQuantizerConfig,
    OptimizationConfig,
    linear_quantize_weights,
)

float16_model_path = "float16_model.mlpackage"
quant8_model_path = "quant8_model.mlpackage"

print(f"📦 Loading float16 model: {float16_model_path}")
model = ct.models.MLModel(float16_model_path)

# Symmetric linear weight quantization to int8, applied globally to all ops
op_config = OpLinearQuantizerConfig(mode="linear_symmetric")
config = OptimizationConfig(global_config=op_config)

print("⚡ Quantizing to 8 bits (int8)...")
try:
    quant8_model = linear_quantize_weights(model, config=config)
    quant8_model.save(quant8_model_path)
    print(f"✅ Quantized int8 model saved: {quant8_model_path}")
except Exception as e:
    print(f"❌ Quantization failed: {e}")
```
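Once it finishes, it's worth checking the actual size drop and that the quantized package still loads. A rough verification sketch (the `package_size_mb` helper is mine, not part of the script above):

```python
from pathlib import Path

import coremltools as ct


def package_size_mb(path: str) -> float:
    # An .mlpackage is a directory, so sum the sizes of the files inside it
    return sum(p.stat().st_size for p in Path(path).rglob("*") if p.is_file()) / 1e6


print(f"float16: {package_size_mb('float16_model.mlpackage'):.1f} MB")
print(f"int8:    {package_size_mb('quant8_model.mlpackage'):.1f} MB")

# Quick smoke test: the quantized package should load without errors.
# Running predict() would additionally need inputs matching the model's
# input description (token IDs etc. for TinyLlama).
quant8 = ct.models.MLModel("quant8_model.mlpackage")
print(quant8.get_spec().description.input)
```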
Full details & README
Check the full workflow, troubleshooting, and instructions here (with logs): GitHub - GreenBull31/quantize-coreml-tinyllama: Quantize TinyLlama CoreML on Mac ARM
⸻
Notes
- 4-bit quantization is only available when targeting iOS 18+ (it will fail otherwise); see the sketch after this list.
- Warnings like "inf/-inf not supported by quantization. Skipped." can be ignored; they are normal for some LLM weights.
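For reference, if your model does target iOS 18+ / macOS 15+, the same API accepts a 4-bit config. I haven't shipped this, so treat it as a sketch: the `dtype="int4"` / `granularity` / `block_size` arguments are coremltools 8.x options, `model` is the float16 MLModel loaded in the script above, and it will fail on models converted with an older deployment target (as noted above).

```python
from coremltools.optimize.coreml import (
    OpLinearQuantizerConfig,
    OptimizationConfig,
    linear_quantize_weights,
)

# Blockwise symmetric int4 weight quantization; requires an iOS 18+ target.
op_config_4bit = OpLinearQuantizerConfig(
    mode="linear_symmetric",
    dtype="int4",
    granularity="per_block",
    block_size=32,
)
config_4bit = OptimizationConfig(global_config=op_config_4bit)

quant4_model = linear_quantize_weights(model, config=config_4bit)
quant4_model.save("quant4_model.mlpackage")
```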
⸻
Feel free to comment, ask questions, or share your results. Happy quantizing!
Morgan (GreenBull31)