[Guide] Quantize LLM CoreML to int8 on Mac ARM (TinyLlama, May 2025, tested workflow & script)
Hi all,
Like many devs here, I spent a lot of time looking for a working, recent, and clear pipeline to quantize CoreML LLM models (.mlpackage) to int8 on Mac ARM (Apple Silicon).
Most existing guides are out of date, broken, or don't cover the current coremltools (8.x) + Python 3.11+ + Apple Silicon environment.
What works (May 2025, tested!)
- Platform: Mac ARM (M1/M2/M3), Python 3.11+, coremltools 8.3.0+ (see the environment check right after this list)
- Model: TinyLlama-1.1B-Chat-v0.3-CoreML (but same for many CoreML .mlpackage LLMs)
- Result: Quantized int8 model for on-device iOS/macOS inference, file size drop, less RAM
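Before running the script, a quick environment check can save time chasing install issues. This is just a minimal sketch, assuming coremltools is already installed in your Python 3.11+ virtualenv:

```python
import platform
import sys

import coremltools as ct

# The optimize.coreml quantization API used below needs coremltools 8.x;
# this workflow was tested on Apple Silicon (arm64) with Python 3.11+.
print(f"Python:      {sys.version.split()[0]}")
print(f"Machine:     {platform.machine()}")   # expect 'arm64' on M1/M2/M3
print(f"coremltools: {ct.__version__}")

assert platform.machine() == "arm64", "tested on Apple Silicon only"
assert int(ct.__version__.split(".")[0]) >= 8, "upgrade: pip install -U coremltools"
```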
Script (put this in `quantize_coreml.py`)
```python
import coremltools as ct
from coremltools.optimize.coreml import (
    OpLinearQuantizerConfig,
    OptimizationConfig,
    linear_quantize_weights,
)

float16_model_path = "float16_model.mlpackage"
quant8_model_path = "quant8_model.mlpackage"

print(f"📦 Loading float16 model: {float16_model_path}")
model = ct.models.MLModel(float16_model_path)

# Symmetric linear weight quantization to int8, applied globally to all ops
op_config = OpLinearQuantizerConfig(mode="linear_symmetric")
config = OptimizationConfig(global_config=op_config)

print("⚡ Quantizing to 8 bits (int8)...")
try:
    quant8_model = linear_quantize_weights(model, config=config)
    quant8_model.save(quant8_model_path)
    print(f"✅ Quantized int8 model saved: {quant8_model_path}")
except Exception as e:
    print(f"❌ Quantization failed: {e}")
```
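Once it finishes, it's worth checking the actual size drop and that the quantized package still loads. A rough verification sketch (the `package_size_mb` helper is mine, not part of the script above):

```python
from pathlib import Path

import coremltools as ct


def package_size_mb(path: str) -> float:
    # An .mlpackage is a directory, so sum the sizes of the files inside it
    return sum(p.stat().st_size for p in Path(path).rglob("*") if p.is_file()) / 1e6


print(f"float16: {package_size_mb('float16_model.mlpackage'):.1f} MB")
print(f"int8:    {package_size_mb('quant8_model.mlpackage'):.1f} MB")

# Quick smoke test: the quantized package should load without errors.
# Running predict() would additionally need inputs matching the model's
# input description (token IDs etc. for TinyLlama).
quant8 = ct.models.MLModel("quant8_model.mlpackage")
print(quant8.get_spec().description.input)
```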
Full details & README
Check the full workflow, troubleshooting, and instructions here (with logs): GitHub - GreenBull31/quantize-coreml-tinyllama: Quantize TinyLlama CoreML on Mac ARM
⸻
Notes
- 4-bit quantization is only available when targeting iOS 18+ (it will fail otherwise); see the sketch after this list.
- Warnings like "inf/-inf not supported by quantization. Skipped." can be ignored; they are normal for some LLM weights.
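For reference, if your model does target iOS 18+ / macOS 15+, the same API accepts a 4-bit config. I haven't shipped this, so treat it as a sketch: the `dtype="int4"` / `granularity` / `block_size` arguments are coremltools 8.x options, `model` is the float16 MLModel loaded in the script above, and it will fail on models converted with an older deployment target (as noted above).

```python
from coremltools.optimize.coreml import (
    OpLinearQuantizerConfig,
    OptimizationConfig,
    linear_quantize_weights,
)

# Blockwise symmetric int4 weight quantization; requires an iOS 18+ target.
op_config_4bit = OpLinearQuantizerConfig(
    mode="linear_symmetric",
    dtype="int4",
    granularity="per_block",
    block_size=32,
)
config_4bit = OptimizationConfig(global_config=op_config_4bit)

quant4_model = linear_quantize_weights(model, config=config_4bit)
quant4_model.save("quant4_model.mlpackage")
```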
⸻
Feel free to comment, ask questions, or share your results. Happy quantizing!
Morgan (GreenBull31)