Exporting LLMs with HuggingFace’s Optimum ExecuTorch

Optimum ExecuTorch provides a streamlined way to export Hugging Face transformer models to ExecuTorch format. It offers seamless integration with the Hugging Face ecosystem, making it easy to export models directly from the Hugging Face Hub.

Overview#

Optimum ExecuTorch supports a much wider variety of model architectures compared to ExecuTorch’s native export_llm API. While export_llm focuses on a limited set of highly optimized models (Llama, Qwen, Phi, and SmolLM) with advanced features like SpinQuant and attention sink, Optimum ExecuTorch can export diverse architectures including Gemma, Mistral, GPT-2, BERT, T5, Whisper, Voxtral, and many others.

Use Optimum ExecuTorch when:#

- You need a model architecture beyond export_llm's optimized set, such as Gemma, Mistral, GPT-2, BERT, T5, Whisper, or Voxtral.
- You want to export models directly from the Hugging Face Hub with minimal setup.

Use export_llm when:#

- You are targeting one of its highly optimized models (Llama, Qwen, Phi, SmolLM).
- You need its advanced features such as SpinQuant or attention sink.

See Exporting LLMs for details on using the native export_llm API.

Prerequisites#

Installation#

First, clone and install Optimum ExecuTorch from source:

```bash
git clone https://github.com/huggingface/optimum-executorch.git
cd optimum-executorch
pip install '.[dev]'
```

The dev install gives access to the latest features and optimizations: it installs executorch, torch, torchao, transformers, and other dependencies from nightly builds or source.
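
To confirm that the installation succeeded, a quick sanity check can help (this snippet is illustrative, not part of the official installation instructions):

```bash
# Optional sanity check: confirm the Python package imports cleanly
python -c "from optimum.executorch import ExecuTorchModelForCausalLM; print('optimum-executorch is installed')"
```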

Supported Models#

Optimum ExecuTorch supports a wide range of model architectures including decoder-only LLMs (Llama, Qwen, Gemma, Mistral, etc.), multimodal models, vision models, audio models (Whisper), encoder models (BERT, RoBERTa), and seq2seq models (T5).

For the complete list of supported models, see the Optimum ExecuTorch documentation.

Export Methods#

Optimum ExecuTorch offers two ways to export models: the optimum-cli command-line interface (Method 1) and the Python API (Method 2).

Method 1: CLI Export#

The CLI is the simplest way to export models. It provides a single command to convert models from Hugging Face Hub to ExecuTorch format.

Basic Export#

```bash
optimum-cli export executorch \
    --model "HuggingFaceTB/SmolLM2-135M-Instruct" \
    --task "text-generation" \
    --recipe "xnnpack" \
    --output_dir="./smollm2_exported"
```

With Optimizations#

Add custom SDPA, KV cache optimization, and quantization:

```bash
optimum-cli export executorch \
    --model "HuggingFaceTB/SmolLM2-135M-Instruct" \
    --task "text-generation" \
    --recipe "xnnpack" \
    --use_custom_sdpa \
    --use_custom_kv_cache \
    --qlinear 8da4w \
    --qembedding 8w \
    --output_dir="./smollm2_exported"
```

Available CLI Arguments#

Key arguments for LLM export include --model, --task, --recipe (backend), --use_custom_sdpa, --use_custom_kv_cache, --qlinear (linear quantization), --qembedding (embedding quantization), and --max_seq_len.

For the complete list of arguments, run:

```bash
optimum-cli export executorch --help
```
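
Method 2: Python API Export#

Models can also be exported programmatically through the Python API. The snippet below is a minimal sketch: it assumes that passing a recipe to ExecuTorchModelForCausalLM.from_pretrained exports the Hub checkpoint on the fly and that save_pretrained writes the resulting program to disk; treat the exact keyword arguments as assumptions and check the Optimum ExecuTorch documentation for the current API.

```python
from optimum.executorch import ExecuTorchModelForCausalLM

# Export directly from the Hugging Face Hub; `recipe` mirrors the CLI's --recipe flag.
# The keyword names here are assumptions in this sketch.
model = ExecuTorchModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM2-135M-Instruct",
    recipe="xnnpack",
)

# Persist the exported program so it can be reloaded later or deployed to device.
model.save_pretrained("./smollm2_exported")
```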

Optimization Options#

Custom Operators#

Optimum ExecuTorch includes custom SDPA (~3x speedup) and custom KV cache (~2.5x speedup) operators. Enable with --use_custom_sdpa and --use_custom_kv_cache.

Quantization#

Optimum ExecuTorch uses TorchAO for quantization. Common options include --qlinear 8da4w (int8 dynamic activations with int4 weights for linear layers) and --qembedding 8w or 4w (8-bit or 4-bit embedding weights).

Example:

```bash
optimum-cli export executorch \
    --model "meta-llama/Llama-3.2-1B" \
    --task "text-generation" \
    --recipe "xnnpack" \
    --use_custom_sdpa \
    --use_custom_kv_cache \
    --qlinear 8da4w \
    --qembedding 4w \
    --output_dir="./llama32_1b"
```

Backend Support#

Supported backends: xnnpack (CPU), coreml (Apple devices via Core ML), portable (reference baseline), and cuda (NVIDIA GPU). Select the backend with --recipe, as in the example below.
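
Switching the target backend is just a matter of changing the recipe. The sketch below shows a Core ML export; the other flags follow the same pattern as the XNNPACK examples above:

```bash
# Same export as before, but targeting Apple devices through the Core ML backend
optimum-cli export executorch \
    --model "HuggingFaceTB/SmolLM2-135M-Instruct" \
    --task "text-generation" \
    --recipe "coreml" \
    --output_dir="./smollm2_coreml"
```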

Exporting Different Model Types#

Optimum ExecuTorch supports various model architectures; each is exported with the --task value that matches its architecture, as in the sketch below.
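
For instance, an audio model such as Whisper would use a different task than a decoder-only LLM. The following is a sketch that assumes the task name automatic-speech-recognition is accepted for Whisper; check optimum-cli export executorch --help for the supported task list.

```bash
# Hypothetical example: the task name is an assumption, not taken from the text above
optimum-cli export executorch \
    --model "openai/whisper-tiny" \
    --task "automatic-speech-recognition" \
    --recipe "xnnpack" \
    --output_dir="./whisper_exported"
```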

For detailed examples of exporting each model type, see the Optimum ExecuTorch export guide.

Running Exported Models#

Verifying Output with Python#

After exporting, you can verify the model's output in Python before deploying to device. Use the classes from modeling.py, such as ExecuTorchModelForCausalLM for LLMs:

```python
from optimum.executorch import ExecuTorchModelForCausalLM
from transformers import AutoTokenizer

# Load the exported model
model = ExecuTorchModelForCausalLM.from_pretrained("./smollm2_exported")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")

# Generate text
generated_text = model.text_generation(
    tokenizer=tokenizer,
    prompt="Once upon a time",
    max_seq_len=128,
)
print(generated_text)
```

Running on Device#

After verifying that the model works correctly, deploy the exported ExecuTorch program (the .pte file in the output directory) to your target device using the ExecuTorch runtime.

Performance#

For performance benchmarks and on-device metrics, see the Optimum ExecuTorch benchmarks and the ExecuTorch Benchmark Dashboard.

Additional Resources#