FAQs — Model Optimizer 0.27.1

ModelOpt-Windows

ONNX PTQ

1. Why is the AWQ-scale search taking too long or stuck at 0% during ONNX INT4 quantization?

The AWQ-scale search should complete within minutes when NVIDIA GPU acceleration is available. If it appears stalled, first verify that GPU acceleration is actually being picked up on your system.

2. Why does a “CUDA EP not found” error show up during ONNX quantization?

The ONNX Runtime (ORT) build used by GenAI may conflict with the ORT build that ModelOpt-Windows requires; check which ORT packages are installed in the environment.
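As a quick check, the minimal sketch below lists every installed package whose name starts with onnxruntime; seeing several builds side by side (e.g., onnxruntime next to onnxruntime-gpu or onnxruntime-directml) is a common sign of such a conflict:

    # List all installed ONNX Runtime flavors; multiple conflicting builds are a
    # common cause of the "CUDA EP not found" error.
    from importlib.metadata import distributions

    ort_packages = sorted(
        dist.metadata["Name"]
        for dist in distributions()
        if (dist.metadata["Name"] or "").lower().startswith("onnxruntime")
    )
    print(ort_packages)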

3. Why does ORT-session creation fail for CUDA-EP despite having CUDA toolkit and cuDNN?

This usually results from mismatched CUDA and cuDNN versions or from their library directories missing from the system PATH. Ensure that the installed CUDA and cuDNN versions match those expected by the installed ONNX Runtime build and that their library directories are on PATH.
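A quick way to surface these problems is to ask ONNX Runtime which execution providers it can see and then create a session pinned to the CUDA EP; the model path below is a placeholder:

    import onnxruntime as ort

    # "CUDAExecutionProvider" must appear here for the CUDA EP to be usable at all.
    print(ort.get_available_providers())

    # Creating a session pinned to the CUDA EP surfaces version/PATH problems early;
    # "model.onnx" is a placeholder for your model file.
    session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
    print(session.get_providers())  # shows which EPs were actually registered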

4. Why does the quantized model’s size increase on re-runs?

Make sure that the output directory is clean before each quantization run; otherwise, the existing quantized model file may get appended to on each run, increasing the model’s size and possibly corrupting it.
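For example, a small helper like the following can be run before each quantization call to guarantee a clean output directory (the path is a placeholder):

    import shutil
    from pathlib import Path

    output_dir = Path("quantized_model")  # placeholder for your output directory
    if output_dir.exists():
        shutil.rmtree(output_dir)  # remove artifacts from previous runs
    output_dir.mkdir(parents=True)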

5. Running an INT4-quantized ONNX model on the DirectML backend gives the following error. What can be the issue?

Error: Unrecognized attribute: block_size for operator DequantizeLinear

ModelOpt-Windows uses ONNX’s DequantizeLinear (DQ) nodes. Both INT4 data-type support and the block_size attribute were added to the DequantizeLinear node in opset 21. Make sure that the quantized model’s opset version is 21 or higher. Refer to Apply Post Training Quantization (PTQ) for details.
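To check the declared opset of an already-quantized model, a few lines of the onnx Python API are enough; the file name below is a placeholder:

    import onnx

    model = onnx.load("quantized_model.onnx")  # placeholder path
    # The default (empty) / "ai.onnx" domain entry is the one that governs DequantizeLinear.
    default_opset = max(
        opset.version for opset in model.opset_import if opset.domain in ("", "ai.onnx")
    )
    print(default_opset)  # should be >= 21 for INT4 DequantizeLinear with block_size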

6. Running an INT4-quantized ONNX model on the DirectML backend gives the following kind of error. What can be the issue?

Error: Type ‘tensor(int4)’ of input parameter (onnx::MatMul_6508_i4) of operator (DequantizeLinear) in node (onnx::MatMul_6508_DequantizeLinear) is invalid.

One possible reason for the above error is that the INT4-quantized ONNX model’s opset version (default or ONNX domain) is less than 21. Ensure the INT4-quantized model’s opset version is 21 or higher, since INT4 data-type support in the DequantizeLinear ONNX node was introduced in opset 21.
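If the base (unquantized) model still declares an older opset, one option, assuming the ONNX version converter supports all ops in your graph, is to upgrade it to opset 21 before running PTQ; the file names are placeholders:

    import onnx
    from onnx import version_converter

    base_model = onnx.load("base_model.onnx")                     # placeholder path
    upgraded = version_converter.convert_version(base_model, 21)  # target opset 21
    onnx.save(upgraded, "base_model_opset21.onnx")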

7. Running an 8-bit quantized ONNX model with ORT-DML gives an onnxruntime error about using an 8-bit data type (e.g., INT8/FP8). What can be the issue?

Currently, the DirectML backend (ORT-DML) does not support 8-bit precision, so it is expected to complain about 8-bit data types. Try using ORT-CUDA or another backend that supports 8-bit precision.
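For example, an 8-bit quantized model can be run through the CUDA EP (with CPU as a fallback) instead of DirectML; the file name is a placeholder:

    import onnxruntime as ort

    # DirectML ("DmlExecutionProvider") rejects 8-bit Q/DQ models, so prefer the CUDA EP.
    session = ort.InferenceSession(
        "quantized_int8.onnx",  # placeholder path
        providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
    )
    print(session.get_providers())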

8. How do I resolve an onnxruntime error about invalid use of the FP8 type in a QuantizeLinear / DequantizeLinear node?

FP8 type support in the QuantizeLinear / DequantizeLinear nodes came with opset 19, so ensure that the ONNX model’s opset is 19 or higher.
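If the model is exported from PyTorch, the simplest way to guarantee this is to request the target opset at export time; the tiny model below is only a stand-in:

    import torch
    import torch.nn as nn

    model = nn.Linear(16, 16)         # stand-in for your real model
    dummy_input = torch.randn(1, 16)  # example input matching the model

    # Exporting with opset_version=19 (or higher) allows FP8 QuantizeLinear /
    # DequantizeLinear nodes in the resulting graph.
    torch.onnx.export(model, dummy_input, "model_opset19.onnx", opset_version=19)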

NAS/Pruning

Parallel search

If your score_func in mtn.search() supports parallel evaluation, you can take advantage of it by passing a DistributedDataParallel-wrapped module to search, as sketched below.
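A rough sketch of this pattern follows, assuming the usual mtn.search(model, constraints, dummy_input, config) call with a score_func entry in config; the constraint value, dummy input, and score function are illustrative placeholders, so check the search documentation for the exact arguments your setup needs.

    import torch.nn as nn
    import modelopt.torch.nas as mtn

    def score_func(model: nn.Module) -> float:
        # Placeholder: evaluate a candidate subnet (e.g., validation accuracy).
        # With a DistributedDataParallel-wrapped model, this evaluation can be
        # spread across all participating ranks.
        ...

    # Wrap the converted model in DistributedDataParallel (requires an initialized
    # process group) so that score_func can evaluate candidates in parallel.
    ddp_model = nn.parallel.DistributedDataParallel(converted_model)

    search_result = mtn.search(
        ddp_model,
        constraints={"flops": "60%"},       # illustrative constraint
        dummy_input=dummy_input,            # placeholder example input
        config={"score_func": score_func},  # plus any other config keys you need
    )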

Monkey-patched functions

During the conversion process (mtn.convert()), we use a monkey patch to augment the forward(), eval(), and train() methods of nn.Module. This renders the ModelOpt conversion process incompatible with other monkey patches to those methods.

Internally in mtn.convert, we do:

    model.forward = types.MethodType(nas_forward_func, model)
    model.train = types.MethodType(nas_train_func, model)

Known Issues

1. Potential memory leak for FSDP with use_orig_params=True

When using FSDP with use_orig_params=True in conjunction with ModelOpt-converted models, there is a potential memory leak during training. Please use use_orig_params=False to avoid this issue, as sketched below.
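A minimal sketch of the recommended wrapping, where converted_model stands in for your ModelOpt-converted model:

    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    # Wrap the ModelOpt-converted model with use_orig_params=False to avoid the
    # memory leak observed with use_orig_params=True.
    fsdp_model = FSDP(converted_model, use_orig_params=False)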