Torch-TensorRT weight dependent optimization
April 16, 2025, 10:58am 1
Description
I am currently working on accelerating inference for an image defect detection network. Torch-TensorRT gives great improvements; however, I notice that when I retrain the network I get different inference times for exactly the same architecture. The differences are not gigantic, but I have seen differences of a millisecond or more, which is significant for my application. Why do different weights give different inference times, and how can I optimize my weights for performance and inference time during training?
Environment
TensorRT Version: 10.3.0 (Torch-TensorRT version: 2.5.0)
GPU Type: NVIDIA V100
Nvidia Driver Version:
CUDA Version: 12.2
CUDNN Version: 9.1.0.70
Operating System + Version: Red Hat Enterprise Linux 8.10
Python Version: 3.11
PyTorch Version: 2.5.1
Steps To Reproduce
- Build and train a small CNN for image classification
- Compile it as a TensorRT network (the original compile call is not shown here; see the sketch after this list)
- Test inference time
- Re-train exactly the same network and compile as TensorRT network using the same code
- Test inference time
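Since the compile call was not included above, here is a minimal sketch of how a small CNN could be compiled with the Torch-TensorRT dynamo frontend. The model definition, input shape, and precision are assumptions for illustration, not the original code:

```python
import torch
import torch.nn as nn
import torch_tensorrt

# Assumed stand-in for the trained network: a small CNN for 28x28 MNIST-style images.
model = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(16 * 14 * 14, 10),
).eval().cuda()

# Compile with the Torch-TensorRT dynamo frontend.
# Batch size and FP32 precision are assumptions, not the original settings.
trt_model = torch_tensorrt.compile(
    model,
    ir="dynamo",
    inputs=[torch_tensorrt.Input((1, 1, 28, 28), dtype=torch.float32)],
    enabled_precisions={torch.float32},
)
```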
sophwats April 22, 2025, 11:33am 2
Hi @roy.van.doorn - can you share the differences in inference time in ms?
In general we don’t expect the inference time to vary much if there is no change in network topology. A good next step would be to check whether the Torch exported programs have the same architecture, and whether that graph is running end-to-end (E2E) on TensorRT - Torch-TensorRT doesn’t always run networks E2E on TRT.
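As a rough sketch of how this could be checked (assuming `model` is the trained network and `trt_model` is the output of `torch_tensorrt.compile(..., ir="dynamo")`; the `_run_on_acc_*` / `_run_on_gpu_*` submodule naming follows the Torch-TensorRT dynamo partitioner convention and may differ between versions):

```python
import torch

# Export the trained model and print its graph; two retrained checkpoints of the
# same architecture should produce identical graphs (only the weights differ).
example_input = (torch.randn(1, 1, 28, 28).cuda(),)
exported = torch.export.export(model, example_input)
print(exported)

# The compiled result is typically a GraphModule in which TensorRT-accelerated
# subgraphs show up as "_run_on_acc_*" submodules and PyTorch fallback subgraphs
# as "_run_on_gpu_*". Fallback submodules indicate the graph is not running E2E on TRT.
for name, _ in trt_model.named_children():
    print(name)
```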
Best,
Sophie
Hi,
To clarify the issue, I have attached a text document with the sample code I used to reproduce it (I am not allowed to share the real code due to company policy). In this code I train and evaluate a very basic CNN that performs a classification task on the MNIST dataset. During evaluation, the network is compiled with TensorRT before the inference speed is measured.
The results show that there is a difference in inference speed after retraining. In tests where the same network is retrained with the same number of epochs, the difference between the slowest and fastest model is up to 34%. In tests where the network is retrained with a different number of epochs, the difference gets up to 7%.
My main question is: why is there such a big difference in inference time (especially in the first case)? And what can I do to reduce or eliminate it?
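For reference, a minimal sketch of one way to report the per-inference latency in milliseconds (warm-up followed by CUDA-event timing); `trt_model` and the input shape are assumptions carried over from the compile sketch above, and the iteration counts are arbitrary:

```python
import torch

def measure_latency_ms(module, example_input, warmup=50, iters=500):
    """Return the mean per-call GPU latency in milliseconds."""
    with torch.no_grad():
        # Warm up so one-time initialization and caching do not skew the measurement.
        for _ in range(warmup):
            module(example_input)
        torch.cuda.synchronize()

        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            module(example_input)
        end.record()
        torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# Example usage:
# x = torch.randn(1, 1, 28, 28).cuda()
# print(f"{measure_latency_ms(trt_model, x):.3f} ms per inference")
```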
TensorRT_weight_tests_code.txt (9.3 KB)
Weight test results.docx (29.9 KB)