PyTorch First-time Inference Performance: Understanding the Overhead
When I execute the same neural network model multiple times in PyTorch, the first execution takes significantly longer than subsequent executions. This pattern is reproducible across different models.
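For reference, a minimal sketch of the measurement that reproduces this (assuming a machine with PyTorch and torchvision installed; resnet18 is just an illustrative choice, any network shows the same pattern):

```python
import time
import torch
import torchvision.models as models

device = "cuda" if torch.cuda.is_available() else "cpu"
model = models.resnet18().eval().to(device)
x = torch.randn(1, 3, 224, 224, device=device)

with torch.no_grad():
    for i in range(5):
        if device == "cuda":
            torch.cuda.synchronize()  # finish pending GPU work before timing
        start = time.perf_counter()
        model(x)
        if device == "cuda":
            torch.cuda.synchronize()  # wait for this forward pass to complete
        print(f"run {i}: {(time.perf_counter() - start) * 1e3:.1f} ms")
```

Run 0 is consistently much slower than runs 1–4.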
Root Cause Analysis: What is the primary reason for this first-time execution overhead? Is it mainly due to:
1. JIT compilation overhead: PyTorch’s dynamic nature requiring compilation during the first execution, or
2. Preprocessing and optimization: tasks such as data transfer, kernel selection, cuDNN algorithm search, memory allocation, etc.?
(My guess is that the main cost lies in preprocessing, e.g. the cuDNN algorithm search sketched below; is that right?)
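For instance, cuDNN’s benchmark mode makes one such preprocessing cost explicit (a sketch assuming a CUDA machine; the layer and shapes are arbitrary):

```python
import torch
import torch.nn as nn

torch.backends.cudnn.benchmark = True  # enable cuDNN autotuning

conv = nn.Conv2d(3, 64, kernel_size=3).cuda()
x = torch.randn(8, 3, 224, 224, device="cuda")

with torch.no_grad():
    conv(x)  # first call: cuDNN times candidate algorithms for this shape (slow)
    conv(x)  # same shape again: the cached fastest algorithm is reused (fast)
```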
Performance Measurement Strategy: For end-to-end inference acceleration research, should I:
1. Include the first-time overhead in my total inference time measurements, or
2. Focus only on steady-state performance (excluding the initial preprocessing/compilation time)?
Hi @lilil111,
1. If you are measuring total inference time for a real-time application where the model is executed only once, then include the first-time overhead: in that scenario it is a significant component of the total latency.
2. If you are measuring inference time over many executions (e.g., a batch or stream of inputs), then steady-state performance is more relevant: the first-time overhead is amortized across executions, so the steady-state numbers better represent the model’s performance.
In general, it is recommended to report both: the total inference time including the first-time overhead, and the steady-state performance excluding the initial preprocessing and compilation time. Together they give a more complete picture of the model’s performance and allow a fairer comparison with other models and optimization techniques.
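A minimal sketch of how such a two-number report could be collected (`benchmark` is a hypothetical helper, and the warm-up and iteration counts are arbitrary illustrative choices):

```python
import time
import statistics
import torch

def benchmark(model, x, warmup=10, iters=100):
    """Report cold-start and steady-state latency separately."""
    model.eval()
    use_cuda = x.is_cuda

    def timed_forward():
        if use_cuda:
            torch.cuda.synchronize()  # drain pending GPU work before timing
        start = time.perf_counter()
        with torch.no_grad():
            model(x)
        if use_cuda:
            torch.cuda.synchronize()  # wait for the forward pass to finish
        return time.perf_counter() - start

    cold = timed_forward()             # includes the first-run overhead
    for _ in range(warmup):
        timed_forward()                # warm-up runs, discarded
    steady = [timed_forward() for _ in range(iters)]

    print(f"cold start:   {cold * 1e3:.2f} ms")
    print(f"steady state: {statistics.median(steady) * 1e3:.2f} ms "
          f"(median of {iters} runs)")
```

Note that the cold-start number is only meaningful if the model has not been run before in the process, so in practice you would measure it once per fresh process.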
Please let me know if this helps.