Command-Line Programs#
trtexec#
Included in the samples directory is a command-line wrapper tool called trtexec. trtexec is a tool that lets you use TensorRT quickly without having to develop your own application. The trtexec tool has three main purposes:
- It’s useful for benchmarking networks on random or user-provided input data.
- It’s useful for generating serialized engines from models.
- It’s useful for generating a serialized timing cache from the builder.
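As a rough sketch of these three uses, assuming a model file named model.onnx (all file names below are placeholders, not files shipped with TensorRT):
Benchmark the model on random input data:
trtexec --onnx=model.onnx
Build and save a serialized engine:
trtexec --onnx=model.onnx --saveEngine=model.plan
Build while recording layer-timing results to a serialized timing cache:
trtexec --onnx=model.onnx --timingCacheFile=timing.cache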
Benchmarking Network#
If you have a model saved as an ONNX file, you can use the trtexec tool to test the performance of running inference on your network using TensorRT. The trtexec tool has many options for specifying inputs and outputs, iterations for performance timing, allowed precisions, and other settings.
To maximize GPU utilization, trtexec enqueues the inferences one batch ahead of time. In other words, it does the following:
enqueue batch 0 -> enqueue batch 1 -> wait until batch 0 is done -> enqueue batch 2 -> wait until batch 1 is done -> enqueue batch 3 -> wait until batch 2 is done -> enqueue batch 4 -> ...
If Cross-Inference Multi-Streaming (the --infStreams=N flag) is used, trtexec follows this pattern on each stream separately.
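For example, a benchmark that keeps two inference streams busy at once might look like this (model.onnx is a placeholder file name):
trtexec --onnx=model.onnx --infStreams=2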
The trtexec tool prints the following performance metrics. The following figure shows an example of an Nsight Systems profile of a trtexec run, with markers showing each performance metric.
- Throughput: The observed throughput is computed by dividing the number of inferences by the Total Host Walltime. If this is significantly lower than the reciprocal of the GPU Compute Time, the GPU may be underutilized because of host-side overheads or data transfers. Using CUDA graphs (with --useCudaGraph) or disabling H2D/D2H transfers (with --noDataTransfer) may improve GPU utilization, as shown in the example below. The output log indicates which flag to use when trtexec detects that the GPU is underutilized.
- Host Latency: The summation of H2D Latency, GPU Compute Time, and D2H Latency. This is the latency of a single inference.
- Enqueue Time: The host latency to enqueue an inference, including calling H2D/D2H CUDA APIs, running host-side heuristics, and launching CUDA kernels. If this is longer than the GPU Compute Time, the GPU may be underutilized, and the throughput may be dominated by host-side overhead. Using CUDA graphs (with --useCudaGraph) may reduce the Enqueue Time.
- H2D Latency: The latency for host-to-device data transfers for the input tensors of a single inference. Add --noDataTransfer to disable H2D/D2H data transfers.
- D2H Latency: The latency for device-to-host data transfers for the output tensors of a single inference. Add --noDataTransfer to disable H2D/D2H data transfers.
- GPU Compute Time: The GPU latency to execute the CUDA kernels for an inference.
- Total Host Walltime: The host walltime from when the first inference (after warm-ups) is enqueued to when the last inference is completed.
- Total GPU Compute Time: The summation of the GPU Compute Time of all the inferences. If this is significantly shorter than the Total Host Walltime, the GPU may be underutilized because of host-side overheads or data transfers.
Note
In the latest Nsight Systems, the GPU rows appear above the CPU rows rather than beneath them.
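As a sketch of a run that applies both of the utilization-related flags described above (model.onnx is a placeholder file name):
trtexec --onnx=model.onnx --useCudaGraph --noDataTransfer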
Add the --dumpProfile flag to trtexec to show per-layer performance profiles, which allows you to understand which layers in the network take the most time in GPU execution. The per-layer performance profiling also works with launching inference as a CUDA graph. In addition, build the engine with the --profilingVerbosity=detailed flag and add the --dumpLayerInfo flag to show detailed engine information, including per-layer details and binding information. This allows you to understand which operation each layer in the engine corresponds to, as well as its parameters.
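For example, a run that builds with detailed profiling verbosity and then dumps both the per-layer profile and the layer information might look like this (model.onnx is a placeholder file name):
trtexec --onnx=model.onnx --profilingVerbosity=detailed --dumpProfile --dumpLayerInfo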
Serialized Engine Generation#
If you generate a saved serialized engine file, you can pull it into another inference application. For example, you can use the NVIDIA Triton Inference Server to run the engine with multiple execution contexts from multiple threads in a fully pipelined asynchronous way to test parallel inference performance. There are some caveats; for example, in INT8 mode, trtexec sets random dynamic ranges for tensors unless the calibration cache file is provided with the --calib=<file> flag, so the resulting accuracy will not be as expected.
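A minimal sketch of this workflow, assuming an existing calibration cache named calib.cache (all file names are placeholders):
Build and save an INT8 engine using the calibration cache:
trtexec --onnx=model.onnx --int8 --calib=calib.cache --saveEngine=model_int8.plan
Later, reload the saved engine for benchmarking without rebuilding it:
trtexec --loadEngine=model_int8.plan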
Serialized Timing Cache Generation#
If you provide a timing cache file to the --timingCacheFile option, the builder can load existing profiling data from it and add new profiling data entries during layer profiling. The timing cache file can be reused in other builder instances to improve the execution time. Reuse this cache only with the same hardware/software configuration (for example, CUDA/cuDNN/TensorRT versions, device model, and clock frequency); otherwise, functional or performance issues may occur.
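As a sketch, with placeholder file names: the first build creates or extends the cache, and a later build on the same configuration reuses it to speed up tactic selection.
trtexec --onnx=model.onnx --timingCacheFile=timing.cache --saveEngine=model.plan
trtexec --onnx=model.onnx --timingCacheFile=timing.cache --saveEngine=model_rebuilt.plan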
Commonly Used Command-Line Flags#
This section lists the commonly used trtexec command-line flags.
Refer to trtexec --help for all the supported flags and detailed explanations.
Refer to the GitHub: trtexec/README.md file for detailed information about building this tool and examples of its usage.