Command-Line Programs — NVIDIA TensorRT Documentation

trtexec#

Included in the samples directory is a command-line wrapper tool called trtexec. trtexec lets you use TensorRT quickly without having to develop your own application. The trtexec tool has three main purposes: benchmarking networks on random or user-provided input data, generating serialized engines from models, and generating a serialized timing cache from the builder.

Benchmarking Network#

If you have a model saved as an ONNX file, you can use the trtexec tool to test the performance of running inference on your network using TensorRT. The trtexec tool has many options for specifying inputs and outputs, iterations for performance timing, allowed precisions, and more.
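
For example, assuming a model saved as model.onnx (a placeholder name), a command along the following lines benchmarks it with FP16 precision allowed in addition to FP32:

trtexec --onnx=model.onnx --fp16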

To maximize GPU utilization, trtexec enqueues the inferences one batch ahead of time. In other words, it does the following:

enqueue batch 0 -> enqueue batch 1 -> wait until batch 0 is done -> enqueue batch 2 -> wait until batch 1 is done -> enqueue batch 3 -> wait until batch 2 is done -> enqueue batch 4 -> ...

If cross-inference multi-streaming (the --infStreams=N flag) is used, trtexec follows this pattern on each stream separately.
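
As a sketch, the following command (again with model.onnx as a placeholder) runs inference with two streams, each following the enqueue-ahead pattern shown above:

trtexec --onnx=model.onnx --infStreams=2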

The trtexec tool prints a series of performance metrics for each run. The following figure shows an example of an Nsight Systems profile of a trtexec run, with markers showing each performance metric.

Note

In the latest Nsight Systems, the GPU rows appear above the CPU rows rather than beneath them.

Performance Metrics in a Normal trtexec Run under Nsight Systems
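
A profile like the one in the figure can be captured by running trtexec under the Nsight Systems CLI; as a minimal sketch, assuming nsys is installed and model.plan is a previously saved engine:

nsys profile --trace=cuda,nvtx -o trtexec_profile trtexec --loadEngine=model.plan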

Add the --dumpProfile flag to trtexec to show per-layer performance profiles, which allows users to understand which layers in the network take the most time in GPU execution. The per-layer performance profiling also works with launching inference as a CUDA graph. In addition, build the engine with the --profilingVerbosity=detailed flag and add the --dumpLayerInfo flag to show detailed engine information, including per-layer detail and binding information. This allows you to understand which operation each layer in the engine corresponds to and what its parameters are.
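
For example, a command along these lines (with model.onnx as a placeholder name) builds an engine with detailed profiling verbosity, prints the layer information, and reports per-layer timings after the timing runs:

trtexec --onnx=model.onnx --profilingVerbosity=detailed --dumpLayerInfo --dumpProfile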

Serialized Engine Generation#

If you save a serialized engine file, you can pull it into another application that runs inference. For example, you can use the NVIDIA Triton Inference Server to run the engine with multiple execution contexts from multiple threads in a fully pipelined asynchronous way to test parallel inference performance. There are some caveats; for example, in INT8 mode, trtexec sets random dynamic ranges for tensors unless the calibration cache file is provided with the --calib=<file> flag, so the resulting accuracy will not be as expected.
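
As a sketch with placeholder file names (model.onnx, calib.cache, model.plan), the first command below builds and saves an INT8 engine using a calibration cache, and the second reloads the serialized engine for timing:

trtexec --onnx=model.onnx --int8 --calib=calib.cache --saveEngine=model.plan
trtexec --loadEngine=model.plan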

Serialized Timing Cache Generation#

If you provide a timing cache file with the --timingCacheFile option, the builder loads existing profiling data from it and adds new profiling data entries during layer profiling. The timing cache file can be reused in other builder instances to improve the builder execution time. Reuse this cache only with the same hardware/software configuration (for example, the same CUDA/cuDNN/TensorRT versions, device model, and clock frequency); otherwise, functional or performance issues may occur.
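
For example (with placeholder file names), a build like the one below creates timing.cache if it does not yet exist and reuses and updates it on later builds with the same hardware/software configuration, shortening build time:

trtexec --onnx=model.onnx --timingCacheFile=timing.cache --saveEngine=model.plan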

Commonly Used Command-Line Flags#

This section lists the commonly used trtexec command-line flags.

Refer to trtexec --help for all the supported flags and detailed explanations.

Refer to the GitHub: trtexec/README.md file for detailed information about building this tool and examples of its usage.