Troubleshooting — NVIDIA TensorRT Documentation

The following sections help answer the most commonly asked questions regarding typical use cases.

For more assistance, contact your support engineer or post your questions on the NVIDIA Developer Forum for troubleshooting support.

FAQs#

This section helps troubleshoot common problems and answers the most frequently asked questions.

Q: How do I create an optimized engine for several batch sizes?

A: While TensorRT allows an engine optimized for a given batch size to run at any smaller size, the performance for those smaller sizes may not be as well optimized. To optimize for multiple batch sizes, create optimization profiles at the dimensions assigned to OptProfileSelector::kOPT.
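
For example, a minimal sketch (assuming an input tensor named "input" with a dynamic batch dimension, and existing builder and config objects) creates a profile covering batch sizes 1 through 16 with 8 as the optimization point:

nvinfer1::IOptimizationProfile* profile = builder->createOptimizationProfile();
profile->setDimensions("input", nvinfer1::OptProfileSelector::kMIN, nvinfer1::Dims4(1, 3, 224, 224));
profile->setDimensions("input", nvinfer1::OptProfileSelector::kOPT, nvinfer1::Dims4(8, 3, 224, 224));
profile->setDimensions("input", nvinfer1::OptProfileSelector::kMAX, nvinfer1::Dims4(16, 3, 224, 224));
config->addOptimizationProfile(profile);
// Additional profiles can be added for other batch-size ranges; at runtime, select
// one with IExecutionContext::setOptimizationProfileAsync().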

Q: Are calibration tables portable across TensorRT versions?

A: No. Internal implementations are continually optimized and can change between versions. For this reason, calibration tables are not guaranteed to be binary compatible with different versions of TensorRT. Applications must build new INT8 calibration tables when using a new version of TensorRT.

Q: Are engines portable across TensorRT versions?

A: By default, no. Refer to the Version Compatibility section for instructions on configuring engines for forward compatibility.

Q: How do I choose the optimal workspace size?

A: Some TensorRT algorithms require additional workspace on the GPU. The method IBuilderConfig::setMemoryPoolLimit() controls the maximum amount of workspace that can be allocated and prevents algorithms that require more workspace from being considered by the builder. At runtime, the space is allocated automatically when creating an IExecutionContext. The amount allocated is no more than is required, even if the amount set in IBuilderConfig::setMemoryPoolLimit() is much higher. Applications should, therefore, allow the TensorRT builder as much workspace as they can afford; at runtime, TensorRT allocates no more than this and typically less. The workspace size may need to be limited to less than the full device memory size if device memory is needed for other purposes during the engine build.
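
For example, a one-line sketch, assuming config is an existing IBuilderConfig*:

config->setMemoryPoolLimit(nvinfer1::MemoryPoolType::kWORKSPACE, 1ULL << 30); // allow up to 1 GiB of workspace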

Q: How do I use TensorRT on multiple GPUs?

A: Each ICudaEngine object is bound to a specific GPU when it is instantiated, either by the builder or on deserialization. To select the GPU, use cudaSetDevice() before calling the builder or deserializing the engine. Each IExecutionContext is bound to the same GPU as the engine from which it was created. When calling execute() or enqueue(), ensure that the thread is associated with the correct device by calling cudaSetDevice() if necessary.
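
For example, a minimal sketch, assuming runtime, engineData, and engineSize already exist:

cudaSetDevice(1); // bind the engine about to be deserialized to GPU 1
nvinfer1::ICudaEngine* engine = runtime->deserializeCudaEngine(engineData, engineSize);
nvinfer1::IExecutionContext* context = engine->createExecutionContext();
// In any thread that later calls execute() or enqueue() on this context:
cudaSetDevice(1); // ensure the calling thread targets the engine's GPU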

Q: How do I get the version of TensorRT from the library file?

A: There is a symbol in the symbol table named tensorrt_version_#_#_#_# which contains the TensorRT version number. One possible way to read this symbol on Linux is to use the nm command like in the following example:

$ nm -D libnvinfer.so.* | grep tensorrt_version
00000000abcd1234 B tensorrt_version_#_#_#_#

Q: What can I do if my network produces the wrong answer?

A: There are several reasons why your network might be producing incorrect answers, and several troubleshooting approaches can help diagnose the problem.

Note

Marking tensors as outputs can inhibit optimizations and, therefore, can change the results.

You can use NVIDIA Polygraphy to assist you with debugging and diagnosis.

Q: How do I implement batch normalization in TensorRT?

A: Batch normalization can be implemented using a sequence of IElementWiseLayer in TensorRT. More specifically:

adjustedScale = scale / sqrt(variance + epsilon)
batchNorm = (input + bias/adjustedScale - mean) * adjustedScale
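
A minimal sketch of this decomposition, assuming network is an existing INetworkDefinition* and that the per-channel constants (scale, bias, mean, and variance + epsilon) have already been added, for example with addConstant(), as tensors broadcastable to the input shape; note that the square root below uses an IUnaryLayer alongside the element-wise layers:

#include <NvInfer.h>

nvinfer1::ITensor* addBatchNorm(nvinfer1::INetworkDefinition* network, nvinfer1::ITensor* input,
                                nvinfer1::ITensor* scale, nvinfer1::ITensor* bias,
                                nvinfer1::ITensor* mean, nvinfer1::ITensor* varPlusEps)
{
    using namespace nvinfer1;
    // adjustedScale = scale / sqrt(variance + epsilon)
    ITensor* sqrtVar = network->addUnary(*varPlusEps, UnaryOperation::kSQRT)->getOutput(0);
    ITensor* adjustedScale = network->addElementWise(*scale, *sqrtVar, ElementWiseOperation::kDIV)->getOutput(0);
    // batchNorm = (input - mean) * adjustedScale + bias
    ITensor* centered = network->addElementWise(*input, *mean, ElementWiseOperation::kSUB)->getOutput(0);
    ITensor* scaled = network->addElementWise(*centered, *adjustedScale, ElementWiseOperation::kPROD)->getOutput(0);
    return network->addElementWise(*scaled, *bias, ElementWiseOperation::kSUM)->getOutput(0);
}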

Q: Why does my network run slower when using DLA than without DLA?

A: DLA was designed to maximize energy efficiency. Depending on the features supported by DLA and the features supported by the GPU, either implementation can be more performant. Your chosen implementation depends on your latency or throughput requirements and power budget. Since all DLA engines are independent of the GPU and each other, you could also use both implementations to increase the throughput of your network further.

Q: Does TensorRT support INT4 quantization or INT16 quantization?

A: TensorRT supports INT4 quantization for GEMM weight-only quantization. TensorRT does not support INT16 quantization.

Q: My network requires layer XYZ, which is not supported by the UFF parser. When will it be supported?

A: UFF is deprecated. We recommend users switch their workflows to ONNX. The TensorRT ONNX parser is an open-source project.

Q: Can I use multiple TensorRT builders to compile on different targets?

A: TensorRT assumes that all resources for the device it is building on are available for optimization purposes. Concurrent use of multiple TensorRT builders (for example, multiple trtexec instances) to compile on different targets (DLA0, DLA1, and GPU) can oversubscribe system resources causing undefined behavior (meaning, inefficient plans, builder failure, or system instability).

It is recommended to compile for different targets (DLA and GPU) separately and save their plan files using trtexec with the --saveEngine argument. Such plan files can then be reloaded (using trtexec with the --loadEngine argument) to submit multiple inference jobs on the respective targets (DLA0, DLA1, and GPU). This two-step process alleviates over-subscription of system resources during the build phase while allowing execution of the plan files to proceed without interference from the builder.
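
For example (model and plan file names here are illustrative; building for DLA also requires reduced precision such as --fp16):

trtexec --onnx=model.onnx --fp16 --useDLACore=0 --allowGPUFallback --saveEngine=dla0.plan
trtexec --onnx=model.onnx --saveEngine=gpu.plan
trtexec --loadEngine=dla0.plan --useDLACore=0
trtexec --loadEngine=gpu.plan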

Q: Which layers are accelerated by Tensor Cores?

A: Most math-bound operations are accelerated with Tensor Cores: convolution, deconvolution, fully connected, and matrix multiply. In some cases, particularly for small channel counts or small group sizes, another implementation may be faster and selected instead of a Tensor Core implementation.

Q: Why are reformatting layers observed although there is no warning message "no implementation obeys reformatting-free rules"?

A: Reformat-free network I/O does not mean reformatting layers are not inserted into the entire network. Only the input and output network tensors can be configured not to require reformatting layers; in other words, TensorRT can insert reformatting layers for internal tensors to improve performance.

Understanding Error Messages#

If an error occurs during execution, TensorRT reports an error message intended to help debug the problem. The following sections discuss some common error messages that developers can encounter.

ONNX Parser Error Messages

The following table captures the common ONNX parser error messages. For specific ONNX node support information, refer to the Operators’ support document.

TensorRT Core Library Error Messages

The following table captures the common TensorRT core library error messages.

Code Analysis Tools#

Compiler Sanitizers#

Google sanitizers are a set of code analysis tools.

Issues With dlopen And Address Sanitizer#

There is a known issue with sanitizers, which is documented here. When using dlopen on TensorRT under a sanitizer, there will be reports of memory leaks unless one of two solutions is adopted:

  1. Do not call dlclose when running under the sanitizers.
  2. Pass the flag RTLD_NODELETE to dlopen when running under sanitizers, as shown in the sketch below.
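
A minimal sketch of the second option, assuming the application loads libnvinfer.so manually (the library name and error handling are illustrative):

#include <dlfcn.h>
#include <cstdio>

int main()
{
    // RTLD_NODELETE keeps the library mapped after dlclose(), avoiding spurious
    // leak reports from Address Sanitizer.
    void* handle = dlopen("libnvinfer.so", RTLD_LAZY | RTLD_NODELETE);
    if (handle == nullptr)
    {
        std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }
    // ... resolve TensorRT entry points with dlsym() and run inference ...
    dlclose(handle);
    return 0;
}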

Issues with dlopen and Thread Sanitizer#

The thread sanitizer can list errors when using dlopen from multiple threads. To suppress this warning, create a file called tsan.supp and add the following to the file:

race::dlopen

When running applications under thread sanitizer, set the environment variable using:

export TSAN_OPTIONS="suppressions=tsan.supp"

Issues with CUDA and Address Sanitizer#

The address sanitizer has a known issue with CUDA applications, which is documented here. To successfully run CUDA libraries such as TensorRT under the address sanitizer, add the option protect_shadow_gap=0 to the ASAN_OPTIONS environment variable.

A known bug in CUDA 11.4 can trigger mismatched allocation and free errors in the address sanitizer. To disable these errors, add alloc_dealloc_mismatch=0 to ASAN_OPTIONS.
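
For example, both workarounds can be combined when launching the application (the binary name is illustrative):

ASAN_OPTIONS=protect_shadow_gap=0:alloc_dealloc_mismatch=0 ./sample_trt_app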

Issues with Undefined Behavior Sanitizer#

UndefinedBehaviorSanitizer (UBSan) reports false positives with the -fvisibility=hidden option, as documented here. Add the -fno-sanitize=vptr option to avoid UBSan reporting such false positives.
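
For example, a build invocation might look like this (compiler, source, and library names are illustrative):

g++ -fsanitize=undefined -fno-sanitize=vptr -fvisibility=hidden -o sample_trt_app sample_trt_app.cpp -lnvinfer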

Valgrind#

Valgrind is a framework for dynamic analysis tools that can automatically detect memory management and threading bugs in applications.

Some versions of Valgrind and glibc are affected by a bug, which causes false memory leaks to be reported when dlopen is used, which can generate spurious errors when running a TensorRT application under Valgrind’s memcheck tool. To work around this, add the following to a Valgrind suppressions file as documented here:

{
   Memory leak errors with dlopen
   Memcheck:Leak
   match-leak-kinds: definite
   ...
   fun:dlopen
   ...
}

A known bug in CUDA 11.4 can trigger mismatched allocation and free errors in Valgrind. To disable these errors, add the option --show-mismatched-frees=no to the Valgrind command line.
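
For example, assuming the suppressions above are saved in valgrind.supp (the binary name is illustrative):

valgrind --leak-check=full --suppressions=valgrind.supp --show-mismatched-frees=no ./sample_trt_app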

Compute Sanitizer#

When running a TensorRT application under compute-sanitizer, cuGetProcAddress can fail with error code 500 due to missing functions. This error can be ignored or suppressed with the --report-api-errors no option. This happens because CUDA backward compatibility checks whether a function is usable on the current CUDA toolkit/driver combination; the functions were introduced in a later CUDA version and are unavailable on the current platform.
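
For example (the binary name is illustrative):

compute-sanitizer --report-api-errors no ./sample_trt_app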

Understanding Formats Printed in Logs#

In logs from TensorRT, formats are printed as a type followed by stride and vectorization information. For example:

Half(60,1:8,12,3)

The Half indicates that the element type is DataType::kHALF (16-bit floating point), and the :8 indicates that the format packs eight elements per vector, with the vectorization along the second (C) axis. The rest of the numbers are strides in units of vectors. For this tensor, the mapping of a coordinate (n,c,h,w) to an address is:

((half*)base_address) + (60*n + 1*floor(c/8) + 12*h + 3*w) * 8 + (c mod 8)

The 1: is common to NHWC formats. Here is another example, this time for an NCHW format:

Int8(105,15:4,3,1)

The Int8 indicates that the element type is DataType::kINT8, and the :4 indicates a vector size of 4. For this tensor, the mapping of a coordinate (n,c,h,w) to an address is:

(int8_t*)base_address + (105*n + 15*floor(c/4) + 3*h + w) * 4 + (c mod 4)

Scalar formats have a vector size of 1. For brevity, printing omits the :1.

In general, the coordinates to address mappings have the following form:

(type*)base_address + (vec_coordinate · strides) * vec_size + vec_mod

Where strides are expressed in units of vectors, vec_size is the number of elements per vector (1 for scalar formats), vec_coordinate is the original coordinate with the component along the vectorized axis replaced by floor(coordinate / vec_size), and vec_mod is the original coordinate along the vectorized axis modulo vec_size (0 for scalar formats).
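
As a concrete check, the following sketch (a hypothetical helper, not part of the TensorRT API) computes the element offset for the Half(60,1:8,12,3) example above:

#include <cstddef>

// Offset, in elements, of coordinate (n, c, h, w) for the Half(60,1:8,12,3) format:
// strides {60, 1, 12, 3} are in units of vectors and the C axis is vectorized with vec_size 8.
std::size_t halfExampleOffset(std::size_t n, std::size_t c, std::size_t h, std::size_t w)
{
    const std::size_t vecSize = 8;
    const std::size_t vecOffset = 60 * n + 1 * (c / vecSize) + 12 * h + 3 * w; // vec_coordinate · strides
    return vecOffset * vecSize + (c % vecSize);                                // * vec_size + vec_mod
}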

Reporting TensorRT Issues#

If you encounter issues when using TensorRT, check the FAQs and the Understanding Error Messages sections to look for similar failing patterns. For example, many engine building failures can be solved by sanitizing and constant-folding the ONNX model using Polygraphy with the following command:

polygraphy surgeon sanitize model.onnx --fold-constants --output model_folded.onnx

In addition, if you have not already done so, try the latest TensorRT release before filing an issue, since the problem may already have been fixed in the latest release.

Channels for TensorRT Issue Reporting#

If neither the FAQs nor the Understanding Error Messages sections help, you can report the issue through the NVIDIA Developer Forum or the TensorRT GitHub Issue page. These channels are constantly monitored to provide feedback on the issues you encounter.

Here are the steps to report an issue on the NVIDIA Developer Forum:

  1. Register for the NVIDIA Developer website.
  2. Log in to the developer site.
  3. Click on your name in the upper right corner.
  4. Click My Account > My Bugs and select Submit a New Bug.
  5. Fill out the bug reporting page. Be descriptive and provide the steps to reproduce the problem.
  6. Click Submit a bug.

When reporting an issue, provide setup details and include information about the environment, such as the GPU, NVIDIA driver version, CUDA version, cuDNN version, TensorRT version, and operating system.

Depending on the type of the issue, providing the additional information described in the following sections can expedite the response and debugging process.

Reporting a Functional Issue#

When reporting functional issues, such as linker errors, segmentation faults, engine building failures, inference failures, and so on, provide the scripts and commands to reproduce the issue and a detailed description of the environment. Having more details helps us debug the functional issue faster.

Since a TensorRT engine is specific to a particular TensorRT version and GPU type, do not build an engine in one environment and run it in another environment with different GPUs or a different dependency software stack (TensorRT version, CUDA version, cuDNN version, and so on). Also, ensure the application is linked against the correct TensorRT and cuDNN shared object files by checking the environment variable LD_LIBRARY_PATH (or %PATH% on Windows).
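
For example, on Linux you can check which shared objects the application resolves at load time (the binary name is illustrative):

ldd sample_trt_app | grep -E 'nvinfer|cudnn|cudart'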

Reporting an Accuracy Issue#

When reporting an accuracy issue, provide the scripts and the commands used to calculate the accuracy metrics. Describe the expected accuracy level and share the steps to get the expected results using other frameworks like ONNX-Runtime.

The Polygraphy tool can debug the accuracy issue and produce a minimal failing case. For instructions, refer to the documentation on Debugging TensorRT Accuracy Issues. Having a Polygraphy command that shows the accuracy issue or having a minimal failing case expedites the time it takes for us to debug your accuracy issue.

Note that it is not practical to expect bitwise identical results between TensorRT and other frameworks like PyTorch, TensorFlow, or ONNX-Runtime even in FP32 precision since the order of the computations on the floating-point numbers can result in slight differences in output values. In practice, small numeric differences should not significantly affect the accuracy metric of the application, such as the mAP score for object-detection networks or the BLEU score for translation networks. If you see a significant drop in the accuracy metric between TensorRT and other frameworks such as PyTorch, TensorFlow, or ONNX-Runtime, it may be a genuine TensorRT bug.

If you are seeing NaNs or infinite values in the TensorRT engine output when FP16/BF16 precision is enabled, it is possible that intermediate layer outputs in the network overflow in FP16/BF16. Approaches to help mitigate this include normalizing inputs and weights so that intermediate values stay within a representable range, and constraining the overflow-prone layers to run in FP32.
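
For example, a minimal sketch that constrains one suspect layer to FP32, assuming network and config already exist and i is the index of the offending layer:

nvinfer1::ILayer* layer = network->getLayer(i);
layer->setPrecision(nvinfer1::DataType::kFLOAT);     // compute this layer in FP32
layer->setOutputType(0, nvinfer1::DataType::kFLOAT); // keep its output in FP32 as well
config->setFlag(nvinfer1::BuilderFlag::kOBEY_PRECISION_CONSTRAINTS);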

Polygraphy can help you diagnose common problems by using reduced precision. Refer to Polygraphy’s Working with Reduced Precision how-to guide for more information.

For possible solutions to accuracy issues, refer to the Improving Model Accuracy section and the Working with Quantized Types section for instructions about using INT8/FP8 precision.

Reporting a Performance Issue#

If you are reporting a performance issue, share the full trtexec logs using this command:

trtexec --onnx=<model.onnx> --verbose --profilingVerbosity=detailed --dumpLayerInfo --dumpProfile --separateProfileRun --useCudaGraph --noDataTransfers --useSpinWait --duration=60

The verbose logs help us to identify the performance issue. If possible, also share the Nsight Systems profiling files using these commands:

trtexec --onnx=<model.onnx> --verbose --profilingVerbosity=detailed --dumpLayerInfo --saveEngine=<engine_file>
nsys profile --cuda-graph-trace=node -o <output_profile> trtexec --loadEngine=<engine_file> --useCudaGraph --noDataTransfers --useSpinWait --warmUp=0 --duration=0 --iterations=20

Refer to the trtexec section for more instructions on using the trtexec tool and the meaning of these flags.

If you do not use trtexec to measure performance, provide the scripts and commands you use to measure it. Compare the performance measurement from your script with that from the trtexec tool. If the two numbers differ, your scripts may have some issues with the performance measurement methodology.

Refer to the Hardware/Software Environment for Performance Measurements section for some environmental factors affecting performance.

Footnotes