API Migration Guide — NVIDIA TensorRT Documentation
This section highlights the API changes between TensorRT 8.x and TensorRT 10.0. If you are unfamiliar with these changes, refer to our sample code for clarification.
Python
Python API Changes
Allocating Buffers and Using a Name-Based Engine API
TensorRT 8.x
```python
def allocate_buffers(self, engine):
    '''
    Allocates all buffers required for an engine, i.e., host/device inputs/outputs.
    '''
    inputs = []
    outputs = []
    bindings = []
    stream = cuda.Stream()

    # binding is the name of input/output
    for binding in engine:
        size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
        dtype = trt.nptype(engine.get_binding_dtype(binding))

        # Allocate host and device buffers
        host_mem = cuda.pagelocked_empty(size, dtype)  # page-locked memory buffer (won't swap to disk)
        device_mem = cuda.mem_alloc(host_mem.nbytes)

        # Append the device buffer address to device bindings.
        # When cast to int, it's a linear index into the context's memory (like memory address).
        bindings.append(int(device_mem))

        # Append to the appropriate input/output list.
        if engine.binding_is_input(binding):
            inputs.append(self.HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(self.HostDeviceMem(host_mem, device_mem))

    return inputs, outputs, bindings, stream
```
TensorRT 10.0
```python
def allocate_buffers(self, engine):
    '''
    Allocates all buffers required for an engine, i.e., host/device inputs/outputs.
    '''
    inputs = []
    outputs = []
    bindings = []
    stream = cuda.Stream()

    for i in range(engine.num_io_tensors):
        tensor_name = engine.get_tensor_name(i)
        size = trt.volume(engine.get_tensor_shape(tensor_name))
        dtype = trt.nptype(engine.get_tensor_dtype(tensor_name))

        # Allocate host and device buffers
        host_mem = cuda.pagelocked_empty(size, dtype)  # page-locked memory buffer (won't swap to disk)
        device_mem = cuda.mem_alloc(host_mem.nbytes)

        # Append the device buffer address to device bindings.
        # When cast to int, it's a linear index into the context's memory (like memory address).
        bindings.append(int(device_mem))

        # Append to the appropriate input/output list.
        if engine.get_tensor_mode(tensor_name) == trt.TensorIOMode.INPUT:
            inputs.append(self.HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(self.HostDeviceMem(host_mem, device_mem))

    return inputs, outputs, bindings, stream
```
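Both snippets assume a PyCUDA-based setup and a small HostDeviceMem helper, neither of which is shown on this page. A minimal sketch of what they rely on (the helper's exact definition here is an assumption, standing in for the equivalent class in the TensorRT samples):

```python
import pycuda.autoinit  # noqa: F401 -- importing creates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt

class HostDeviceMem:
    """Pairs a page-locked host buffer with its device allocation."""
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem
```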
Transition from enqueueV2 to enqueueV3 for Python
TensorRT 8.x
```python
# Allocate device memory for inputs.
d_inputs = [cuda.mem_alloc(input_nbytes) for binding in range(input_num)]

# Allocate device memory for outputs.
h_output = cuda.pagelocked_empty(output_nbytes, dtype=np.float32)
d_output = cuda.mem_alloc(h_output.nbytes)

# Transfer data from host to device.
cuda.memcpy_htod_async(d_inputs[0], input_a, stream)
cuda.memcpy_htod_async(d_inputs[1], input_b, stream)
cuda.memcpy_htod_async(d_inputs[2], input_c, stream)

# Run inference.
context.execute_async_v2(bindings=[int(d_inp) for d_inp in d_inputs] + [int(d_output)], stream_handle=stream.handle)

# Synchronize the stream.
stream.synchronize()
```
TensorRT 10.0
```python
# Allocate device memory for inputs.
d_inputs = [cuda.mem_alloc(input_nbytes) for binding in range(input_num)]

# Allocate device memory for outputs.
h_output = cuda.pagelocked_empty(output_nbytes, dtype=np.float32)
d_output = cuda.mem_alloc(h_output.nbytes)

# Transfer data from host to device.
cuda.memcpy_htod_async(d_inputs[0], input_a, stream)
cuda.memcpy_htod_async(d_inputs[1], input_b, stream)
cuda.memcpy_htod_async(d_inputs[2], input_c, stream)

# Set up tensor addresses.
bindings = [int(d_inputs[i]) for i in range(3)] + [int(d_output)]

for i in range(engine.num_io_tensors):
    context.set_tensor_address(engine.get_tensor_name(i), bindings[i])

# Run inference.
context.execute_async_v3(stream_handle=stream.handle)

# Synchronize the stream.
stream.synchronize()
```
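If the engine was built with dynamic shapes, the concrete input shapes must be set on the context before tensor addresses are bound and inference is enqueued. A minimal sketch, assuming an input tensor named "input_a" with an illustrative shape:

```python
# Hypothetical tensor name and shape -- substitute your engine's values.
context.set_input_shape("input_a", (1, 3, 224, 224))

# Once all input shapes are set, output shapes can be queried to size buffers
# (this assumes the last I/O tensor is an output).
output_name = engine.get_tensor_name(engine.num_io_tensors - 1)
output_shape = context.get_tensor_shape(output_name)
```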
Engine Building: Use Only build_serialized_network
TensorRT 8.x
```python
engine_bytes = None
try:
    engine_bytes = self.builder.build_serialized_network(self.network, self.config)
except AttributeError:
    engine = self.builder.build_engine(self.network, self.config)
    engine_bytes = engine.serialize()
    del engine
assert engine_bytes
```
TensorRT 10.0
```python
engine_bytes = self.builder.build_serialized_network(self.network, self.config)
if engine_bytes is None:
    log.error("Failed to create engine")
    sys.exit(1)
```
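For context, a minimal end-to-end build in the TensorRT 10.0 style might look like the following sketch; the ONNX path and workspace size are placeholders:

```python
import sys
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)  # networks are always explicit-batch in TensorRT 10
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:  # placeholder path
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i), file=sys.stderr)
        sys.exit(1)

config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1 GiB, illustrative

engine_bytes = builder.build_serialized_network(network, config)
if engine_bytes is None:
    sys.exit("Failed to create engine")
```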
Added Python APIs
Types
ExecutionContextAllocationStrategy
IGpuAsyncAllocator
InterfaceInfo
IPluginResource
IPluginV3
IStreamReader
IVersionedInterface
Methods and Properties
ICudaEngine.is_debug_tensor()
ICudaEngine.minimum_weight_streaming_budget
ICudaEngine.streamable_weights_size
ICudaEngine.weight_streaming_budget
IExecutionContext.get_debug_listener()
IExecutionContext.get_debug_state()
IExecutionContext.set_all_tensors_debug_state()
IExecutionContext.set_debug_listener()
IExecutionContext.set_tensor_debug_state()
IExecutionContext.update_device_memory_size_for_shapes()
IGpuAllocator.allocate_async()
IGpuAllocator.deallocate_async()
INetworkDefinition.add_plugin_v3()
INetworkDefinition.is_debug_tensor()
INetworkDefinition.mark_debug()
INetworkDefinition.unmark_debug()
IPluginRegistry.acquire_plugin_resource()
IPluginRegistry.all_creators
IPluginRegistry.deregister_creator()
IPluginRegistry.get_creator()
IPluginRegistry.register_creator()
IPluginRegistry.release_plugin_resource()
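As one illustration of the entries above, weight streaming is requested at build time and budgeted at runtime. A hedged sketch, reusing the builder objects from the build example earlier (the budget choice is illustrative, and weight streaming may additionally require a strongly typed network):

```python
# Build time: request a weight-streamable engine.
config.set_flag(trt.BuilderFlag.WEIGHT_STREAMING)
plan = builder.build_serialized_network(network, config)

# Runtime: choose a budget between the minimum and the total streamable size.
engine = trt.Runtime(logger).deserialize_cuda_engine(plan)
print(engine.minimum_weight_streaming_budget, engine.streamable_weights_size)
engine.weight_streaming_budget = engine.minimum_weight_streaming_budget  # illustrative
```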
Removed Python APIs
C++
C++ API Changes
Transition from enqueueV2 to enqueueV3 for C++
TensorRT 8.x
```cpp
// Create RAII buffer manager object.
samplesCommon::BufferManager buffers(mEngine);

auto context = SampleUniquePtr<nvinfer1::IExecutionContext>(mEngine->createExecutionContext());
if (!context)
{
    return false;
}

// Pick a random digit to try to infer.
srand(time(NULL));
int32_t const digit = rand() % 10;

// Read the input data into the managed buffers.
// There should be just 1 input tensor.
ASSERT(mParams.inputTensorNames.size() == 1);

if (!processInput(buffers, mParams.inputTensorNames[0], digit))
{
    return false;
}

// Create a CUDA stream to execute this inference.
cudaStream_t stream;
CHECK(cudaStreamCreate(&stream));

// Asynchronously copy data from host input buffers to device input buffers.
buffers.copyInputToDeviceAsync(stream);

// Asynchronously enqueue the inference work.
if (!context->enqueueV2(buffers.getDeviceBindings().data(), stream, nullptr))
{
    return false;
}

// Asynchronously copy data from device output buffers to host output buffers.
buffers.copyOutputToHostAsync(stream);

// Wait for the work in the stream to complete.
CHECK(cudaStreamSynchronize(stream));

// Release stream.
CHECK(cudaStreamDestroy(stream));
```
TensorRT 10.0
```cpp
// Create RAII buffer manager object.
samplesCommon::BufferManager buffers(mEngine);

auto context = SampleUniquePtr<nvinfer1::IExecutionContext>(mEngine->createExecutionContext());
if (!context)
{
    return false;
}

for (int32_t i = 0, e = mEngine->getNbIOTensors(); i < e; i++)
{
    auto const name = mEngine->getIOTensorName(i);
    context->setTensorAddress(name, buffers.getDeviceBuffer(name));
}

// Pick a random digit to try to infer.
srand(time(NULL));
int32_t const digit = rand() % 10;

// Read the input data into the managed buffers.
// There should be just 1 input tensor.
ASSERT(mParams.inputTensorNames.size() == 1);

if (!processInput(buffers, mParams.inputTensorNames[0], digit))
{
    return false;
}

// Create a CUDA stream to execute this inference.
cudaStream_t stream;
CHECK(cudaStreamCreate(&stream));

// Asynchronously copy data from host input buffers to device input buffers.
buffers.copyInputToDeviceAsync(stream);

// Asynchronously enqueue the inference work.
if (!context->enqueueV3(stream))
{
    return false;
}

// Asynchronously copy data from device output buffers to host output buffers.
buffers.copyOutputToHostAsync(stream);

// Wait for the work in the stream to complete.
CHECK(cudaStreamSynchronize(stream));

// Release stream.
CHECK(cudaStreamDestroy(stream));
```
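As on the Python side, engines built with dynamic shapes need concrete input shapes set on the context before enqueueV3. A minimal sketch; the tensor name and dimensions are assumptions:

```cpp
// Hypothetical tensor name and shape -- substitute your engine's values.
nvinfer1::Dims4 inputDims{1, 3, 224, 224};
if (!context->setInputShape("input", inputDims))
{
    return false;
}
```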
64-Bit Dimension Changes

The dimensions held by Dims changed from int32_t to int64_t. However, in TensorRT 10.0, TensorRT will generally reject networks that use dimensions exceeding the range of int32_t. The tensor type returned by IShapeLayer is now DataType::kINT64. Use ICastLayer to cast the result to a tensor of type DataType::kINT32 if 32-bit dimensions are required.

Inspect code that bitwise copies to and from Dims to ensure it is correct for int64_t dimensions.
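A hedged sketch of the cast described above; the network and input tensor are assumed to already exist:

```cpp
// IShapeLayer now produces DataType::kINT64; cast back to kINT32 when
// downstream consumers expect 32-bit dimensions.
nvinfer1::IShapeLayer* shape = network->addShape(*input);
nvinfer1::ICastLayer* cast =
    network->addCast(*shape->getOutput(0), nvinfer1::DataType::kINT32);
nvinfer1::ITensor* shape32 = cast->getOutput(0);
```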
Added C++ APIs
Enums
ActivationType::kGELU_ERF
ActivationType::kGELU_TANH
BuilderFlag::kREFIT_IDENTICAL
BuilderFlag::kSTRIP_PLAN
BuilderFlag::kWEIGHT_STREAMING
DataType::kINT4
LayerType::kPLUGIN_V3
Types
Dims64
ExecutionContextAllocationStrategy
IGpuAsyncAllocator
InterfaceInfo
IPluginResource
IPluginV3
IStreamReader
IVersionedInterface
Methods and Properties
getInferLibBuildVersion
getInferLibMajorVersion
getInferLibMinorVersion
getInferLibPatchVersion
ICudaEngine::createRefitter
ICudaEngine::getMinimumWeightStreamingBudget
ICudaEngine::getStreamableWeightsSize
ICudaEngine::getWeightStreamingBudget
ICudaEngine::isDebugTensor
ICudaEngine::setWeightStreamingBudget
IExecutionContext::getDebugListener
IExecutionContext::getTensorDebugState
IExecutionContext::setAllTensorsDebugState
IExecutionContext::setDebugListener
IExecutionContext::setOutputTensorAddress
IExecutionContext::setTensorDebugState
IExecutionContext::updateDeviceMemorySizeForShapes
IGpuAllocator::allocateAsync
IGpuAllocator::deallocateAsync
INetworkDefinition::addPluginV3
INetworkDefinition::isDebugTensor
INetworkDefinition::markDebug
INetworkDefinition::unmarkDebug
IPluginRegistry::acquirePluginResource
IPluginRegistry::deregisterCreator
IPluginRegistry::getAllCreators
IPluginRegistry::getCreator
IPluginRegistry::registerCreator
IPluginRegistry::releasePluginResource
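As one illustration of the methods above, a hedged sketch of budgeting weight streaming on a deserialized engine (error handling elided; the budget choice is illustrative):

```cpp
#include <algorithm>
#include <cstdint>

// "engine" is assumed to be an nvinfer1::ICudaEngine* deserialized from a
// plan built with BuilderFlag::kWEIGHT_STREAMING.
int64_t const minBudget = engine->getMinimumWeightStreamingBudget();
int64_t const total = engine->getStreamableWeightsSize();
// Any value in [minBudget, total] is valid; smaller budgets save GPU memory
// at the cost of streaming weights in during inference.
engine->setWeightStreamingBudget(std::min(total, minBudget * 2));  // illustrative choice
```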
Removed C++ APIs
Removed C++ Plugins
Removed Safety C++ APIs
trtexec
trtexec Flag Changes
Changes to the workspace and minTiming Flags
TensorRT 8.x
```shell
trtexec \
    --onnx=/path/to/model.onnx \
    --saveEngine=/path/to/engine.trt \
    --optShapes=input:$INPUT_SHAPE \
    --avgTiming=1 \
    --workspace=1024 \
    --minTiming=1
```
TensorRT 10.0
```shell
trtexec \
    --onnx=/path/to/model.onnx \
    --saveEngine=/path/to/engine.trt \
    --optShapes=input:$INPUT_SHAPE \
    --avgTiming=1 \
    --memPoolSize=workspace:1024
```
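The --memPoolSize flag can size several memory pools in one invocation; a hedged example (the pool names and sizes are illustrative, sizes are interpreted in MiB by default, and the exact pool-name spelling should be checked against trtexec --help for your version):

```shell
trtexec \
    --onnx=/path/to/model.onnx \
    --memPoolSize=workspace:1024,dlaSRAM:1
```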
Removed trtexec Flags
Deprecated trtexec Flags
--buildOnly
--explicitPrecision
--heuristic
--nvtxMode