Inference Protocols and APIs - NVIDIA Triton Inference Server

Clients can communicate with Triton using an HTTP/REST protocol, a GRPC protocol, or an in-process C API or its C++ wrapper.

HTTP/REST and GRPC Protocols#

Triton exposes both HTTP/REST and GRPC endpoints based on standard inference protocols that have been proposed by the KServe project. To fully enable all capabilities, Triton also implements HTTP/REST and GRPC extensions to the KServe inference protocol. The GRPC protocol also provides a bi-directional streaming version of the inference RPC, which allows a sequence of inference requests and responses to be sent over a single GRPC stream. We typically recommend using the unary version for inference requests. The streaming version should be used only if the situation demands it. Some such use cases are:

The HTTP/REST and GRPC protocols also provide endpoints to check server and model health, metadata, and statistics. Additional endpoints support model loading, model unloading, and inference. See the KServe and extension documentation for details.
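As a rough illustration of the HTTP/REST side, the sketch below assembles the URLs for the health, metadata, statistics, model control, and inference endpoints, and builds (but does not send) a JSON inference request. The paths follow the KServe v2 protocol and Triton's statistics and model repository extensions; the base URL `http://localhost:8000`, the model name `my_model`, and the input name `INPUT0` are assumptions made for the example, not values from this page.

```python
import json
import urllib.request


def triton_endpoints(base_url: str, model: str) -> dict:
    """Return commonly used HTTP/REST URLs for one model."""
    base = base_url.rstrip("/")
    return {
        # Health and metadata (KServe v2 protocol)
        "server_live": f"{base}/v2/health/live",
        "server_ready": f"{base}/v2/health/ready",
        "model_ready": f"{base}/v2/models/{model}/ready",
        "server_metadata": f"{base}/v2",
        "model_metadata": f"{base}/v2/models/{model}",
        # Statistics and model control (Triton extensions)
        "model_stats": f"{base}/v2/models/{model}/stats",
        "model_load": f"{base}/v2/repository/models/{model}/load",
        "model_unload": f"{base}/v2/repository/models/{model}/unload",
        # Inference (KServe v2 protocol)
        "infer": f"{base}/v2/models/{model}/infer",
    }


def build_infer_request(base_url: str, model: str, inputs: list) -> urllib.request.Request:
    """Build a POST request carrying a JSON inference body; nothing is sent."""
    body = json.dumps({"inputs": inputs}).encode("utf-8")
    return urllib.request.Request(
        triton_endpoints(base_url, model)["infer"],
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical model and tensor names, used only for illustration.
example_request = build_infer_request(
    "http://localhost:8000",
    "my_model",
    [{"name": "INPUT0", "shape": [1, 2], "datatype": "FP32", "data": [1.0, 2.0]}],
)
```

Passing `example_request` to `urllib.request.urlopen` would send it to a running server. In practice the Triton client libraries wrap these calls, so building requests by hand is mostly useful for debugging or for environments where the client library is unavailable.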

HTTP Options#

Triton provides the following configuration options for server-client network transactions over the HTTP protocol.

Compression#

Triton supports on-wire compression of HTTP requests and responses through its clients. See HTTP Compression for more details.
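To make the idea concrete, here is a minimal sketch of what a client does when it compresses an HTTP request body: the body is gzip-compressed and labeled with `Content-Encoding`, and `Accept-Encoding` signals that a compressed response is acceptable. The helper name `compress_request_body` is hypothetical, and the sketch uses only the Python standard library rather than the Triton client, which handles this internally.

```python
import gzip
import json


def compress_request_body(payload: dict, accept_compressed_response: bool = True):
    """Gzip a JSON payload and return (body, headers) for an HTTP request."""
    raw = json.dumps(payload).encode("utf-8")
    body = gzip.compress(raw)
    headers = {
        "Content-Type": "application/json",
        # Tells the server how the request body is encoded.
        "Content-Encoding": "gzip",
    }
    if accept_compressed_response:
        # Tells the server the client can decode a gzip response.
        headers["Accept-Encoding"] = "gzip"
    return body, headers


# Hypothetical input tensor, used only for illustration.
body, headers = compress_request_body(
    {"inputs": [{"name": "INPUT0", "shape": [4], "datatype": "INT32", "data": [0, 0, 0, 0]}]}
)
```

Compression trades CPU time for bandwidth, so it tends to help most with large, repetitive payloads and can be counterproductive for small requests.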

Mapping Triton Server Error Codes to HTTP Status Codes#

This table maps various Triton Server error codes to their corresponding HTTP status codes. It can be used as a reference guide for understanding how Triton Server errors are handled in HTTP responses.

GRPC Options#

Triton exposes various GRPC parameters for configuring the server-client network transactions. For usage of these options, refer to the output from tritonserver --help.

SSL/TLS#

These options can be used to configure a secured channel for communication. The server-side options include:

For client-side documentation, see Client-Side GRPC SSL/TLS.

For an overview of authentication in gRPC, refer to the gRPC authentication documentation.

Compression#

Triton allows the on-wire compression of request/response messages by exposing the following option on the server side:

For client-side documentation, see Client-Side GRPC Compression.

Compression can be used to reduce the amount of bandwidth used in server-client communication. For more details, see gRPC Compression.

GRPC KeepAlive#

Triton exposes GRPC KeepAlive parameters, with default values for both client and server as described in the gRPC keepalive documentation.

These options can be used to configure the KeepAlive settings:

For client-side documentation, see Client-Side GRPC KeepAlive.

GRPC Status Codes#

Triton implements GRPC error handling for streaming requests when a specific flag is enabled through headers. Upon encountering an error, Triton returns the appropriate GRPC error code and subsequently closes the stream.

GRPC status codes can be used for better visibility and monitoring. For more details, see gRPC Status Codes.

For client-side documentation, see Client-Side GRPC Status Codes.

Limit Endpoint Access (BETA)#

Triton users may want to restrict access to protocols or APIs that are provided by the GRPC or HTTP endpoints of a server. For example, users can provide one set of access credentials for inference APIs and another for model control APIs such as model loading and unloading.

The following options can be specified to declare a restricted protocol group (GRPC) or restricted API group (HTTP):

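--grpc-restricted-protocol=<protocol_1>,<protocol_2>,...:<restricted-key>=<restricted-value>
--http-restricted-api=<API_1>,<API_2>,...:<restricted-key>=<restricted-value>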

Each option can be specified multiple times to declare multiple groups of protocols or APIs with different restriction settings.
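When several restricted groups are needed, it can be easier to generate the arguments than to write them by hand. The sketch below assumes each option value has the shape `<name_1>,<name_2>,...:<key>=<value>`, that is, a comma-separated list of protocols or APIs, a colon, and one key/value pair. The helper `restricted_flag` and the key/value strings are hypothetical.

```python
def restricted_flag(option: str, names: list, key: str, value: str) -> str:
    """Format one restricted-group argument for the tritonserver command line."""
    if not names:
        raise ValueError("at least one protocol or API name is required")
    return f"--{option}={','.join(names)}:{key}={value}"


# Two groups with different credentials; key and value strings are placeholders.
args = [
    restricted_flag("grpc-restricted-protocol",
                    ["model-repository", "model-config"], "admin-key", "admin-value"),
    restricted_flag("grpc-restricted-protocol",
                    ["statistics", "trace"], "ops-key", "ops-value"),
]
```

Each element of `args` is one occurrence of the option, which matches the rule that the option may be repeated once per group.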

Example#

To start the server with a set of protocols and APIs restricted for admin usage, and the rest of the protocols and APIs left unrestricted, use the following command line arguments:

tritonserver --grpc-restricted-protocol=shared-memory,model-config,model-repository,statistics,trace:<admin-key>=<admin-value> \
             --http-restricted-api=shared-memory,model-config,model-repository,statistics,trace:<admin-key>=<admin-value> ...

GRPC requests to admin protocols require that an additional header triton-grpc-protocol-<admin-key> is provided with value <admin-value>. HTTP requests to admin APIs require that an additional header <admin-key> is provided with value <admin-value>.
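The header rule above can be sketched as a small helper that returns the extra header a client must attach for each transport. The function name `restricted_headers` and the example key and value are hypothetical; only the `triton-grpc-protocol-` prefix for GRPC and the bare key for HTTP come from the description above.

```python
def restricted_headers(transport: str, key: str, value: str) -> dict:
    """Return the extra header required to call a restricted protocol or API."""
    if transport == "grpc":
        # GRPC requests carry the key with a fixed prefix.
        return {f"triton-grpc-protocol-{key}": value}
    if transport == "http":
        # HTTP requests carry the key as the header name, unchanged.
        return {key: value}
    raise ValueError(f"unknown transport: {transport!r}")


# Placeholder credentials, used only for illustration.
grpc_headers = restricted_headers("grpc", "admin-key", "admin-value")
http_headers = restricted_headers("http", "admin-key", "admin-value")
```

The resulting dictionaries can be merged into whatever headers or metadata a client already sends with its admin requests.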