cuTENSOR Functions

Helper Functions#

The helper functions initialize cuTENSOR, create tensor descriptors, check error codes, and retrieve library and CUDA runtime versions.


cutensorCreate()#

cutensorStatus_t cutensorCreate(cutensorHandle_t *handle)#

Initializes the cuTENSOR library and allocates the memory for the library context.

The device associated with a particular cuTENSOR handle is assumed to remain unchanged after the cutensorCreate call. For cuTENSOR to use a different device, the application must first select that device by calling cudaSetDevice and then create another cuTENSOR handle, associated with the new device, by calling cutensorCreate again.

Moreover, each handle by default has a plan cache that caches cutensorPlan_t objects and evicts the least recently used entry when full; its default capacity is 64 plans, but it can be resized via cutensorHandleResizePlanCache if that is not enough. See the Plan Cache Guide for more information.

The user is responsible for calling cutensorDestroy to free the resources associated with the handle.

Parameters:

handle[out] Pointer to cutensorHandle_t

Return values:

CUTENSOR_STATUS_SUCCESS – on success and an error code otherwise
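
A minimal lifecycle sketch in C, using only calls documented in this section (the single-GPU setting and the printf-based error handling are illustrative assumptions):

#include <stdio.h>
#include <stdlib.h>
#include <cutensor.h>

int main(void)
{
    cutensorHandle_t handle;
    cutensorStatus_t status = cutensorCreate(&handle);   // allocate the library context
    if (status != CUTENSOR_STATUS_SUCCESS) {
        printf("cutensorCreate failed: %s\n", cutensorGetErrorString(status));
        return EXIT_FAILURE;
    }

    /* ... create tensor descriptors, operation descriptors, and plans here ... */

    cutensorDestroy(handle);                              // release all handle resources
    return EXIT_SUCCESS;
}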


cutensorDestroy()#

cutensorStatus_t cutensorDestroy(cutensorHandle_t handle)#

Frees all resources related to the provided library handle.

Parameters:

handle[inout] The cutensorHandle_t object that will be deallocated.

Return values:

CUTENSOR_STATUS_SUCCESS – on success and an error code otherwise


cutensorCreateTensorDescriptor()#

cutensorStatus_t cutensorCreateTensorDescriptor(
    const cutensorHandle_t handle,
    cutensorTensorDescriptor_t *desc,
    const uint32_t numModes,
    const int64_t extent[],
    const int64_t stride[],
    cutensorDataType_t dataType,
    uint32_t alignmentRequirement)#

Creates a tensor descriptor.

This allocates a small amount of host memory.

The user is responsible for calling cutensorDestroyTensorDescriptor() to free the associated resources once the tensor descriptor is no longer used.

Parameters:

Return values:

Pre:

extent and stride arrays must each contain at least sizeof(int64_t) * numModes bytes
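
A sketch of creating a descriptor for a dense, packed 3-mode float tensor. The variable handle is assumed to come from cutensorCreate, the 128-byte alignment is an assumption that matches pointers returned by cudaMalloc, and error checking is omitted for brevity:

int64_t extent[] = {32, 64, 128};        // extents of modes i, j, k
int64_t stride[] = {1, 32, 32 * 64};     // packed, generalized column-major layout
uint32_t numModes = 3;

cutensorTensorDescriptor_t desc;
cutensorCreateTensorDescriptor(handle, &desc, numModes, extent, stride,
                               CUTENSOR_R_32F,   // 32-bit real (float) elements
                               128);             // alignment (in bytes) of the data pointer

/* ... use desc to build operation descriptors ... */

cutensorDestroyTensorDescriptor(desc);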


cutensorDestroyTensorDescriptor()#

cutensorStatus_t cutensorDestroyTensorDescriptor(

cutensorTensorDescriptor_t desc,

)#

Frees all resources related to the provided tensor descriptor.

Parameters:

desc[inout] The cutensorTensorDescriptor_t object that will be deallocated.

Return values:

CUTENSOR_STATUS_SUCCESS – on success and an error code otherwise


cutensorGetErrorString()#

const char *cutensorGetErrorString(const cutensorStatus_t error)#

Returns the description string for an error code.

Parameters:

error[in] Error code to convert to string.

Return values:

The null-terminated error string.
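
A common error-checking pattern built on this function; the CHECK_CUTENSOR macro is not part of the cuTENSOR API, just a convenience sketch that later examples in this section reuse:

#include <stdio.h>
#include <stdlib.h>
#include <cutensor.h>

// Wrap each cuTENSOR call so that failures are reported with a readable message.
#define CHECK_CUTENSOR(call)                                          \
    do {                                                              \
        const cutensorStatus_t s_ = (call);                           \
        if (s_ != CUTENSOR_STATUS_SUCCESS) {                          \
            fprintf(stderr, "%s:%d cuTENSOR error: %s\n",             \
                    __FILE__, __LINE__, cutensorGetErrorString(s_));  \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)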


cutensorGetVersion()#

size_t cutensorGetVersion()#

Returns the version number of the cuTENSOR library.


cutensorGetCudartVersion()#

size_t cutensorGetCudartVersion()#

Returns version number of the CUDA runtime that cuTENSOR was compiled against.

Can be compared against the CUDA runtime version from cudaRuntimeGetVersion().
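
A small sketch that prints both library versions next to the installed CUDA runtime version (cudaRuntimeGetVersion belongs to the CUDA runtime API, not to cuTENSOR):

#include <stdio.h>
#include <cuda_runtime.h>
#include <cutensor.h>

int main(void)
{
    int runtimeVersion = 0;
    cudaRuntimeGetVersion(&runtimeVersion);   // version of the installed CUDA runtime

    printf("cuTENSOR version:                  %zu\n", cutensorGetVersion());
    printf("CUDA runtime compiled against:     %zu\n", cutensorGetCudartVersion());
    printf("CUDA runtime currently installed:  %d\n", runtimeVersion);
    return 0;
}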


Element-wise Operations#

The following functions perform element-wise operations between tensors.


cutensorCreateElementwiseTrinary()#

cutensorStatus_t cutensorCreateElementwiseTrinary(
    const cutensorHandle_t handle,
    cutensorOperationDescriptor_t *desc,
    const cutensorTensorDescriptor_t descA,
    const int32_t modeA[],
    cutensorOperator_t opA,
    const cutensorTensorDescriptor_t descB,
    const int32_t modeB[],
    cutensorOperator_t opB,
    const cutensorTensorDescriptor_t descC,
    const int32_t modeC[],
    cutensorOperator_t opC,
    const cutensorTensorDescriptor_t descD,
    const int32_t modeD[],
    cutensorOperator_t opAB,
    cutensorOperator_t opABC,
    const cutensorComputeDescriptor_t descCompute)#

This function creates an operation descriptor that encodes an elementwise trinary operation.

Said trinary operation has the following general form:

\[ D_{\Pi^C(i_0,i_1,...,i_n)} = \Phi_{ABC}(\Phi_{AB}(\alpha op_A(A_{\Pi^A(i_0,i_1,...,i_n)}), \beta op_B(B_{\Pi^B(i_0,i_1,...,i_n)})), \gamma op_C(C_{\Pi^C(i_0,i_1,...,i_n)})) \]

Where \( \alpha, \beta, \gamma \) are scalars; \( op_A, op_B, op_C \) are the unary element-wise operators opA, opB, and opC applied to A, B, and C; \( \Phi_{AB} \) and \( \Phi_{ABC} \) are the binary element-wise operators opAB and opABC; and \( \Pi^A, \Pi^B, \Pi^C \) denote the mode orderings given by modeA, modeB, and modeC.

Notice that the broadcasting (of a mode) can be achieved by simply omitting that mode from the respective tensor.

Moreover, modes may appear in any order, giving users greater flexibility. The only restrictions are:

Input tensors may be read even if the value of the corresponding scalar is zero.

Examples:

Call cutensorElementwiseTrinaryExecute to perform the actual operation.

Please use cutensorDestroyOperationDescriptor to deallocate the descriptor once it is no longer used.

Supported data-type combinations are:

Parameters:

Return values:


cutensorElementwiseTrinaryExecute()#

cutensorStatus_t cutensorElementwiseTrinaryExecute(
    const cutensorHandle_t handle,
    const cutensorPlan_t plan,
    const void *alpha,
    const void *A,
    const void *beta,
    const void *B,
    const void *gamma,
    const void *C,
    void *D,
    cudaStream_t stream)#

Performs an element-wise tensor operation for three input tensors (see cutensorCreateElementwiseTrinary).

This function performs an element-wise tensor operation of the form:

\[ D_{\Pi^C(i_0,i_1,...,i_n)} = \Phi_{ABC}(\Phi_{AB}(\alpha op_A(A_{\Pi^A(i_0,i_1,...,i_n)}), \beta op_B(B_{\Pi^B(i_0,i_1,...,i_n)})), \gamma op_C(C_{\Pi^C(i_0,i_1,...,i_n)})) \]

See cutensorCreateElementwiseTrinary() for details.

Parameters:

Return values:
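
A sketch that ties creation and execution together. It assumes a handle, float tensor descriptors descA, descB, descC whose extents are consistent with the mode labels below, device buffers A_d, B_d, C_d, D_d, and the CHECK_CUTENSOR macro sketched under cutensorGetErrorString; the mode labels, operators, and compute descriptor are illustrative choices:

// D = (alpha * A_permuted + beta * B) + gamma * C, with A stored in (c,b,a) order.
int32_t modeA[]   = {'c', 'b', 'a'};
int32_t modeBCD[] = {'a', 'b', 'c'};   // B, C, and D share this mode order

cutensorOperationDescriptor_t op;
CHECK_CUTENSOR(cutensorCreateElementwiseTrinary(handle, &op,
    descA, modeA,   CUTENSOR_OP_IDENTITY,
    descB, modeBCD, CUTENSOR_OP_IDENTITY,
    descC, modeBCD, CUTENSOR_OP_IDENTITY,
    descC, modeBCD,                      // D reuses C's descriptor and modes
    CUTENSOR_OP_ADD, CUTENSOR_OP_ADD,    // opAB and opABC
    CUTENSOR_COMPUTE_DESC_32F));

cutensorPlanPreference_t pref;
CHECK_CUTENSOR(cutensorCreatePlanPreference(handle, &pref,
    CUTENSOR_ALGO_DEFAULT, CUTENSOR_JIT_MODE_NONE));

cutensorPlan_t plan;
CHECK_CUTENSOR(cutensorCreatePlan(handle, &plan, op, pref, 0 /* workspaceSizeLimit */));

float alpha = 1.0f, beta = 1.0f, gamma = 1.0f;
CHECK_CUTENSOR(cutensorElementwiseTrinaryExecute(handle, plan,
    &alpha, A_d, &beta, B_d, &gamma, C_d, D_d, 0 /* default stream */));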


cutensorCreateElementwiseBinary()#

cutensorStatus_t cutensorCreateElementwiseBinary(
    const cutensorHandle_t handle,
    cutensorOperationDescriptor_t *desc,
    const cutensorTensorDescriptor_t descA,
    const int32_t modeA[],
    cutensorOperator_t opA,
    const cutensorTensorDescriptor_t descC,
    const int32_t modeC[],
    cutensorOperator_t opC,
    const cutensorTensorDescriptor_t descD,
    const int32_t modeD[],
    cutensorOperator_t opAC,
    const cutensorComputeDescriptor_t descCompute)#

This function creates an operation descriptor for an elementwise binary operation.

The binary operation has the following general form:

\[ D_{\Pi^C(i_0,i_1,...,i_n)} = \Phi_{AC}(\alpha \Psi_A(A_{\Pi^A(i_0,i_1,...,i_n)}), \gamma \Psi_C(C_{\Pi^C(i_0,i_1,...,i_n)})) \]

Call cutensorElementwiseBinaryExecute to perform the actual operation.

Supported data-type combinations are:

Parameters:

Return values:


cutensorElementwiseBinaryExecute()#

cutensorStatus_t cutensorElementwiseBinaryExecute(
    const cutensorHandle_t handle,
    const cutensorPlan_t plan,
    const void *alpha,
    const void *A,
    const void *gamma,
    const void *C,
    void *D,
    cudaStream_t stream)#

Performs an element-wise tensor operation for two input tensors (see cutensorCreateElementwiseBinary).

This function performs an element-wise tensor operation of the form:

\[ D_{\Pi^C(i_0,i_1,...,i_n)} = \Phi_{AC}(\alpha \Psi_A(A_{\Pi^A(i_0,i_1,...,i_n)}), \gamma \Psi_C(C_{\Pi^C(i_0,i_1,...,i_n)})) \]

See cutensorCreateElementwiseBinary() for details.

Parameters:

Return values:
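
The workflow mirrors the trinary case above: create the operation descriptor, a plan preference, and a plan, then execute. A condensed sketch under the same assumptions (pre-existing handle, descriptors, mode arrays, device buffers, plan, and the CHECK_CUTENSOR macro):

// D = alpha * A_permuted + gamma * C
cutensorOperationDescriptor_t op;
CHECK_CUTENSOR(cutensorCreateElementwiseBinary(handle, &op,
    descA, modeA, CUTENSOR_OP_IDENTITY,
    descC, modeC, CUTENSOR_OP_IDENTITY,
    descC, modeC,                        // D reuses C's descriptor and modes
    CUTENSOR_OP_ADD,                     // opAC
    CUTENSOR_COMPUTE_DESC_32F));

// ... create the plan preference and the plan exactly as in the trinary example, then:
float alpha = 1.0f, gamma = 1.0f;
CHECK_CUTENSOR(cutensorElementwiseBinaryExecute(handle, plan,
    &alpha, A_d, &gamma, C_d, D_d, 0 /* default stream */));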


cutensorCreatePermutation()#

cutensorStatus_t cutensorCreatePermutation(
    const cutensorHandle_t handle,
    cutensorOperationDescriptor_t *desc,
    const cutensorTensorDescriptor_t descA,
    const int32_t modeA[],
    cutensorOperator_t opA,
    const cutensorTensorDescriptor_t descB,
    const int32_t modeB[],
    const cutensorComputeDescriptor_t descCompute)#

This function creates an operation descriptor for a tensor permutation.

The tensor permutation has the following general form:

\[ B_{\Pi^B(i_0,i_1,...,i_n)} = \alpha op_A(A_{\Pi^A(i_0,i_1,...,i_n)}) \]

Consequently, this function performs an out-of-place tensor permutation and is a specialization of cutensorCreateElementwiseBinary.

Where \( \alpha \) is a scalar, \( op_A \) is the unary element-wise operator opA, and \( \Pi^A, \Pi^B \) denote the mode orderings given by modeA and modeB.

Broadcasting (of a mode) can be achieved by simply omitting that mode from the respective tensor.

Modes may appear in any order. The only restrictions are:

Supported data-type combinations are:

Parameters:

Return values:


cutensorPermute()#

cutensorStatus_t cutensorPermute(
    const cutensorHandle_t handle,
    const cutensorPlan_t plan,
    const void *alpha,
    const void *A,
    void *B,
    const cudaStream_t stream)#

Performs the tensor permutation that is encoded by plan (see cutensorCreatePermutation).

This function performs an elementwise tensor operation of the form:

\[ B_{\Pi^B(i_0,i_1,...,i_n)} = \alpha \Psi(A_{\Pi^A(i_0,i_1,...,i_n)}) \]

Consequently, this function performs an out-of-place tensor permutation.

Where \( \alpha \) is a scalar, \( \Psi \) is the unary element-wise operator that was selected (via opA) when the operation descriptor was created, and \( \Pi^A, \Pi^B \) denote the mode orderings of A and B.

Parameters:

Return values:
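
A sketch of a transpose-like permutation B(j,i) = alpha * A(i,j), assuming a handle, matching float descriptors descA and descB, device buffers A_d and B_d, and the CHECK_CUTENSOR macro from earlier:

int32_t modeA[] = {'i', 'j'};
int32_t modeB[] = {'j', 'i'};            // B stores the transposed order

cutensorOperationDescriptor_t op;
CHECK_CUTENSOR(cutensorCreatePermutation(handle, &op,
    descA, modeA, CUTENSOR_OP_IDENTITY,
    descB, modeB,
    CUTENSOR_COMPUTE_DESC_32F));

cutensorPlanPreference_t pref;
CHECK_CUTENSOR(cutensorCreatePlanPreference(handle, &pref,
    CUTENSOR_ALGO_DEFAULT, CUTENSOR_JIT_MODE_NONE));

cutensorPlan_t plan;
CHECK_CUTENSOR(cutensorCreatePlan(handle, &plan, op, pref, 0 /* workspaceSizeLimit */));

float alpha = 1.0f;
CHECK_CUTENSOR(cutensorPermute(handle, plan, &alpha, A_d, B_d, 0 /* default stream */));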


Contraction Operations#

The following functions perform contractions between tensors.


cutensorCreateContraction()#

cutensorStatus_t cutensorCreateContraction(
    const cutensorHandle_t handle,
    cutensorOperationDescriptor_t *desc,
    const cutensorTensorDescriptor_t descA,
    const int32_t modeA[],
    cutensorOperator_t opA,
    const cutensorTensorDescriptor_t descB,
    const int32_t modeB[],
    cutensorOperator_t opB,
    const cutensorTensorDescriptor_t descC,
    const int32_t modeC[],
    cutensorOperator_t opC,
    const cutensorTensorDescriptor_t descD,
    const int32_t modeD[],
    const cutensorComputeDescriptor_t descCompute)#

This function allocates a cutensorOperationDescriptor_t object that encodes a tensor contraction of the form \( D = \alpha \mathcal{A} \mathcal{B} + \beta \mathcal{C} \).

Allocates data for desc to be used to perform a tensor contraction of the form

\[ \mathcal{D}_{{modes}_\mathcal{D}} \gets \alpha op_\mathcal{A}(\mathcal{A}_{{modes}_\mathcal{A}}) op_\mathcal{B}(\mathcal{B}_{{modes}_\mathcal{B}}) + \beta op_\mathcal{C}(\mathcal{C}_{{modes}_\mathcal{C}}). \]

See cutensorCreatePlan (or cutensorCreatePlanAutotuned) to create the plan (i.e., to select the kernel) followed by a call to cutensorContract to perform the actual contraction.

The user is responsible for calling cutensorDestroyOperationDescriptor to free the resources associated with the descriptor.

Supported data-type combinations are:

Parameters:

Return values:


cutensorContract()#

cutensorStatus_t cutensorContract(
    const cutensorHandle_t handle,
    const cutensorPlan_t plan,
    const void *alpha,
    const void *A,
    const void *B,
    const void *beta,
    const void *C,
    void *D,
    void *workspace,
    uint64_t workspaceSize,
    cudaStream_t stream)#

This routine computes the tensor contraction \( D = \alpha A B + \beta C \).

\[ \mathcal{D}_{{modes}_\mathcal{D}} \gets \alpha \mathcal{A}_{{modes}_\mathcal{A}} \mathcal{B}_{{modes}_\mathcal{B}} + \beta \mathcal{C}_{{modes}_\mathcal{C}} \]

The active CUDA device must match the CUDA device that was active at the time at which the plan was created.

Example:

See NVIDIA/CUDALibrarySamples for a concrete example.

Parameters:

Return values:
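
An end-to-end sketch of C(m,n) = alpha * A(m,k) * B(k,n) + beta * C(m,n). It assumes a handle, float descriptors descA, descB, descC with matching extents, device buffers A_d, B_d, C_d, and the CHECK_CUTENSOR macro; the compute descriptor, algorithm, and workspace preference are illustrative defaults:

int32_t modeA[] = {'m', 'k'};
int32_t modeB[] = {'k', 'n'};
int32_t modeC[] = {'m', 'n'};

cutensorOperationDescriptor_t op;
CHECK_CUTENSOR(cutensorCreateContraction(handle, &op,
    descA, modeA, CUTENSOR_OP_IDENTITY,
    descB, modeB, CUTENSOR_OP_IDENTITY,
    descC, modeC, CUTENSOR_OP_IDENTITY,
    descC, modeC,                        // D reuses C's descriptor and modes
    CUTENSOR_COMPUTE_DESC_32F));

cutensorPlanPreference_t pref;
CHECK_CUTENSOR(cutensorCreatePlanPreference(handle, &pref,
    CUTENSOR_ALGO_DEFAULT, CUTENSOR_JIT_MODE_NONE));

uint64_t workspaceSize = 0;
CHECK_CUTENSOR(cutensorEstimateWorkspaceSize(handle, op, pref,
    CUTENSOR_WORKSPACE_DEFAULT, &workspaceSize));

cutensorPlan_t plan;
CHECK_CUTENSOR(cutensorCreatePlan(handle, &plan, op, pref, workspaceSize));

void *work = NULL;
if (workspaceSize > 0)
    cudaMalloc(&work, workspaceSize);

float alpha = 1.0f, beta = 1.0f;
CHECK_CUTENSOR(cutensorContract(handle, plan,
    &alpha, A_d, B_d, &beta, C_d, C_d,   // D == C: accumulate into C in place
    work, workspaceSize, 0 /* default stream */));

cudaFree(work);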


cutensorCreateContractionTrinary()#

cutensorStatus_t cutensorCreateContractionTrinary(
    const cutensorHandle_t handle,
    cutensorOperationDescriptor_t *desc,
    const cutensorTensorDescriptor_t descA,
    const int32_t modeA[],
    cutensorOperator_t opA,
    const cutensorTensorDescriptor_t descB,
    const int32_t modeB[],
    cutensorOperator_t opB,
    const cutensorTensorDescriptor_t descC,
    const int32_t modeC[],
    cutensorOperator_t opC,
    const cutensorTensorDescriptor_t descD,
    const int32_t modeD[],
    cutensorOperator_t opD,
    const cutensorTensorDescriptor_t descE,
    const int32_t modeE[],
    const cutensorComputeDescriptor_t descCompute)#

This function allocates a cutensorOperationDescriptor_t object that encodes a tensor contraction of the form \( \mathcal{E} = \alpha \mathcal{A} \mathcal{B} \mathcal{C} + \beta \mathcal{D} \).

Allocates data for desc to be used to perform a tensor contraction of the form

\[ \mathcal{E}_{{modes}_\mathcal{E}} \gets \alpha op_\mathcal{A}(\mathcal{A}_{{modes}_\mathcal{A}}) op_\mathcal{B}(\mathcal{B}_{{modes}_\mathcal{B}}) op_\mathcal{C}(\mathcal{C}_{{modes}_\mathcal{C}}) + \beta op_\mathcal{D}(\mathcal{D}_{{modes}_\mathcal{D}}). \]

See cutensorCreatePlan (or cutensorCreatePlanAutotuned) to create the plan (i.e., to select the kernel) followed by a call to cutensorContractTrinary to perform the actual contraction.

The user is responsible for calling cutensorDestroyOperationDescriptor to free the resources associated with the descriptor.

The performance improvements from this API are currently most pronounced when the data resides in host memory (i.e., out-of-core), in particular on Grace-based systems.

Supported data-type combinations are:

Parameters:

Return values:


cutensorContractTrinary()#

cutensorStatus_t cutensorContractTrinary(
    const cutensorHandle_t handle,
    const cutensorPlan_t plan,
    const void *alpha,
    const void *A,
    const void *B,
    const void *C,
    const void *beta,
    const void *D,
    void *E,
    void *workspace,
    uint64_t workspaceSize,
    cudaStream_t stream)#

This routine computes the tensor contraction \( E = \alpha A B C + \beta D \).

\[ \mathcal{E}_{{modes}_\mathcal{E}} \gets \alpha \mathcal{A}_{{modes}_\mathcal{A}} \mathcal{B}_{{modes}_\mathcal{B}} \mathcal{C}_{{modes}_\mathcal{C}} + \beta \mathcal{D}_{{modes}_\mathcal{D}} \]

The active CUDA device must match the CUDA device that was active at the time at which the plan was created.

Example:

See NVIDIA/CUDALibrarySamples for a concrete example.

Parameters:

Return values:
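
The call sequence mirrors the pairwise contraction example above; only the operation creation and the execute call differ. A condensed sketch under the same assumptions (pre-existing handle, descriptors, mode arrays, device buffers, plan, workspace, and the CHECK_CUTENSOR macro):

// E(m,n) = alpha * A(m,k) * B(k,l) * C(l,n) + beta * D(m,n)
cutensorOperationDescriptor_t op;
CHECK_CUTENSOR(cutensorCreateContractionTrinary(handle, &op,
    descA, modeA, CUTENSOR_OP_IDENTITY,
    descB, modeB, CUTENSOR_OP_IDENTITY,
    descC, modeC, CUTENSOR_OP_IDENTITY,
    descD, modeD, CUTENSOR_OP_IDENTITY,
    descD, modeD,                        // E reuses D's descriptor and modes
    CUTENSOR_COMPUTE_DESC_32F));

// ... create the plan preference, estimate the workspace, and create the plan as before, then:
float alpha = 1.0f, beta = 0.0f;
CHECK_CUTENSOR(cutensorContractTrinary(handle, plan,
    &alpha, A_d, B_d, C_d, &beta, D_d, E_d,
    work, workspaceSize, 0 /* default stream */));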


Reduction Operations#

The following functions perform tensor reductions.


cutensorCreateReduction()#

cutensorStatus_t cutensorCreateReduction(
    const cutensorHandle_t handle,
    cutensorOperationDescriptor_t *desc,
    const cutensorTensorDescriptor_t descA,
    const int32_t modeA[],
    cutensorOperator_t opA,
    const cutensorTensorDescriptor_t descC,
    const int32_t modeC[],
    cutensorOperator_t opC,
    const cutensorTensorDescriptor_t descD,
    const int32_t modeD[],
    cutensorOperator_t opReduce,
    const cutensorComputeDescriptor_t descCompute)#

Creates a cutensorOperationDescriptor_t object that encodes a tensor reduction of the form \( D = \alpha \, \mathrm{opReduce}(\mathrm{opA}(A)) + \beta \, \mathrm{opC}(C) \).

For example, this function enables users to reduce an entire tensor to a scalar: C[] = alpha * A[i,j,k];

This function is also able to perform partial reductions; for instance: C[i,j] = alpha * A[k,j,i]; in this case only elements along the k-mode are contracted.

The binary opReduce operator provides extra control over the kind of reduction to be performed. For instance, setting opReduce to CUTENSOR_OP_ADD reduces the elements of A via summation, while CUTENSOR_OP_MAX finds the largest element in A.

Supported data-type combinations are:

Parameters:

Return values:


cutensorReduce()#

cutensorStatus_t cutensorReduce(
    const cutensorHandle_t handle,
    const cutensorPlan_t plan,
    const void *alpha,
    const void *A,
    const void *beta,
    const void *C,
    void *D,
    void *workspace,
    uint64_t workspaceSize,
    cudaStream_t stream)#

Performs the tensor reduction that is encoded by plan (see cutensorCreateReduction).

Parameters:

Return values:

CUTENSOR_STATUS_SUCCESS – The operation completed successfully.
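
A sketch of the partial reduction C(i,j) = alpha * sum_k A(k,j,i) discussed above. It assumes a handle, float descriptors descA (3 modes) and descC (2 modes), device buffers A_d and C_d, and the CHECK_CUTENSOR macro; D aliases C:

int32_t modeA[] = {'k', 'j', 'i'};
int32_t modeC[] = {'i', 'j'};            // 'k' is omitted, so it is reduced over

cutensorOperationDescriptor_t op;
CHECK_CUTENSOR(cutensorCreateReduction(handle, &op,
    descA, modeA, CUTENSOR_OP_IDENTITY,
    descC, modeC, CUTENSOR_OP_IDENTITY,
    descC, modeC,
    CUTENSOR_OP_ADD,                     // opReduce: summation
    CUTENSOR_COMPUTE_DESC_32F));

cutensorPlanPreference_t pref;
CHECK_CUTENSOR(cutensorCreatePlanPreference(handle, &pref,
    CUTENSOR_ALGO_DEFAULT, CUTENSOR_JIT_MODE_NONE));

uint64_t workspaceSize = 0;
CHECK_CUTENSOR(cutensorEstimateWorkspaceSize(handle, op, pref,
    CUTENSOR_WORKSPACE_DEFAULT, &workspaceSize));

cutensorPlan_t plan;
CHECK_CUTENSOR(cutensorCreatePlan(handle, &plan, op, pref, workspaceSize));

void *work = NULL;
if (workspaceSize > 0)
    cudaMalloc(&work, workspaceSize);

float alpha = 1.0f, beta = 0.0f;
CHECK_CUTENSOR(cutensorReduce(handle, plan,
    &alpha, A_d, &beta, C_d, C_d, work, workspaceSize, 0 /* default stream */));

cudaFree(work);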


Generic Operation Functions#

The following functions are generic and work with all the different operations.


cutensorDestroyOperationDescriptor()#

cutensorStatus_t cutensorDestroyOperationDescriptor(
    cutensorOperationDescriptor_t desc)#

Frees all resources related to the provided descriptor.

Parameters:

desc[inout] The cutensorOperationDescriptor_t object that will be deallocated.

Return values:

CUTENSOR_STATUS_SUCCESS – on success and an error code otherwise


cutensorOperationDescriptorGetAttribute()#

cutensorStatus_t cutensorOperationDescriptorGetAttribute(
    const cutensorHandle_t handle,
    cutensorOperationDescriptor_t desc,
    cutensorOperationDescriptorAttribute_t attr,
    void *buf,
    size_t sizeInBytes)#

This function retrieves an attribute of the provided cutensorOperationDescriptor_t object (see cutensorOperationDescriptorAttribute_t).

Parameters:

Return values:
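
A sketch of querying which data type the scalars alpha and beta must have for an operation. It assumes a handle, an operation descriptor op, and the CHECK_CUTENSOR macro; CUTENSOR_OPERATION_DESCRIPTOR_SCALAR_TYPE is assumed to be the relevant value of cutensorOperationDescriptorAttribute_t (see that enum for the full list):

cutensorDataType_t scalarType;
CHECK_CUTENSOR(cutensorOperationDescriptorGetAttribute(handle, op,
    CUTENSOR_OPERATION_DESCRIPTOR_SCALAR_TYPE,
    &scalarType, sizeof(scalarType)));

if (scalarType == CUTENSOR_R_32F) {
    // alpha and beta must be passed as 32-bit floats for this operation.
    float alpha = 1.0f, beta = 0.0f;
    /* ... */
}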


cutensorOperationDescriptorSetAttribute()#

cutensorStatus_t cutensorOperationDescriptorSetAttribute(
    const cutensorHandle_t handle,
    cutensorOperationDescriptor_t desc,
    cutensorOperationDescriptorAttribute_t attr,
    const void *buf,
    size_t sizeInBytes)#

Sets an attribute of a cutensorOperationDescriptor_t object.

Parameters:

Return values:


cutensorCreatePlanPreference()#

cutensorStatus_t cutensorCreatePlanPreference(
    const cutensorHandle_t handle,
    cutensorPlanPreference_t *pref,
    cutensorAlgo_t algo,
    cutensorJitMode_t jitMode)#

Allocates the cutensorPlanPreference_t, enabling users to limit the applicable kernels for a given plan/operation.

Parameters:


cutensorDestroyPlanPreference()#

cutensorStatus_t cutensorDestroyPlanPreference(
    cutensorPlanPreference_t pref)#

Frees all resources related to the provided preference.

Parameters:

pref[inout] The cutensorPlanPreference_t object that will be deallocated.

Return values:

CUTENSOR_STATUS_SUCCESS – on success and an error code otherwise


cutensorPlanPreferenceSetAttribute()#

cutensorStatus_t cutensorPlanPreferenceSetAttribute(
    const cutensorHandle_t handle,
    cutensorPlanPreference_t pref,
    cutensorPlanPreferenceAttribute_t attr,
    const void *buf,
    size_t sizeInBytes)#

Sets an attribute of a cutensorPlanPreference_t object.

Parameters:

Return values:


cutensorEstimateWorkspaceSize()#

cutensorStatus_t cutensorEstimateWorkspaceSize(
    const cutensorHandle_t handle,
    const cutensorOperationDescriptor_t desc,
    const cutensorPlanPreference_t planPref,
    const cutensorWorksizePreference_t workspacePref,
    uint64_t *workspaceSizeEstimate)#

Determines the required workspaceSize for the given operation encoded by desc.

Parameters:

Return values:


cutensorCreatePlan()#

cutensorStatus_t cutensorCreatePlan(
    const cutensorHandle_t handle,
    cutensorPlan_t *plan,
    const cutensorOperationDescriptor_t desc,
    const cutensorPlanPreference_t pref,
    uint64_t workspaceSizeLimit)#

This function allocates a cutensorPlan_t object, selects an appropriate kernel for a given operation (encoded by desc) and prepares a plan that encodes the execution.

This function applies cuTENSOR’s heuristic to select a candidate/kernel for a given operation (created by either cutensorCreateContraction, cutensorCreateReduction, cutensorCreatePermutation, cutensorCreateElementwiseBinary, cutensorCreateElementwiseTrinary, or cutensorCreateContractionTrinary). The created plan can then be passed to cutensorContract, cutensorReduce, cutensorPermute, cutensorElementwiseBinaryExecute, cutensorElementwiseTrinaryExecute, or cutensorContractTrinary to perform the actual operation.

The plan is created for the active CUDA device.

Note: cutensorCreatePlan must not be captured via CUDA graphs if Just-In-Time compilation is enabled (i.e., cutensorJitMode_t is not CUTENSOR_JIT_MODE_NONE).

Parameters:

Return values:


cutensorDestroyPlan()#

cutensorStatus_t cutensorDestroyPlan(cutensorPlan_t plan)#

Frees all resources related to the provided plan.

Parameters:

plan[inout] The cutensorPlan_t object that will be deallocated.

Return values:

CUTENSOR_STATUS_SUCCESS – on success and an error code otherwise


cutensorPlanGetAttribute()#

cutensorStatus_t cutensorPlanGetAttribute(
    const cutensorHandle_t handle,
    const cutensorPlan_t plan,
    cutensorPlanAttribute_t attr,
    void *buf,
    size_t sizeInBytes)#

Retrieves information about an already-created plan (see cutensorPlanAttribute_t).

Parameters:

Return values:
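
A sketch of querying how much workspace the selected kernel actually requires, which may be less than the limit passed to cutensorCreatePlan. It assumes a handle, a plan, and the CHECK_CUTENSOR macro; CUTENSOR_PLAN_REQUIRED_WORKSPACE is assumed to be the relevant value of cutensorPlanAttribute_t:

uint64_t requiredWorkspace = 0;
CHECK_CUTENSOR(cutensorPlanGetAttribute(handle, plan,
    CUTENSOR_PLAN_REQUIRED_WORKSPACE,
    &requiredWorkspace, sizeof(requiredWorkspace)));

// Allocating only what the plan actually needs can save device memory.
void *work = NULL;
if (requiredWorkspace > 0)
    cudaMalloc(&work, requiredWorkspace);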


Logger Functions#

cutensorLoggerSetCallback()#

cutensorStatus_t cutensorLoggerSetCallback(
    cutensorLoggerCallback_t callback)#

This function sets the logging callback routine.

Parameters:

callback[in] Pointer to a callback function. Check cutensorLoggerCallback_t.


cutensorLoggerSetFile()#

cutensorStatus_t cutensorLoggerSetFile(FILE *file)#

This function sets the logging output file.

Parameters:

file[in] An open file with write permission.


cutensorLoggerOpenFile()#

cutensorStatus_t cutensorLoggerOpenFile(const char *logFile)#

This function opens a logging output file in the given path.

Parameters:

logFile[in] Path to the logging output file.


cutensorLoggerSetLevel()#

cutensorStatus_t cutensorLoggerSetLevel(int32_t level)#

This function sets the value of the logging level.

Parameters:

level[in] Log level, should be one of the following:


cutensorLoggerSetMask()#

cutensorStatus_t cutensorLoggerSetMask(int32_t mask)#

This function sets the value of the log mask.

Parameters:

mask[in] Log mask, the bitwise OR of the following:


cutensorLoggerForceDisable()#

cutensorStatus_t cutensorLoggerForceDisable()#

This function disables logging for the entire run.
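
A minimal logging setup sketch, reusing the CHECK_CUTENSOR macro from earlier; the file name is arbitrary and the level value 5 is an assumption chosen for illustration (consult the level list of cutensorLoggerSetLevel for the exact semantics):

// Route cuTENSOR log output to a file and raise the verbosity.
CHECK_CUTENSOR(cutensorLoggerOpenFile("cutensor_log.txt"));
CHECK_CUTENSOR(cutensorLoggerSetLevel(5));   // assumed to be a verbose level

/* ... run cuTENSOR operations; their log records are written to cutensor_log.txt ... */

// Logging can also be switched off for the remainder of the run:
CHECK_CUTENSOR(cutensorLoggerForceDisable());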