attention - Dot-product attention - MATLAB

Dot-product attention

Since R2022b

Syntax

Description

The attention operation focuses on parts of the input using weighted multiplication operations.

Y = attention(queries,keys,values,numHeads) applies the dot-product attention operation to the specified queries, keys, and values using the number of attention heads numHeads. The queries input argument must be a formatted dlarray object.

[Y,weights] = attention(queries,keys,values,numHeads) applies the dot-product attention operation and also returns the attention weights.


[Y,weights] = attention(queries,keys,values,numHeads,DataFormat=FMT) applies the dot-product attention operation to the unformatted dlarray object queries with format specified by FMT. For example, DataFormat="CBT" specifies data in the format "CBT" (channel, batch, time).


[Y,weights] = attention(queries,keys,values,numHeads,Name=Value) specifies additional options using one or more name-value arguments. For example, DropoutProbability=0.01 specifies a dropout probability of 0.01.


Examples


Apply Attention Operation

Specify the sizes of the queries, keys, and values.

querySize = 100;
valueSize = 120;
numQueries = 64;
numValues = 80;
numObservations = 32;

Create random arrays containing the queries, keys, and values. For the queries, specify the dlarray format "CBT" (channel, batch, time).

queries = dlarray(rand(querySize,numObservations,numQueries),"CBT");
keys = dlarray(rand(querySize,numObservations,numValues));
values = dlarray(rand(valueSize,numObservations,numValues));

Specify the number of attention heads.
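The number of heads must evenly divide the channel sizes of the queries, keys, and values; the value used here is one illustrative choice.

numHeads = 5;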

Apply the attention operation.

[Y,weights] = attention(queries,keys,values,numHeads);

View the sizes and format of the output.
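For example, check Y using the size and dims functions:

size(Y)
dims(Y)

Y has format "CBT" and size valueSize-by-numObservations-by-numQueries: the size of the "C" dimension comes from values, and the size of the "T" dimension comes from queries.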

View the sizes and format of the weights.
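The weights output is an unformatted dlarray object, so the dims function (an illustrative check) returns an empty character array.

dims(weights)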

ans =

0x0 empty char array

Create Multihead Self Attention Function

You can use the attention function to implement the multihead self attention operation [1] that focuses on parts of the input.

Create the multiheadSelfAttention function, listed in the Multihead Self Attention Function section of the example. The multiheadSelfAttention function takes as input the data X, the number of heads, and the learnable weights for the queries, keys, values, and output data, and returns the multihead attention values.

The X input must be an unformatted dlarray object, where the first dimension corresponds to the input channels, the second dimension corresponds to the batch dimension, and the third dimension corresponds to the time or spatial dimension.

Create an array of sequence data.

numChannels = 10;
numObservations = 128;
numTimeSteps = 100;

X = rand(numChannels,numObservations,numTimeSteps);
X = dlarray(X);
size(X)

Specify the number of heads for multihead attention.
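Any positive integer works here, because the channel size of the projected queries, keys, and values is numChannels*numHeads; the value 5 is an illustrative choice.

numHeads = 5;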

Initialize the learnable parameters for multihead attention.

outputSize = numChannels*numHeads;

WQ = rand(outputSize,numChannels);
WK = rand(outputSize,numChannels);
WV = rand(outputSize,numChannels);
WO = rand(outputSize,outputSize);

Apply the multihead self attention operation.

Y = multiheadSelfAttention(X,numHeads,WQ,WK,WV,WO);

View the size of the output. The output has size (numChannels*numHeads)-by-numObservations-by-numTimeSteps.
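For example:

size(Y)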

Multihead Self Attention Function

The multiheadSelfAttention function takes as input the data X, the number of heads, and the learnable weights for the queries, keys, values, and output data, and returns the multihead attention values.

function Y = multiheadSelfAttention(X,numHeads,WQ,WK,WV,WO)

queries = pagemtimes(WQ,X);
keys = pagemtimes(WK,X);
values = pagemtimes(WV,X);

A = attention(queries,keys,values,numHeads,DataFormat="CBT");

Y = pagemtimes(WO,A);

end

Create Luong Attention Function

You can use the attention function to create a function that applies the Luong attention operation to its input. Create the luongAttention function, listed at the end of the example, that applies the Luong attention operation.

Specify the array sizes.

numHiddenUnits = 100;
latentSize = 16;

Create random arrays containing the input data.

hiddenState = dlarray(rand(numHiddenUnits,1));
Z = dlarray(rand(latentSize,1));
weights = dlarray(rand(numHiddenUnits,latentSize));

Apply the luongAttention function.

[context,scores] = luongAttention(hiddenState,Z,weights);

View the sizes of the outputs.
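For example:

size(context)
size(scores)

The context vector has latentSize elements, and scores contains a single attention weight because there is one key and one query.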

Luong Attention Function

The luongAttention function returns the context vector and attention scores according to the Luong "general" scoring [2]. This operation is equivalent to dot-product attention with queries, keys, and values specified as the hidden state, the weighted latent representation, and the latent representation, respectively.

function [context,scores] = luongAttention(hiddenState,Z,weights)

numHeads = 1;
queries = hiddenState;
keys = pagemtimes(weights,Z);
values = Z;

[context,scores] = attention(queries,keys,values,numHeads, ...
    Scale=1, ...
    DataFormat="CBT");

end

Input Arguments


queries — Queries

dlarray object

Queries, specified as a dlarray object.

queries can have at most one "S" (spatial) or "T" (time) dimension. Any dimensions in queries labeled "U" (unspecified) must be singleton. If queries is an unformatted dlarray object, then specify the data format using the DataFormat option.

The size of the "C" (channel) dimension in keys must match the size of the corresponding dimension in queries.

The size of the "B" (batch) dimension in queries, keys, and values must match.

keys — Keys

dlarray object | numeric array

Keys, specified as a dlarray object or a numeric array.

If keys is a formatted dlarray object, then its format must match the format of queries. If keys is not a formatted dlarray object, then the function uses the same format as queries.

The size of any "S" (spatial) or "T" (time) dimensions in keys must match the size of the corresponding dimension in values.

The size of the "C" (channel) dimension in keys must match the size of the corresponding dimension in queries.

The size of the "B" (batch) dimension in queries, keys, and values must match.

values — Values

dlarray object | numeric array

Values, specified as a dlarray object or a numeric array.

If values is a formatted dlarray object, then its format must match the format of queries. Otherwise, the function uses the same format as queries.

The size of any "S" (spatial) or "T" (time) dimensions in keys must match the size of the corresponding dimension in values.

The size of the "B" (batch) dimension in queries, keys, and values must match.

numHeads — Number of heads

positive integer

Number of heads, specified as a positive integer.

Each head performs a separate linear transformation of the input and computes attention weights independently. The operation uses these attention weights to compute a weighted sum of the input representations, generating a context vector. Increasing the number of heads lets the model capture different types of dependencies and attend to different parts of the input simultaneously. Reducing the number of heads can lower the computational cost of the operation.

The value of numHeads must evenly divide the size of the "C" (channel) dimension of queries, keys, and values.

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: attention(queries,keys,values,numHeads,DataFormat="CBT") applies the attention operation for unformatted data and specifies the data format "CBT" (channel, batch, time).

DataFormat — Description of data dimensions

character vector | string scalar

Description of the data dimensions, specified as a character vector or string scalar.

A data format is a string of characters, where each character describes the type of the corresponding data dimension.

The characters are:

"S" — Spatial
"C" — Channel
"B" — Batch (for example, samples and observations)
"T" — Time (for example, time steps of sequences)
"U" — Unspecified

For example, consider an array containing a batch of sequences where the first, second, and third dimensions correspond to channels, observations, and time steps, respectively. You can specify that this array has the format "CBT" (channel, batch, time).

You can specify multiple dimensions labeled "S" or "U". You can use the labels "C", "B", and "T" at most once each. The software ignores singleton trailing "U" dimensions after the second dimension.

If the input data is not a formatted dlarray object, then you must specify the DataFormat option.

For more information, see Deep Learning Data Formats.

Data Types: char | string

Scale — Multiplicative factor for scaled dot-product attention

"auto" (default) | numeric scalar

Multiplicative factor for scaled dot-product attention [1], specified as one of these values:

"auto" — Multiply the dot product by λ = 1/sqrt(dk), where dk is the number of channels in the keys divided by the number of heads.
Numeric scalar — Multiply the dot product by the specified scale factor.

Data Types: single | double | char | string

PaddingMask — Mask indicating padding values

dlarray object | logical array | binary-valued numeric array

Mask indicating which elements of the input correspond to padding values, specified as a dlarray object, a logical array, or a binary-valued numeric array.

The function allows attention to an element of the input key-value pairs when the corresponding element of PaddingMask is 1, and prevents attention when the corresponding element is 0.

If PaddingMask is a formatted dlarray object, then its format must match that of keys. If PaddingMask is not a formatted dlarray object, then the function uses the same format as keys. The size of the "S" (spatial), "T" (time), and "B" (batch) dimensions in PaddingMask must match the size of the corresponding dimensions in keys and values.

The padding mask can have any number of channels. The software uses the values in the first channel only to indicate padding values.

The default value is a logical array of ones with the same size as keys.
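For example, this sketch (reusing the arrays from the Apply Attention Operation example) masks the last 10 key-value time steps as padding; the mask construction itself is illustrative:

% Mark the last 10 of the key-value time steps as padding (0 = ignore).
mask = ones(querySize,numObservations,numValues);
mask(:,:,numValues-9:end) = 0;

[Y,weights] = attention(queries,keys,values,numHeads,PaddingMask=mask);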

AttentionMask — Attention mask

"none" (default) | "causal" | numeric array | logical array

Attention mask indicating which elements to include when applying the attention operation, specified as one of these values:

"none" — Do not prevent attention to elements with respect to their positions.
"causal" — Prevent elements in position m of the "S" (spatial) or "T" (time) dimension of queries from attending to elements in positions greater than m of the corresponding dimension of keys and values.
Numeric array or logical array — Prevent attention to elements of keys and values when the corresponding element of the array is 0.
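For example, this illustrative call applies causal self-attention by reusing queries from the Apply Attention Operation example as the keys and values:

Y = attention(queries,queries,queries,numHeads,AttentionMask="causal");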

Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64 | logical | char | string

DropoutProbability — Dropout probability

0 (default) | scalar in the range [0, 1)

Dropout probability for the attention weights, specified as a scalar in the range [0, 1).

Data Types: single | double

Output Arguments


Y — Result of attention operation

dlarray object

Result of attention operation, returned as a dlarray object.

If queries is a formatted dlarray object, then Y is a formatted dlarray object with the same dimension labels as queries. The size of the "C" (channel) dimension of Y is the same as the size of the corresponding dimension in values. The size of the "S" (spatial) or "T" (time) dimension of Y is the same as the size of the corresponding dimension in queries.

If queries is not a formatted dlarray object, then Y is an unformatted dlarray object.

weights — Attention weights

unformatted dlarray object

Attention weights, returned as an unformatted dlarray object.

weights is an Nk-by-Nq-by-numHeads-by-numObservations array, where Nk is the size of the "S" (spatial) or "T" (time) dimension of keys, Nq is the size of the corresponding dimension in queries, and numObservations is the size of the "B" (batch) dimension in queries.

Algorithms


Dot-Product Attention

The attention operation focuses on parts of the input using weighted multiplication operations.

The single-head dot-product attention operation is given by

attention(Q,K,V) = V·dropout(softmax(mask(λ·KᵀQ)), p)

where:

Q, K, and V are the queries, keys, and values, respectively.
λ is the scaling factor specified by the Scale option.
p is the dropout probability specified by the DropoutProbability option.

The mask operation includes or excludes values of the matrix multiplication by setting input values that correspond to zero-valued mask elements to −∞. The mask is the union of the padding and attention masks. The softmax function normalizes the values of the input data across the channel dimension so that they sum to one. The dropout operation sets elements to zero with probability p.

Multihead Self-Attention

The multihead self-attention operation for the input X is given by

multiheadSelfAttention(X) = WO·A

where A is the concatenation of the head outputs head_1, …, head_h along the channel dimension, and:

X is the input data.
h is the number of heads.
WQ, WK, WV, and WO are the learnable weights for the queries, keys, values, and output data, respectively.

Each weight matrix is composed of concatenated weight matrices W_i for each head. Each head_i denotes the output of the head operation, given by

head_i = attention(WQ_i·X, WK_i·X, WV_i·X)

where WQ_i, WK_i, and WV_i are the weight matrices for head i.

Deep Learning Array Formats

Most deep learning networks and functions operate on different dimensions of the input data in different ways.

For example, an LSTM operation iterates over the time dimension of the input data, and a batch normalization operation normalizes over the batch dimension of the input data.

To provide input data with labeled dimensions or input data with additional layout information, you can use data formats.

A data format is a string of characters, where each character describes the type of the corresponding data dimension.

The characters are:

"S" — Spatial
"C" — Channel
"B" — Batch (for example, samples and observations)
"T" — Time (for example, time steps of sequences)
"U" — Unspecified

For example, consider an array containing a batch of sequences where the first, second, and third dimensions correspond to channels, observations, and time steps, respectively. You can specify that this array has the format "CBT" (channel, batch, time).

To create formatted input data, create a dlarray object and specify the format using the second argument.

To provide additional layout information with unformatted data, specify the format using the DataFormat argument.
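For example, this sketch shows both approaches for a channel-batch-time array (the sizes and the attention call are illustrative):

% Formatted input: label the dimensions when you create the dlarray.
X = dlarray(rand(10,32,100),"CBT");

% Unformatted input: pass the layout using the DataFormat argument instead.
X = dlarray(rand(10,32,100));
Y = attention(X,X,X,2,DataFormat="CBT");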

For more information, see Deep Learning Data Formats.

References

[1] Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. "Attention is all you need." Advances in neural information processing systems 30 (December 2017): 6000-6010. https://papers.nips.cc/paper/7181-attention-is-all-you-need.

[2] Luong, Minh-Thang, Hieu Pham, and Christopher D. Manning. "Effective approaches to attention-based neural machine translation." arXiv preprint arXiv:1508.04025 (2015).

Extended Capabilities

GPU Arrays

Accelerate code by running on a graphics processing unit (GPU) using Parallel Computing Toolbox™.

The attention function supports GPU array input with these usage notes and limitations:

When at least one of these input arguments is a gpuArray object or a dlarray object with underlying data of type gpuArray, this function runs on the GPU:

queries
keys
values

For more information, see Run MATLAB Functions on a GPU (Parallel Computing Toolbox).

Version History

Introduced in R2022b