padsequences

Pad or truncate sequence data to same length

Since R2021a

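Syntax

XPad = padsequences(X,paddingDim)
[XPad,mask] = padsequences(X,paddingDim)
[___] = padsequences(X,paddingDim,Name,Value)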

Description

XPad = padsequences(X,paddingDim) pads the sequences in the cell array X along the dimension specified by paddingDim. The function adds padding at the end of each sequence to match the size of the longest sequence in X. The padded sequences are concatenated and the function returns XPad as an array.

[XPad,mask] = padsequences(X,paddingDim) additionally returns a logical array representing the positions of original sequence data in XPad. Values of true (1) in mask correspond to the positions of original sequence data in XPad; values of false (0) correspond to padded values.
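The two syntaxes above can be tried on a small hand-made input. This is a minimal sketch that assumes Deep Learning Toolbox is installed; the toy sequences in X are hypothetical and chosen only to make the padding easy to see.

```matlab
% Two row-vector sequences of different lengths (hypothetical toy data).
X = {[1 2 3], [4 5]};

% Pad along dimension 2, which is the time dimension for row vectors.
[XPad,mask] = padsequences(X,2);

size(XPad)      % 1 3 2: the sequences are stacked along a new third dimension
XPad(:,:,2)     % 4 5 0: the shorter sequence is zero-padded at the end
mask(:,:,2)     % 1 1 0: the padded position is marked false
```

The mask has one entry per element of XPad, so it can be used directly for logical indexing, for example XPad(mask).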

[___] = padsequences(X,paddingDim,Name,Value) specifies options using one or more name-value arguments in addition to the input and output arguments in previous syntaxes. For example, 'Direction','left' adds padding to the beginning of the original sequence.
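As a brief illustration of the name-value syntax, the following sketch pads at the beginning of each sequence instead of the end. It assumes Deep Learning Toolbox is installed, and the toy data is hypothetical.

```matlab
% Two row-vector sequences of different lengths (hypothetical toy data).
X = {[1 2 3], [4 5]};

% Add the padding before the original values rather than after them.
XPad = padsequences(X,2,'Direction','left');

XPad(:,:,2)     % 0 4 5
```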

Examples

Pad sequence data ready for training.

Load the sequence data and view the sizes of the first few sequences. The sequences have different lengths.

load WaveformData
data(1:5)

ans=5×1 cell array
    {103×3 double}
    {136×3 double}
    {140×3 double}
    {124×3 double}
    {127×3 double}

Pad the data with zeros to the same length as the longest sequence. By default, the function applies the padding on the right side of the data, that is, at the end of each sequence. Specify the dimension containing the time steps as the padding dimension. For this example, the dimension is 1.

dataPadded = padsequences(data,1);

Examine the size of the padded sequences.
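The output of this step is not reproduced here. The following sketch shows one way to run the check and confirm the expected layout; it assumes Deep Learning Toolbox is installed and that WaveformData provides the cell array data with one time step per row, as shown above. The variable names sz and longest are introduced for illustration only.

```matlab
load WaveformData                     % provides the cell array "data"
dataPadded = padsequences(data,1);    % pad along the time dimension

sz = size(dataPadded)

% Expected layout: [longest sequence length, number of channels, number of sequences].
longest = max(cellfun(@(x) size(x,1),data));
isequal(sz,[longest size(data{1},2) numel(data)])
```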

Use padsequences to extend or cut each sequence to a fixed length by adding or removing data at both ends of the sequence, depending on the length of the original sequence.

Load the sequence data.

load WaveformData

View the sizes of the first few sequences. The sequences have different lengths.

data(1:10)

ans=10×1 cell array
    {103×3 double}
    {136×3 double}
    {140×3 double}
    {124×3 double}
    {127×3 double}
    {200×3 double}
    {141×3 double}
    {151×3 double}
    {149×3 double}
    {112×3 double}

Process the data so that each sequence is exactly 128 time steps. For shorter sequences, padding is required, while longer sequences need to be truncated. Pad or truncate at both sides of the data. For the padded sequences, apply symmetric padding so that the padded values are mirror reflections of the original sequence values.

[dataPadded,mask] = padsequences(data,1,'Length',128,'Direction','both','PaddingValue','symmetric');

Compare some of the padded sequences with the original sequences. Each observation contains three features (the sequences shown above are numTimeSteps-by-3), so extract a single feature to compare.

View the size of the first observation. This sequence is shorter than 128 time steps.

View the size of the padded array.

The function centers the sequence and pads at both ends by reflecting the values at the ends of the sequence. The mask shows the location of the original sequence values. View the first and last few time steps of the mask.

ans = 20×1 logical array

0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 ⋮

ans = 20×1 logical array

1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 ⋮

View the size of the third observation. This sequence is longer than 128 time steps.

The function centers the sequence and truncates at both ends. The mask shows that all data in the resulting sequence is part of the original sequence. View the first and last few time steps of the mask.

ans = 20×1 logical array

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ⋮

ans = 20×1 logical array

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ⋮
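The padding and truncation behavior described in this example can also be checked on a small synthetic input. This is a sketch under stated assumptions: Deep Learning Toolbox is installed, and the two toy sequences are hypothetical (one shorter and one longer than the target length of 8).

```matlab
% Hypothetical toy data: a 4-step sequence and a 12-step sequence (column vectors).
X = {(1:4)', (1:12)'};

% Force every sequence to 8 time steps, padding or truncating at both ends.
[XPad,mask] = padsequences(X,1,'Length',8, ...
    'Direction','both','PaddingValue','symmetric');

size(XPad)          % 8 1 2
nnz(mask(:,:,1))    % 4: only the original values of the short sequence are marked true
all(mask(:,:,2))    % 1: the truncated long sequence contains only original values
```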

Use the padsequences function in conjunction with minibatchqueue to prepare and preprocess sequence data ready for training using a custom training loop.

The example uses the human activity recognition training data. The data contains six time series of sensor data obtained from a smartphone worn on the body. Each sequence has three features and varies in length. The three features correspond to the accelerometer readings in three different directions.

Load the training data. Combine the data and labels into a single datastore.

s = load("HumanActivityTrain.mat");

dsXTrain = arrayDatastore(s.XTrain,'OutputType','same');
dsYTrain = arrayDatastore(s.YTrain,'OutputType','same');

dsTrain = combine(dsXTrain,dsYTrain);

Use minibatchqueue to process the mini-batches of sequence data. Define a custom mini-batch preprocessing function preprocessMiniBatch (defined at the end of this example) to pad the sequence data and labels, and one-hot encode the label sequences. To also return the mask of the padded data, specify three output variables for the minibatchqueue object.

miniBatchSize = 2;
mbq = minibatchqueue(dsTrain,3,...
    'MiniBatchSize',miniBatchSize,...
    'MiniBatchFcn',@preprocessMiniBatch);

Check the size of the mini-batches.

[X,Y,mask] = next(mbq);
size(X)

Each mini-batch has two observations. The function pads the sequences to the same size as the longest sequence in the mini-batch. The mask is the same size as the padded sequences, and shows the location of the original data values in the padded sequence data.

The padded labels are one-hot encoded into numeric data ready for training.

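function [xPad,yPad,mask] = preprocessMiniBatch(X,Y)
    % Pad the sequences along the time dimension (dimension 2) and return the mask.
    [xPad,mask] = padsequences(X,2);
    % Pad the label sequences to the same length.
    yPad = padsequences(Y,2);
    % One-hot encode the padded labels along dimension 1.
    yPad = onehotencode(yPad,1);
end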
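To see what these preprocessing steps produce without setting up a minibatchqueue, the same calls can be run on a small hand-made mini-batch. This is a sketch only: it assumes Deep Learning Toolbox is installed, and the sequences, label names ("walk", "run"), and the variable yEnc are hypothetical stand-ins for the human activity data.

```matlab
% Hypothetical mini-batch: two sequences with 3 features and 4 or 6 time steps.
X = {rand(3,4), rand(3,6)};

% One categorical label per time step, with the same categories in both sequences.
cats = ["walk" "run"];
Y = {categorical(["walk" "walk" "run" "run"],cats), ...
     categorical(repmat("run",1,6),cats)};

% Same steps as in preprocessMiniBatch.
[xPad,mask] = padsequences(X,2);
yPad = padsequences(Y,2);          % padded labels are <undefined>
yEnc = onehotencode(yPad,1);       % undefined labels are encoded as NaN

size(xPad)      % 3 6 2
size(yEnc)      % 2 6 2
```

Because the padded label positions are encoded as NaN, the mask is what identifies the time steps that should contribute to a loss.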

Input Arguments

X

Sequences to pad, specified as a cell vector of numeric or categorical arrays.

Data Types: cell

paddingDim

Dimension along which to pad input sequence data, specified as a positive integer.

Example: 2

Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: padsequences(X,paddingDim,'Length','shortest','Direction','both') truncates the sequences at each end to match the length of the shortest input sequence.

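Length

Length of padded sequences, specified as one of the following:

- 'longest': Pad each input sequence to the same length as the longest input sequence. This is the default behavior.
- 'shortest': Truncate each input sequence to the same length as the shortest input sequence.
- Positive integer: Pad or truncate each input sequence to the specified length.

Example: padsequences(X,paddingDim,'Length','shortest')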

Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64 | char | string
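A numeric Length can both pad and truncate within the same call. The following sketch assumes Deep Learning Toolbox is installed and uses hypothetical toy data; with the default direction, both operations act on the end of each sequence.

```matlab
% One sequence shorter and one longer than the target length (hypothetical toy data).
X = {[1 2 3], [4 5 6 7 8]};

% Pad or truncate every sequence to exactly 4 time steps.
XPad = padsequences(X,2,'Length',4);

XPad(:,:,1)     % 1 2 3 0: padded at the end
XPad(:,:,2)     % 4 5 6 7: truncated at the end
```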

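Direction

Direction of padding or truncation, specified as one of the following:

- 'right': Pad or truncate at the end of each original sequence. This is the default behavior.
- 'left': Pad or truncate at the beginning of each original sequence.
- 'both': Pad or truncate at the beginning and end of each original sequence, so that the original data is centered.

Example: padsequences(X,paddingDim,'Direction','both')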

Data Types: char | string
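Length and Direction can be combined. As a sketch (assuming Deep Learning Toolbox is installed, with hypothetical toy data), the following call truncates to the shortest sequence and removes values from both ends; because nothing is padded, every mask entry is true.

```matlab
% Hypothetical toy data: a 5-step and a 3-step sequence.
X = {1:5, 1:3};

% Truncate to the shortest length, trimming at both ends of the longer sequence.
[XPad,mask] = padsequences(X,2,'Length','shortest','Direction','both');

size(XPad)          % 1 3 2
all(mask,'all')     % 1: truncation introduces no padded values
```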

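PaddingValue

Value used to pad input, specified as one of the following:

- 'auto': Choose the padding value automatically from the data type of the input sequences. Numeric sequences are padded with 0, and categorical sequences are padded with <undefined>.
- 'symmetric': Pad using a mirror reflection of the original sequence values.
- Numeric scalar: Pad numeric sequences with the specified value.
- Categorical scalar: Pad categorical sequences with the specified value.

Example: padsequences(X,paddingDim,'PaddingValue','symmetric')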

Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64 | categorical
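A numeric scalar replaces the default zero padding. This is a minimal sketch assuming Deep Learning Toolbox is installed; the toy data and the choice of -1 as the padding value are hypothetical.

```matlab
% Two row-vector sequences of different lengths (hypothetical toy data).
X = {[1 2 3], [4 5]};

% Pad with -1 instead of 0 so that padded values are easy to spot.
XPad = padsequences(X,2,'PaddingValue',-1);

XPad(:,:,2)     % 4 5 -1
```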

UniformOutput

Flag to return padded data as a uniform array, specified as a numeric or logical 1 (true) or 0 (false). When you set the value to 0, XPad is returned as a cell vector with the same size and underlying data type as the input X.

Example: padsequences(X,paddingDim,'UniformOutput',0)

Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64 | logical
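The following sketch shows the cell output form. It assumes Deep Learning Toolbox is installed and uses hypothetical toy data.

```matlab
% Two row-vector sequences of different lengths (hypothetical toy data).
X = {[1 2 3], [4 5]};

% Keep the padded sequences in a cell vector instead of concatenating them.
[XPad,mask] = padsequences(X,2,'UniformOutput',false);

class(XPad)     % 'cell'
XPad{2}         % 4 5 0
mask{2}         % 1 1 0
```

The cell form keeps one padded array per input sequence, which can be convenient when later steps process the sequences one at a time.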

Output Arguments

XPad

Padded sequence data, returned as a numeric array, categorical array, or a cell vector of numeric or categorical arrays.

If you set the UniformOutput name-value option to true or 1, the function concatenates the padded sequences along a new last dimension. The last dimension of XPad has the same size as the number of sequences in input X. XPad is an array with N + 1 dimensions, where N is the number of dimensions of the sequence arrays in X. XPad has the same data type as the arrays in input X.

If you set the UniformOutput name-value option to false or 0, the function returns the padded sequences as a cell vector with the same size and underlying data type as the input X.
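The N + 1 rule can be checked directly. In this sketch (assuming Deep Learning Toolbox is installed, with hypothetical random data), each sequence is a 2-D array, so the padded output is 3-D.

```matlab
% Two sequences with 3 features and 5 or 7 time steps (hypothetical data).
X = {rand(3,5), rand(3,7)};

% Pad along the time dimension (dimension 2).
XPad = padsequences(X,2);

ndims(X{1})     % 2
size(XPad)      % 3 7 2: features, longest length, number of sequences
```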

mask

Position of original sequence data in the padded sequences, returned as a logical array or as a cell vector of logical arrays.

mask has the same size as XPad and, like XPad, is returned either as an array or as a cell vector. Values of 1 in mask correspond to positions of original sequence values in XPad. Values of 0 correspond to padded values.

Use mask to exclude padded values from loss calculations using the "Mask" name-value option in the crossentropy function.
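To illustrate why the mask matters for loss calculations, the following sketch computes a masked average by hand. It is not the crossentropy implementation; it only shows the general idea of ignoring padded positions. It assumes Deep Learning Toolbox is installed, and the per-time-step values and the names lossPad, maskedMean, and naiveMean are hypothetical.

```matlab
% Hypothetical per-time-step loss values for two sequences of different lengths.
lossPerStep = {[0.2 0.4 0.6], [0.5 0.1]};

% Pad to a common length and keep the mask of original positions.
[lossPad,mask] = padsequences(lossPerStep,2);

% Average only over original time steps; padded zeros are ignored.
maskedMean = sum(lossPad(mask)) / nnz(mask)     % 0.36

% A plain mean also counts the padded zero and is biased low.
naiveMean = mean(lossPad,'all')                 % 0.30
```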

Version History

Introduced in R2021a