doc2sequence - Convert documents to sequences for deep learning - MATLAB
Convert documents to sequences for deep learning
Syntax
Description
sequences = doc2sequence(enc,documents) returns a cell array of the numeric indices of the words in documents given by the word encoding enc. Each element of sequences is a vector of the indices of the words in the corresponding document.
sequences = doc2sequence(emb,documents) returns a cell array of the embedding vectors of the words in documents given by the word embedding emb. Each element of sequences is a matrix of the embedding vectors of the words in the corresponding document.
sequences = doc2sequence(___,Name,Value) specifies additional options using one or more name-value pair arguments.
Examples
Convert Documents to Sequences of Word Indices
Load the factory reports data and create a tokenizedDocument array.
filename = "factoryReports.csv";
data = readtable(filename,'TextType','string');
textData = data.Description;
documents = tokenizedDocument(textData);
Create a word encoding.
enc = wordEncoding(documents);
Convert the documents to sequences of word indices.
sequences = doc2sequence(enc,documents);
View the first 10 sequences. Each sequence is a 1-by-S vector, where S is the number of word indices in the sequence. Because the sequences are left-padded to the length of the longest sequence by default, S is constant.

sequences(1:10)

ans=10×1 cell array
    {[ 0 0 0 0 0 0 0 1 2 3 4 5 6 7 8 9 10]}
    {[ 0 0 0 0 0 0 11 12 13 14 15 2 16 17 18 19 10]}
    {[ 0 0 0 0 0 0 20 2 21 22 7 23 24 25 7 26 10]}
    {[ 0 0 0 0 0 0 0 0 0 0 0 27 28 6 7 18 10]}
    {[ 0 0 0 0 0 0 0 0 0 0 0 0 29 30 7 31 10]}
    {[ 0 0 0 0 0 0 0 32 33 6 7 34 35 36 37 38 10]}
    {[ 0 0 0 0 0 0 0 0 0 39 40 36 41 6 7 42 10]}
    {[ 0 0 0 0 0 0 0 0 43 44 22 45 46 47 7 48 10]}
    {[ 0 0 0 0 0 0 0 0 0 0 49 50 17 7 51 48 10]}
    {[0 0 0 0 52 8 53 36 54 55 56 57 58 59 22 60 10]}
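To spot-check the encoding, you can map a sequence back to words with the ind2word function of the word encoding. A minimal sketch (the zero entries are padding and have no corresponding word):

```matlab
% Take the first sequence and drop the zero padding values.
seq = sequences{1};
idx = seq(seq > 0);

% Map the remaining indices back to their words using the encoding.
words = ind2word(enc,idx)
```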
Convert Documents to Sequences of Word Vectors
Convert an array of tokenized documents to sequences of word vectors using a pretrained word embedding.
Load a pretrained word embedding using the fastTextWordEmbedding function. This function requires the Text Analytics Toolbox™ Model for fastText English 16 Billion Token Word Embedding support package. If this support package is not installed, then the function provides a download link.
emb = fastTextWordEmbedding;
Load the factory reports data and create a tokenizedDocument array.
filename = "factoryReports.csv";
data = readtable(filename,'TextType','string');
textData = data.Description;
documents = tokenizedDocument(textData);
Convert the documents to sequences of word vectors using doc2sequence. By default, the doc2sequence function left-pads the sequences to have the same length. When converting large collections of documents using a high-dimensional word embedding, padding can require large amounts of memory. To prevent the function from padding the data, set the 'PaddingDirection' option to 'none'. Alternatively, you can control the amount of padding using the 'Length' option.
sequences = doc2sequence(emb,documents,'PaddingDirection','none');
View the sizes of the first 10 sequences. Each sequence is a D-by-S matrix, where D is the embedding dimension, and S is the number of word vectors in the sequence.

sequences(1:10)

ans=10×1 cell array
    {300×10 single}
    {300×11 single}
    {300×11 single}
    {300×6  single}
    {300×5  single}
    {300×10 single}
    {300×8  single}
    {300×9  single}
    {300×7  single}
    {300×13 single}
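Each column of a sequence is the embedding vector of one word of the document, in order. As a sanity check, you can compare the first column of a sequence with the vector that word2vec returns for the corresponding word. A sketch, assuming the first word of the first document is in the embedding's vocabulary (and so is not discarded):

```matlab
% Get the tokens of the first document as a string array.
tokens = string(documents(1));

% word2vec returns a 1-by-D row vector; doc2sequence stores the same
% vector as the first column of the D-by-S sequence matrix.
v = word2vec(emb,tokens(1));
isequal(single(v).',sequences{1}(:,1))
```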
Pad or Truncate Sequences to Specified Length
Convert a collection of documents to sequences of word vectors using a pretrained word embedding, and pad or truncate the sequences to a specified length.
Load a pretrained word embedding using the fastTextWordEmbedding function. This function requires the Text Analytics Toolbox™ Model for fastText English 16 Billion Token Word Embedding support package. If this support package is not installed, then the function provides a download link.
emb = fastTextWordEmbedding;
Load the factory reports data and create a tokenizedDocument array.
filename = "factoryReports.csv";
data = readtable(filename,'TextType','string');
textData = data.Description;
documents = tokenizedDocument(textData);
Convert the documents to sequences of word vectors. Specify to left-pad or truncate the sequences to have length 100.
sequences = doc2sequence(emb,documents,'Length',100);
View the sizes of the first 10 sequences. Each sequence is a D-by-S matrix, where D is the embedding dimension, and S is the number of word vectors in the sequence (the sequence length). Because the sequence length is specified, S is constant.

sequences(1:10)

ans=10×1 cell array
    {300×100 single}
    {300×100 single}
    {300×100 single}
    {300×100 single}
    {300×100 single}
    {300×100 single}
    {300×100 single}
    {300×100 single}
    {300×100 single}
    {300×100 single}
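Fixed-length sequences like these are suitable input for a sequence classification network. The following is a sketch only, assuming the table also has a Category column of labels (as in the factory reports data set); the layer sizes and training options are illustrative, not prescribed:

```matlab
% Hypothetical labels, one category per document.
labels = categorical(data.Category);

layers = [
    sequenceInputLayer(emb.Dimension)
    lstmLayer(100,'OutputMode','last')
    fullyConnectedLayer(numel(categories(labels)))
    softmaxLayer
    classificationLayer];

options = trainingOptions('adam','MaxEpochs',10);
net = trainNetwork(sequences,labels,layers,options);
```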
Input Arguments
emb — Input word embedding
wordEmbedding object
Input word embedding, specified as a wordEmbedding object.
enc — Input word encoding
wordEncoding object
Input word encoding, specified as a wordEncoding object.
documents — Input documents
tokenizedDocument array
Input documents, specified as a tokenizedDocument array.
Name-Value Arguments
Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.
Example: 'Length','shortest' truncates the sequences to have the same length as the shortest sequence.
UnknownWord — Unknown word behavior
'discard' (default) | 'nan'
Unknown word behavior, specified as the comma-separated pair consisting of 'UnknownWord' and one of the following:

- 'discard' – If a word is not in the input map, then discard it.
- 'nan' – If a word is not in the input map, then return a NaN value.
Tip
If you are creating sequences for training a deep learning network with a word embedding, use 'discard'. Do not use sequences with NaN values, because doing so can propagate errors through the network.
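For illustration, the following sketch shows the difference between the two options on a made-up document; the token "qzxv" is hypothetical and assumed not to be in the encoding's vocabulary:

```matlab
% "qzxv" stands in for any out-of-vocabulary word.
newDocument = tokenizedDocument("coolant leak near qzxv assembly");

% With 'discard' (the default), the unknown word is simply dropped.
seqDiscard = doc2sequence(enc,newDocument);

% With 'nan', the unknown word appears as a NaN entry, so the
% sequence length matches the document length.
seqNaN = doc2sequence(enc,newDocument,'UnknownWord','nan');
```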
PaddingDirection — Padding direction
'left' (default) | 'right' | 'none'
Padding direction, specified as the comma-separated pair consisting of 'PaddingDirection' and one of the following:

- 'left' – Pad sequences on the left.
- 'right' – Pad sequences on the right.
- 'none' – Do not pad sequences.
Tip
When converting large collections of data using a high-dimensional word embedding, padding can require large amounts of memory. To prevent the function from adding too much padding, set the 'PaddingDirection' option to 'none' or set 'Length' to a smaller value.
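For example, a sketch of the three padding directions, assuming the enc and documents variables from the earlier examples:

```matlab
seqLeft  = doc2sequence(enc,documents);                            % default: pad on the left
seqRight = doc2sequence(enc,documents,'PaddingDirection','right'); % pad on the right
seqNone  = doc2sequence(enc,documents,'PaddingDirection','none');  % no padding; lengths vary
```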
PaddingValue — Padding value
0 (default) | numeric scalar
Padding value, specified as the comma-separated pair consisting of 'PaddingValue' and a numeric scalar. Do not pad sequences with NaN, because doing so can propagate errors through the network.
Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
Length — Sequence length
'longest' (default) | 'shortest' | positive integer
Sequence length, specified as the comma-separated pair consisting of 'Length' and one of the following:

- 'longest' – Pad sequences to have the same length as the longest sequence.
- 'shortest' – Truncate sequences to have the same length as the shortest sequence.
- Positive integer – Pad or truncate sequences to have the specified length. The function truncates the sequences on the right.
Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64 | char | string
Output Arguments
sequences — Output sequences
cell array
Output sequences, returned as a cell array.
For word embedding input, the _i_th element of sequences is a matrix of the word vectors corresponding to the _i_th input document.

For word encoding input, the _i_th element of sequences is a vector of the word encoding indices corresponding to the _i_th input document.
Tips
- When converting large collections of data using a high-dimensional word embedding, padding can require large amounts of memory. To prevent the function from adding too much padding, set the 'PaddingDirection' option to 'none' or set 'Length' to a smaller value.
Version History
Introduced in R2018b