fastTextWordEmbedding - Pretrained fastText word embedding - MATLAB
Pretrained fastText word embedding
Syntax
Description
emb = fastTextWordEmbedding
returns a 300-dimensional pretrained word embedding for 1 million English words.
This function requires the Text Analytics Toolbox™ Model for fastText English 16 Billion Token Word Embedding support package. If this support package is not installed, the function provides a download link.
Examples
Download and install the Text Analytics Toolbox Model for fastText English 16 Billion Token Word Embedding support package.
Type fastTextWordEmbedding at the command line.
If the Text Analytics Toolbox Model for fastText English 16 Billion Token Word Embedding support package is not installed, then the function provides a link to the required support package in the Add-On Explorer. To install the support package, click the link, and then click Install. Check that the installation is successful by typing emb = fastTextWordEmbedding at the command line.
emb = fastTextWordEmbedding
emb =
wordEmbedding with properties:
Dimension: 300
Vocabulary: [1×1000000 string]
If the required support package is installed, then the function returns a wordEmbedding object.
Load a pretrained word embedding using fastTextWordEmbedding. This function requires the Text Analytics Toolbox™ Model for fastText English 16 Billion Token Word Embedding support package. If this support package is not installed, then the function provides a download link.
emb = fastTextWordEmbedding
emb = wordEmbedding with properties:
Dimension: 300
Vocabulary: [1×1000000 string]
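Because the Vocabulary property is a string array, you can check whether particular words are in the embedding vocabulary using base MATLAB functions. This is a minimal sketch; the example words are arbitrary choices for illustration.

```matlab
% Check which of these words the pretrained embedding covers.
% ismember compares against the 1-by-1000000 Vocabulary string array.
words = ["monarchy" "transformer" "notarealword123"];
inVocab = ismember(words,emb.Vocabulary)
```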
Map the words "Italy", "Rome", and "Paris" to vectors using word2vec.
italy = word2vec(emb,"Italy");
rome = word2vec(emb,"Rome");
paris = word2vec(emb,"Paris");
Map the vector italy - rome + paris to a word using vec2word.
word = vec2word(emb,italy - rome + paris)
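To see how close the analogy vector is to a given word, you can compare word vectors directly with cosine similarity. This sketch uses only basic MATLAB operations on the vectors returned by word2vec; the choice of "France" as the comparison word is an assumption for illustration.

```matlab
% Cosine similarity between the analogy vector and the vector for "France".
% Values near 1 indicate the vectors point in similar directions.
france = word2vec(emb,"France");
target = italy - rome + paris;
cosSim = dot(target,france)/(norm(target)*norm(france))
```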
Convert an array of tokenized documents to sequences of word vectors using a pretrained word embedding.
Load a pretrained word embedding using the fastTextWordEmbedding function. This function requires the Text Analytics Toolbox™ Model for fastText English 16 Billion Token Word Embedding support package. If this support package is not installed, then the function provides a download link.
emb = fastTextWordEmbedding;
Load the factory reports data and create a tokenizedDocument array.
filename = "factoryReports.csv";
data = readtable(filename,'TextType','string');
textData = data.Description;
documents = tokenizedDocument(textData);
Convert the documents to sequences of word vectors using doc2sequence. By default, the doc2sequence function left-pads the sequences so that they all have the same length. When converting large collections of documents using a high-dimensional word embedding, padding can require large amounts of memory. To prevent the function from padding the data, set the 'PaddingDirection' option to 'none'. Alternatively, you can control the amount of padding using the 'Length' option.
sequences = doc2sequence(emb,documents,'PaddingDirection','none');
View the sizes of the first 10 sequences. Each sequence is a D-by-S matrix, where D is the embedding dimension and S is the number of word vectors in the sequence.
sequences(1:10)
ans=10×1 cell array
    {300×10 single}
    {300×11 single}
    {300×11 single}
    {300×6  single}
    {300×5  single}
    {300×10 single}
    {300×8  single}
    {300×9  single}
    {300×7  single}
    {300×13 single}
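If a downstream network requires fixed-length input, you can instead use the 'Length' option mentioned above to pad or truncate every sequence to the same size. A brief sketch, where the length value 10 is an arbitrary choice for illustration:

```matlab
% Pad or truncate all sequences to exactly 10 word vectors each,
% so every cell contains a 300-by-10 matrix.
sequencesFixed = doc2sequence(emb,documents,'Length',10);
size(sequencesFixed{1})
```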
Output Arguments
Pretrained word embedding, returned as a wordEmbedding object.
Version History
Introduced in R2018a