removeEmptyDocuments - Remove empty documents from tokenized document array, bag-of-words model, or bag-of-n-grams model - MATLAB ([original](https://in.mathworks.com/help/textanalytics/ref/tokenizeddocument.removeemptydocuments.html))
Remove empty documents from tokenized document array, bag-of-words model, or bag-of-n-grams model
Syntax
newDocuments = removeEmptyDocuments(documents)
newBag = removeEmptyDocuments(bag)
[___,idx] = removeEmptyDocuments(___)
Description
newDocuments = removeEmptyDocuments(documents) removes documents that have no words from documents.
newBag = removeEmptyDocuments(bag) removes documents that have no words or n-grams from the bag-of-words or bag-of-n-grams model bag.
[___,idx] = removeEmptyDocuments(___) also returns the indices of the removed documents.
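To make the meaning of idx concrete, the following is a minimal sketch that compares the built-in result with a hand-rolled filter. It assumes that "empty" means a document with zero tokens, and it uses doclength from Text Analytics Toolbox to count tokens. The names manualIdx and manualDocuments are illustrative only.

```matlab
% Build a 4-by-1 array in which documents 2 and 4 are empty.
documents = tokenizedDocument([
    "an example of a short sentence"
    ""
    "a second short sentence"
    ""]);

% Built-in: remove the empty documents and get their original positions.
[newDocuments, idx] = removeEmptyDocuments(documents);

% Manual sketch (assumption: empty means zero tokens).
numTokens = doclength(documents);           % token count per document
manualIdx = find(numTokens == 0);           % positions of the empty documents
manualDocuments = documents(numTokens > 0); % keep only non-empty documents

% The two approaches should agree on which documents were dropped.
sameIndices = isequal(idx(:), manualIdx(:));
```

The indices refer to positions in the original input, so they can be reused to filter any other array that is aligned with documents.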
Examples
Remove documents containing no words from an array of tokenized documents.
Create an array of tokenized documents which includes empty documents.
documents = tokenizedDocument(["an example of a short sentence"; ""; "a second short sentence"; ""])
documents = 4×1 tokenizedDocument:
6 tokens: an example of a short sentence
0 tokens:
4 tokens: a second short sentence
0 tokens:
Remove the empty documents.
newDocuments = removeEmptyDocuments(documents)
newDocuments = 2×1 tokenizedDocument:
6 tokens: an example of a short sentence
4 tokens: a second short sentence
Remove documents containing no words from a bag-of-words model.
Create a bag-of-words model from an array of tokenized documents.
documents = tokenizedDocument(["An example of a short sentence."; ""; "A second short sentence."; ""]);
bag = bagOfWords(documents)
bag = bagOfWords with properties:
NumWords: 9
Counts: [4×9 double]
Vocabulary: ["An" "example" "of" "a" "short" "sentence" "." "A" "second"]
NumDocuments: 4
Remove the empty documents from the bag-of-words model.
newBag = removeEmptyDocuments(bag)
newBag = bagOfWords with properties:
NumWords: 9
Counts: [2×9 double]
Vocabulary: ["An" "example" "of" "a" "short" "sentence" "." "A" "second"]
NumDocuments: 2
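The same call is described above as working on bag-of-n-grams models, although this page shows no example of it. The sketch below assumes the default behavior of bagOfNgrams, which counts bigrams, and reuses the documents from the bag-of-words example. The variable names ngramBag and newNgramBag are illustrative.

```matlab
% Same four documents as above; documents 2 and 4 are empty.
documents = tokenizedDocument([
    "An example of a short sentence."
    ""
    "A second short sentence."
    ""]);

% Create a bag-of-n-grams model (bigrams by default).
ngramBag = bagOfNgrams(documents);

% Remove the documents that contribute no n-grams, and record which ones.
[newNgramBag, idx] = removeEmptyDocuments(ngramBag);

% Compare the document counts before and after.
numBefore = ngramBag.NumDocuments;
numAfter = newNgramBag.NumDocuments;
```

Only the document count is expected to change here; the vocabulary of n-grams is built from the non-empty documents in both cases.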
Remove documents containing no words from an array, and use the indices of the removed documents to remove the corresponding labels.
Create an array of tokenized documents which includes empty documents.
documents = tokenizedDocument(["an example of a short sentence"; ""; "a second short sentence"; ""])
documents = 4×1 tokenizedDocument:
6 tokens: an example of a short sentence
0 tokens:
4 tokens: a second short sentence
0 tokens:
Create a vector of labels.
labels = ["T"; "F"; "F"; "T"]
labels = 4×1 string "T" "F" "F" "T"
Remove the empty documents and get the indices of the removed documents.
[newDocuments, idx] = removeEmptyDocuments(documents)
newDocuments = 2×1 tokenizedDocument:
6 tokens: an example of a short sentence
4 tokens: a second short sentence
Remove the corresponding labels from labels.
labels(idx) = []
labels = 2×1 string "T" "F"
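In practice, empty documents often appear only after preprocessing rather than in the raw text. The following sketch illustrates that case under stated assumptions: the sample text and labels are made up for illustration, and the second document is assumed to consist entirely of words in the default stop word list used by removeStopWords.

```matlab
% Hypothetical sample data: three short texts with one label each.
textData = [
    "The quick brown fox"
    "of the and"
    "A lazy dog sleeps"];
labels = ["animal"; "noise"; "animal"];

% Tokenize, then strip stop words. A document made up only of
% stop words is left with no tokens after this step.
documents = tokenizedDocument(textData);
documents = removeStopWords(documents);

% Drop the documents that became empty and keep the labels aligned.
[documents, idx] = removeEmptyDocuments(documents);
labels(idx) = [];
```

Removing the labels with the same idx keeps documents and labels the same length, which matters when both are passed on to a classifier.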
Input Arguments
Output Arguments
Output model, returned as a bagOfWords object or a bagOfNgrams object. The type of newBag is the same as the type of bag.
Indices of removed documents, returned as a vector of positive integers.
Version History
Introduced in R2017b