replaceNgrams
Replace n-grams in documents
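Syntax
newDocuments = replaceNgrams(documents,oldNgrams,newNgrams)
newDocuments = replaceNgrams(documents,oldNgrams,newNgrams,'IgnoreCase',true)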
Description
newDocuments = replaceNgrams(documents,oldNgrams,newNgrams) updates the specified documents by replacing the n-grams oldNgrams with the corresponding n-grams in newNgrams. By default, the function is case sensitive.
newDocuments = replaceNgrams(documents,oldNgrams,newNgrams,'IgnoreCase',true) replaces the n-grams oldNgrams, ignoring case.
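The difference between the two syntaxes can be illustrated with a short sketch. The sample sentences and the variable names caseSensitive and caseInsensitive are made up for illustration and do not come from the original page.

```matlab
% Two documents that spell the same abbreviation with different casing.
% (Illustrative input, not from the original documentation.)
str = [
    "Flying to NY tomorrow."
    "Back from ny today."];
documents = tokenizedDocument(str);

% Default behavior is case sensitive, so only the exact token "NY"
% is expected to match; the second document should be left unchanged.
caseSensitive = replaceNgrams(documents,"NY",["New" "York"])

% With 'IgnoreCase' set to true, the lowercase token "ny" should be
% replaced as well.
caseInsensitive = replaceNgrams(documents,"NY",["New" "York"],'IgnoreCase',true)
```

Comparing the two displayed arrays shows which documents were affected by each call.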
Examples
Use the replaceNgrams function to replace abbreviations with their corresponding expanded forms.
Create an array of tokenized documents.
str = [ ...
    "Currently in Cambridge, MA."
    "Next stop, NY!"];
documents = tokenizedDocument(str)
documents = 2×1 tokenizedDocument:
6 tokens: Currently in Cambridge , MA .
5 tokens: Next stop , NY !
Replace the tokens "MA" and "NY" with "Massachusetts" and ["New" "York"], respectively. If the n-grams have different lengths, you must pad the rows with the empty string "". In this case, you must pad "Massachusetts" with a single empty string "".
oldNgrams = [
    "MA"
    "NY"];
newNgrams = [
    "Massachusetts" ""
    "New" "York"];
documents = replaceNgrams(documents,oldNgrams,newNgrams)
documents = 2×1 tokenizedDocument:
6 tokens: Currently in Cambridge , Massachusetts .
6 tokens: Next stop , New York !
Input Arguments
oldNgrams - N-grams to replace, specified as a string array, character vector, or cell array of character vectors.
If oldNgrams is a string array or cell array, then it has size NumNgrams-by-maxN, where NumNgrams is the number of n-grams, and maxN is the length of the largest n-gram. If oldNgrams is a character vector, then it represents a single word (unigram).
The value of oldNgrams(i,j) is the jth word of the ith n-gram. If the number of words in the ith n-gram is less than maxN, then the remaining entries of the ith row of oldNgrams must be padded with the empty string "".
For example, to specify both the unigram "Massachusetts" and the bigram ["New" "York"], specify the 2-by-2 string array ["Massachusetts" ""; "New" "York"], where "Massachusetts" is padded with a single empty string "".
Data Types: string | char | cell
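As a sketch of the padding rule for oldNgrams, the following example replaces a unigram and a bigram in one call. The input sentences are invented for illustration; only the padding layout is taken from the description above.

```matlab
% Illustrative documents containing a unigram and a bigram to replace.
documents = tokenizedDocument([
    "A flight from New York to Massachusetts."
    "New York is larger than Massachusetts."]);

% oldNgrams is 2-by-2 (NumNgrams-by-maxN). The unigram row is padded
% with the empty string "" so that both rows have maxN = 2 entries.
oldNgrams = [
    "Massachusetts" ""
    "New"           "York"];

% newNgrams has the same number of rows as oldNgrams. Both replacements
% are unigrams here, so no padding is needed.
newNgrams = [
    "MA"
    "NY"];

newDocuments = replaceNgrams(documents,oldNgrams,newNgrams)
```

Because the description also allows a cell array of character vectors, {'Massachusetts' ''; 'New' 'York'} should be an equivalent way to write oldNgrams.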
newNgrams - New n-grams, specified as a string array, character vector, or cell array of character vectors.
If newNgrams is a string array or cell array, then it has size NumNgrams-by-maxN, where NumNgrams is the number of n-grams, and maxN is the length of the largest n-gram. If newNgrams is a character vector, then it represents a single word (unigram).
The value of newNgrams(i,j) is the jth word of the ith n-gram. If the number of words in the ith n-gram is less than maxN, then the remaining entries of the ith row of newNgrams must be padded with the empty string "".
newNgrams must have one row, or the same number of rows as oldNgrams.
For example, to specify both the unigram "Massachusetts" and the bigram ["New" "York"], specify the 2-by-2 string array ["Massachusetts" ""; "New" "York"], where "Massachusetts" is padded with a single empty string "".
Data Types: string | char | cell
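The rule that newNgrams may have a single row can be sketched as follows. This example assumes that a one-row newNgrams is applied to every row of oldNgrams, which is the natural reading of the size requirement above but is not spelled out on this page. The input sentence is invented for illustration.

```matlab
% Illustrative document with two different words to normalize.
documents = tokenizedDocument("The quick brown fox saw a fast red fox.");

% Two old unigrams, one per row.
oldNgrams = [
    "quick"
    "fast"];

% A single-row newNgrams. Assumption: this one n-gram is used as the
% replacement for every row of oldNgrams.
newNgrams = "speedy";

newDocuments = replaceNgrams(documents,oldNgrams,newNgrams)
```

If each old n-gram needs its own replacement, give newNgrams the same number of rows as oldNgrams instead.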
Output Arguments
Version History
Introduced in R2019a