replaceNgrams - Replace n-grams in documents - MATLAB

Replace n-grams in documents

Syntax

Description

newDocuments = replaceNgrams(documents,oldNgrams,newNgrams) updates the specified documents by replacing the n-grams oldNgrams with the corresponding n-grams in newNgrams. By default, the function is case sensitive.

newDocuments = replaceNgrams(documents,oldNgrams,newNgrams,'IgnoreCase',true) replaces the n-grams oldNgrams, ignoring case.

Examples

Use the replaceNgrams function to replace abbreviations with their corresponding expanded forms.

Create an array of tokenized documents.

    str = [ ...
        "Currently in Cambridge, MA."
        "Next stop, NY!"];
    documents = tokenizedDocument(str)

documents = 2×1 tokenizedDocument:

6 tokens: Currently in Cambridge , MA .
5 tokens: Next stop , NY !

Replace the tokens "MA" and "NY" with "Massachusetts" and ["New" "York"], respectively. If the n-grams have different lengths, you must pad the shorter rows with the empty string "". In this case, you must pad "Massachusetts" with a single empty string "".

    oldNgrams = [
        "MA"
        "NY"];
    newNgrams = [
        "Massachusetts" ""
        "New"           "York"];
    documents = replaceNgrams(documents,oldNgrams,newNgrams)

documents = 2×1 tokenizedDocument:

6 tokens: Currently in Cambridge , Massachusetts .
6 tokens: Next stop , New York !
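
The example above matches tokens case sensitively. As a minimal sketch of the 'IgnoreCase' option described earlier, the following replaces the same abbreviations regardless of their case. The sample sentences and the variable name newDocuments are illustrative assumptions, not part of the original example.

```matlab
% Illustrative sentences (assumed) that mix lowercase and uppercase abbreviations.
str = [
    "I flew from ny to MA."
    "NY is bigger than ma."];
documents = tokenizedDocument(str);

% Old n-grams given in lowercase; each row is one n-gram.
oldNgrams = [
    "ny"
    "ma"];

% New n-grams; pad the shorter row with the empty string "".
newNgrams = [
    "New"           "York"
    "Massachusetts" ""];

% With 'IgnoreCase',true, "ny" should match both "ny" and "NY".
newDocuments = replaceNgrams(documents,oldNgrams,newNgrams,'IgnoreCase',true)
```

Without the 'IgnoreCase' option, only the tokens that exactly match "ny" and "ma" would be replaced.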

Input Arguments

oldNgrams - N-grams to replace, specified as a string array, character vector, or cell array of character vectors.

If oldNgrams is a string array or cell array, then it has size NumNgrams-by-maxN, where NumNgrams is the number of n-grams and maxN is the length of the longest n-gram. If oldNgrams is a character vector, then it represents a single word (unigram).

The value of oldNgrams(i,j) is the jth word of the ith n-gram. If the number of words in the ith n-gram is less than maxN, then the remaining entries of the ith row of oldNgrams must be padded with the empty string "".

For example, to specify both the unigram "Massachusetts" and the bigram ["New" "York"], specify the 2-by-2 string array ["Massachusetts" ""; "New" "York"], where "Massachusetts" is padded with a single empty string "".
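
As a sketch of this padding rule applied to oldNgrams, the following replaces a bigram and a unigram in one call. The sample sentence and the variable name newDocuments are illustrative assumptions. Here oldNgrams is padded because its rows differ in length, while newNgrams needs no padding because both replacements are single words.

```matlab
% Illustrative sentence (assumed) containing a bigram and a unigram to replace.
documents = tokenizedDocument("She moved from New York to MA last year.");

% oldNgrams is 2-by-2: the unigram "MA" is padded with "" to match the bigram.
oldNgrams = [
    "New" "York"
    "MA"  ""];

% newNgrams is 2-by-1: row i replaces row i of oldNgrams.
newNgrams = [
    "NY"
    "Massachusetts"];

newDocuments = replaceNgrams(documents,oldNgrams,newNgrams)
```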

Data Types: string | char | cell

newNgrams - New n-grams, specified as a string array, character vector, or cell array of character vectors.

If newNgrams is a string array or cell array, then it has size NumNgrams-by-maxN, where NumNgrams is the number of n-grams and maxN is the length of the longest n-gram. If newNgrams is a character vector, then it represents a single word (unigram).

The value of newNgrams(i,j) is the jth word of the ith n-gram. If the number of words in the ith n-gram is less than maxN, then the remaining entries of the ith row of newNgrams are empty.

newNgrams must have one row, or the same number of rows as oldNgrams.

For example, to specify both the unigram "Massachusetts" and the bigram ["New" "York"], specify the 2-by-2 string array ["Massachusetts" ""; "New" "York"], where "Massachusetts" is padded with a single empty string "".
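
The one-row case can be sketched as follows. This assumes that when newNgrams has a single row, that n-gram is used as the replacement for every row of oldNgrams; the page states only the size requirement, so treat this reading as an assumption. The sample sentences and the variable name newDocuments are illustrative.

```matlab
% Illustrative sentences (assumed) that refer to the same place in two ways.
str = [
    "I love NY in the fall."
    "The Big Apple never sleeps."];
documents = tokenizedDocument(str);

% Two old n-grams of different lengths; pad the unigram with "".
oldNgrams = [
    "NY"  ""
    "Big" "Apple"];

% A single new n-gram (one row), assumed to replace every old n-gram.
newNgrams = ["New" "York"];

newDocuments = replaceNgrams(documents,oldNgrams,newNgrams)
```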

Data Types: string | char | cell

Output Arguments

Version History

Introduced in R2019a