replace - Replace substrings in documents - MATLAB

Replace substrings in documents

Syntax

newDocuments = replace(documents,old,new)

Description

newDocuments = replace(documents,old,new) replaces all occurrences of the substring or pattern old in documents with new.

Tip

Use the replace function to replace substrings of the words in documents by specifying substrings or patterns. To replace entire words or n-grams in documents, use the replaceWords and replaceNgrams functions, respectively.
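The distinction matters when the old text also occurs inside longer words. The following is a minimal sketch, not taken from the original page, that contrasts replace with replaceWords on two illustrative documents. It assumes Text Analytics Toolbox is installed; the behavior described in the comments is the expected one and has not been run here.

```matlab
% Illustrative documents (an assumption, chosen so that "an" also
% appears inside the longer word "another").
documents = tokenizedDocument([
    "an extreme example"
    "another extreme example"]);

% replace works on substrings, so the "an" inside "another"
% should be replaced as well.
newSub = replace(documents,"an","the")

% replaceWords matches whole tokens only, so "another"
% should be left unchanged.
newWhole = replaceWords(documents,"an","the")
```

If only whole words should change, replaceWords is usually the safer choice; replace is the one to use when part of a word is the target.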

Examples

Replace words in a document array.

documents = tokenizedDocument([
    "an extreme example"
    "another extreme example"])

documents = 2×1 tokenizedDocument:

3 tokens: an extreme example
3 tokens: another extreme example

newDocuments = replace(documents,"example","sentence")

newDocuments = 2×1 tokenizedDocument:

3 tokens: an extreme sentence
3 tokens: another extreme sentence

Replace substrings of the words.

newDocuments = replace(documents,"ex","X-")

newDocuments = 2×1 tokenizedDocument:

3 tokens: an X-treme X-ample
3 tokens: another X-treme X-ample

Replace digits in documents using a digits pattern.

Create an array of tokenized documents.

textData = [
    "Text Analytics Toolbox provides over 50 functions to analyze text data."
    "The bm25Similarity function measures document similarity."];
documents = tokenizedDocument(textData);

Replace instances of consecutive digits with the token "<NUMBER>" using the replace function. Specify a digits pattern using the digitsPattern function.

pat = digitsPattern;
newDocuments = replace(documents,pat,"<NUMBER>")

newDocuments = 2×1 tokenizedDocument:

12 tokens: Text Analytics Toolbox provides over <NUMBER> functions to analyze text data .
 7 tokens: The bm<NUMBER>Similarity function measures document similarity .

Notice that the function replaces the digits in the token "bm25Similarity".

To replace tokens consisting entirely of digits, use the replace function and specify a pattern that also includes text boundaries. Specify text boundaries using the textBoundary function.

pat = textBoundary + digitsPattern + textBoundary;
newDocuments = replace(documents,pat,"<NUMBER>")

newDocuments = 2×1 tokenizedDocument:

12 tokens: Text Analytics Toolbox provides over <NUMBER> functions to analyze text data .
 7 tokens: The bm25Similarity function measures document similarity .

In this case, the function does not replace the digits in the token "bm25Similarity".
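The two results above suggest that the pattern is matched against each token separately, so a text boundary corresponds to the start or end of a token. The following sketch illustrates that idea with the core MATLAB replace function on a plain string array; the tokens variable is a hypothetical stand-in for the words of a document, and the pattern functions assume R2020b or later.

```matlab
% Hypothetical tokens standing in for the words of a tokenized document.
tokens = ["over" "50" "functions" "bm25Similarity"];

% Match digits anywhere inside a token.
anyDigits = replace(tokens,digitsPattern,"<NUMBER>")

% Match digits only when they span the whole token, from its start
% to its end.
wholeToken = textBoundary + digitsPattern + textBoundary;
onlyNumbers = replace(tokens,wholeToken,"<NUMBER>")
```

With the bounded pattern, "bm25Similarity" should stay intact because its digits are surrounded by letters rather than by the boundaries of the token.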

Input Arguments

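documents - Input documents, specified as a tokenizedDocument array.

old - Substring or pattern to replace, specified as a string array, character vector, cell array of character vectors, or pattern array.

new - New substring, specified as a string array, character vector, or cell array of character vectors.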

Data Types: string | char | cell
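Because both arguments accept arrays, several substrings can presumably be replaced in one call. The sketch below assumes that old and new are paired element by element when they have the same size, as the core string replace function does; this page does not confirm that behavior, so treat the example as an assumption to verify.

```matlab
% Illustrative single document (not from the original page).
documents = tokenizedDocument("an extreme example");

% Assumption: old and new are paired element by element, so
% "extreme" maps to "mild" and "example" maps to "case".
old = ["extreme" "example"];
new = ["mild" "case"];
newDocuments = replace(documents,old,new)
```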

Output Arguments

newDocuments - Output documents, returned as a tokenizedDocument array.

Version History

Introduced in R2017b