replace - Replace substrings in documents - MATLAB

Replace substrings in documents

Syntax

newDocuments = replace(documents,old,new)

Description

newDocuments = replace(documents,old,new) replaces all occurrences of the substring or pattern old in documents with new.

Tip

Use the replace function to replace substrings of the words in documents by specifying substrings or patterns. To replace entire words and n-grams in documents, use the replaceWords and replaceNgrams functions respectively.
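As a rough illustration of the difference described in this tip, the sketch below applies replace and replaceWords to the same document. The sample sentence and the variable names subDocs and wordDocs are made up for this illustration, and it assumes Text Analytics Toolbox is installed.

```matlab
% Hypothetical sample document, chosen so that "extreme" occurs both as
% a whole token and as part of a longer token ("extremes").
documents = tokenizedDocument("an extreme example of extremes");

% replace works on substrings inside each token, so it also rewrites
% the "extreme" part of the token "extremes".
subDocs = replace(documents,"extreme","odd");

% replaceWords only swaps tokens that match the whole word, so the
% token "extremes" is left alone.
wordDocs = replaceWords(documents,"extreme","odd");

% Join the tokens back into strings to compare the two results.
disp(joinWords(subDocs))
disp(joinWords(wordDocs))
```

Comparing the two displayed strings should show the longer token changed only in the replace result.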

Examples

Replace Substrings in Documents

Replace words in a document array.

documents = tokenizedDocument([
    "an extreme example"
    "another extreme example"])

documents = 2x1 tokenizedDocument:

3 tokens: an extreme example
3 tokens: another extreme example

newDocuments = replace(documents,"example","sentence")

newDocuments = 2x1 tokenizedDocument:

3 tokens: an extreme sentence
3 tokens: another extreme sentence

Replace substrings of the words.

newDocuments = replace(documents,"ex","X-")

newDocuments = 2x1 tokenizedDocument:

3 tokens: an X-treme X-ample
3 tokens: another X-treme X-ample

Replace Substrings in Documents Using Patterns

Remove digits from a document using a digits pattern.

Create an array of tokenized documents.

textData = [
    "Text Analytics Toolbox provides over 50 functions to analyze text data."
    "The bm25Similarity function measures document similarity."];
documents = tokenizedDocument(textData);

Replace instances of consecutive digits with the token "<NUMBER>" using the replace function. Specify a digits pattern using the digitsPattern function.

pat = digitsPattern;
newDocuments = replace(documents,pat,"<NUMBER>")

newDocuments = 2x1 tokenizedDocument:

12 tokens: Text Analytics Toolbox provides over <NUMBER> functions to analyze text data .
 7 tokens: The bm<NUMBER>Similarity function measures document similarity .

Notice that the function replaces the digits in the token "bm25Similarity".

To replace tokens consisting entirely of digits, use the replace function and specify a pattern that also includes text boundaries. Specify text boundaries using the textBoundary function.

pat = textBoundary + digitsPattern + textBoundary;
newDocuments = replace(documents,pat,"<NUMBER>")

newDocuments = 2x1 tokenizedDocument:

12 tokens: Text Analytics Toolbox provides over <NUMBER> functions to analyze text data .
 7 tokens: The bm25Similarity function measures document similarity .

In this case, the function does not replace the digits in the token "bm25Similarity".
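One way to see why the boundary pattern leaves "bm25Similarity" untouched is to try the same pattern on individual strings, where each string plays the role of one token. This is only a sketch: the tokens array is invented for illustration, and it relies on the pattern functions (digitsPattern, textBoundary), which require R2020b or later.

```matlab
% Hypothetical stand-ins for individual tokens.
tokens = ["50" "bm25Similarity" "2017b"];

% Match digits only when they span the whole text, start to end.
pat = textBoundary + digitsPattern + textBoundary;

% Only the element made up entirely of digits matches the pattern;
% elements that mix digits and letters are returned unchanged.
result = replace(tokens,pat,"<NUMBER>");
disp(result)
```

The document example above behaves as if each token were its own piece of text, which is consistent with what this string-level sketch shows.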

Input Arguments

documents — Input documents

tokenizedDocument array

Input documents, specified as a tokenizedDocument array.

old — Substring or pattern to replace

string array | character vector | cell array of character vectors | pattern array

Substring or pattern to replace, specified as a string array, character vector, cell array of character vectors, or pattern array.

new — New substring

string array | character vector | cell array of character vectors

New substring, specified as a string array, character vector, or cell array of character vectors.

Data Types: string | char | cell

Version History

Introduced in R2017b