replace - Replace substrings in documents - MATLAB
Replace substrings in documents
Syntax
newDocuments = replace(documents,old,new)
Description
newDocuments = replace(documents,old,new) replaces all occurrences of the substring or pattern old in documents with new.
Tip
Use the replace function to replace substrings of the words in documents by specifying substrings or patterns. To replace entire words and n-grams in documents, use the replaceWords and replaceNgrams functions, respectively.
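The difference between substring replacement and whole-word replacement can be seen in a minimal sketch. The sample sentence is made up for illustration, and the example assumes Text Analytics Toolbox is installed.

```matlab
% Hypothetical sample document containing both "extreme" and "extremes".
documents = tokenizedDocument("an extreme example of extremes");

% replace matches substrings, so the "extreme" inside "extremes" also changes.
subDocs = replace(documents,"extreme","mild")

% replaceWords matches whole tokens only, so "extremes" is left untouched.
wordDocs = replaceWords(documents,"extreme","mild")
```

Use replace when a match inside a longer token should also change, and replaceWords when only exact tokens should change.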
Examples
Replace Substrings in Documents
Replace words in a document array.
documents = tokenizedDocument([
    "an extreme example"
    "another extreme example"])
documents = 2x1 tokenizedDocument:
3 tokens: an extreme example
3 tokens: another extreme example
newDocuments = replace(documents,"example","sentence")
newDocuments = 2x1 tokenizedDocument:
3 tokens: an extreme sentence
3 tokens: another extreme sentence
Replace substrings of the words.
newDocuments = replace(documents,"ex","X-")
newDocuments = 2x1 tokenizedDocument:
3 tokens: an X-treme X-ample
3 tokens: another X-treme X-ample
Replace Substrings in Documents Using Patterns
Replace digits in documents using a digits pattern.
Create an array of tokenized documents.
textData = [
    "Text Analytics Toolbox provides over 50 functions to analyze text data."
    "The bm25Similarity function measures document similarity."];
documents = tokenizedDocument(textData);
Replace instances of consecutive digits with the token "<NUMBER>" using the replace function. Specify a digits pattern using the digitsPattern function.
pat = digitsPattern;
newDocuments = replace(documents,pat,"<NUMBER>")
newDocuments = 2x1 tokenizedDocument:
12 tokens: Text Analytics Toolbox provides over <NUMBER> functions to analyze text data .
7 tokens: The bm<NUMBER>Similarity function measures document similarity .
Notice that the function also replaces the digits inside the token "bm25Similarity".
To replace only tokens that consist entirely of digits, use the replace function and specify a pattern that also includes text boundaries. Specify text boundaries using the textBoundary function.
pat = textBoundary + digitsPattern + textBoundary;
newDocuments = replace(documents,pat,"<NUMBER>")
newDocuments = 2x1 tokenizedDocument:
12 tokens: Text Analytics Toolbox provides over <NUMBER> functions to analyze text data .
7 tokens: The bm25Similarity function measures document similarity .
In this case, the function does not replace the digits in the token "bm25Similarity".
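Patterns can also combine literal text with pattern functions. The following is a minimal sketch, not part of the original examples: the sample sentence and the "<RELEASE>" placeholder are hypothetical, and it assumes a release name is tokenized as a single token such as "R2017b".

```matlab
% Hypothetical sample text containing a release name.
textData = "The replace function was introduced in R2017b.";
documents = tokenizedDocument(textData);

% Literal "R", then exactly four digits, then exactly one letter.
pat = "R" + digitsPattern(4) + lettersPattern(1);

% Replace every match with a placeholder (the placeholder text is arbitrary).
newDocuments = replace(documents,pat,"<RELEASE>")
```

Adding textBoundary on both sides of the pattern, as in the example above, restricts the match to tokens that consist of a release name and nothing else.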
Input Arguments
documents — Input documents
tokenizedDocument array
Input documents, specified as a tokenizedDocument array.
old — Substring or pattern to replace
string array | character vector | cell array of character vectors | pattern array
Substring or pattern to replace, specified as one of the following:
- String array
- Character vector
- Cell array of character vectors
- pattern array
new — New substring
string array | character vector | cell array of character vectors
New substring, specified as a string array, character vector, or cell array of character vectors.
Data Types: string | char | cell
Version History
Introduced in R2017b