erasePunctuation - Erase punctuation from text and documents - MATLAB
Erase punctuation from text and documents
Syntax
Description
[newStr](#d126e22312) = erasePunctuation([str](#d126e22158))
erases punctuation and symbols from the elements of str. The function removes characters that belong to the Unicode punctuation or symbol classes.
[newDocuments](#d126e22338) = erasePunctuation([documents](#d126e22191))
erases punctuation and symbols from documents. If a word is empty after removing punctuation and symbol characters, then the function removes it. For tokenized document input, the function erases punctuation from tokens with type 'punctuation' and 'other'. For example, the function does not erase punctuation and symbol characters from URLs and email addresses.
[newDocuments](#d126e22338) = erasePunctuation([documents](#d126e22191),'TokenTypes',[types](#mw%5Fc3fca0f4-930c-4dcf-95b4-335fed218bd5))
erases punctuation and symbols from only the specified token types.
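As an illustration of the 'TokenTypes' syntax, the following sketch extends erasing to web addresses in addition to punctuation tokens. The input sentence is made up for this example, and the exact tokens that result depend on how tokenizedDocument splits the text, so no output is shown.

```matlab
% Requires Text Analytics Toolbox. The sentence is a made-up example.
documents = tokenizedDocument("See https://www.mathworks.com (the home page)!");

% Default behavior: only 'punctuation' and 'other' tokens are affected,
% so the URL token is left intact.
newDefault = erasePunctuation(documents)

% Also erase punctuation and symbol characters inside web addresses.
newDocuments = erasePunctuation(documents, ...
    'TokenTypes',["punctuation" "web-address"])
```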
Examples
Erase the punctuation from the text in str.
str = "it's one and/or two.";
newStr = erasePunctuation(str)
newStr = "its one andor two"
To insert a space where the "/" symbol is, first use the replace function.
newStr = replace(str,"/"," ")
newStr = "it's one and or two."
newStr = erasePunctuation(newStr)
newStr = "its one and or two"
Erase the punctuation from an array of documents.
documents = tokenizedDocument([ ...
    "An example of a short sentence."
    "Another example... with a URL: https://www.mathworks.com"])
documents = 2×1 tokenizedDocument:
7 tokens: An example of a short sentence .
10 tokens: Another example . . . with a URL : https://www.mathworks.com
newDocuments = erasePunctuation(documents)
newDocuments = 2×1 tokenizedDocument:
6 tokens: An example of a short sentence
6 tokens: Another example with a URL https://www.mathworks.com
Here, the function does not erase the punctuation symbols from the URL.
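To see why the URL is left intact, it can help to inspect the token types that the tokenizer assigned. The sketch below uses tokenDetails on the same documents and displays only a few columns of the resulting table; the column selection is a choice made for this illustration.

```matlab
% Requires Text Analytics Toolbox.
documents = tokenizedDocument([ ...
    "An example of a short sentence."
    "Another example... with a URL: https://www.mathworks.com"]);

% tokenDetails returns a table with one row per token.
tdetails = tokenDetails(documents);

% Show the token text, the document it belongs to, and its detected type.
tdetails(:,{'Token','DocumentNumber','Type'})
```

Tokens whose type is neither 'punctuation' nor 'other' are skipped by erasePunctuation unless you list their type in 'TokenTypes'.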
Input Arguments
Input text, specified as a string array, character vector, or cell array of character vectors.
Example: ["An example of a short sentence."; "A second short sentence."]
Data Types: string | char | cell
Token types to erase punctuation from, specified as a character vector, string array, or a cell array of character vectors containing one or more token types (including custom token types).
The tokenizedDocument and addTypeDetails functions automatically detect the following token types:
- letters: string of letter characters only
- digits: string of digits only
- punctuation: string of punctuation and symbol characters only
- email-address: detected email address
- web-address: detected web address
- hashtag: detected hashtag (starts with "#" character followed by a letter)
- at-mention: detected at-mention (starts with "@" character, followed by 1 to 15 ASCII letter, digit, or underscore characters)
- emoticon: detected emoticon
- emoji: detected emoji
- other: does not belong to the previous types and is not a custom type
To specify your own custom token types when tokenizing, use the 'CustomTokens' or 'RegularExpressions' options in tokenizedDocument. If you do not specify a type for a custom token, then the software sets the corresponding token type to 'custom'.
Data Types: string | char | cell
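The following sketch shows how custom token types interact with erasePunctuation. The sentence and the custom tokens "C++" and "C#" are made up for this illustration. Because no type is given for the custom tokens, they should receive the type 'custom', which the default call does not touch.

```matlab
% Requires Text Analytics Toolbox. Sentence and custom tokens are
% hypothetical examples.
str = "I write C++ and C# daily.";
documents = tokenizedDocument(str,'CustomTokens',["C++" "C#"]);

% Inspect the detected types; the custom tokens have no explicit type.
tdetails = tokenDetails(documents);
tdetails(:,{'Token','Type'})

% Default call: tokens of type 'custom' are not modified.
newDefault = erasePunctuation(documents)

% Opt in to erasing punctuation and symbols from the custom tokens too.
newCustom = erasePunctuation(documents, ...
    'TokenTypes',["punctuation" "other" "custom"])
```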
Output Arguments
Output text, returned as a string array, character vector, or cell array of character vectors. str and newStr have the same data type.
More About
Each Unicode character is assigned a category. The following table summarizes the Unicode punctuation and symbol categories and provides an example character from each category:
Category | Category Code | Number of Characters | Example Character |
---|---|---|---|
Punctuation, Connector | [Pc] | 10 | _ |
Punctuation, Dash | [Pd] | 24 | - |
Punctuation, Close | [Pe] | 73 | ) |
Punctuation, Final quote | [Pf] | 10 | ” |
Punctuation, Initial quote | [Pi] | 12 | “ |
Punctuation, Other | [Po] | 566 | ! |
Punctuation, Open | [Ps] | 75 | ( |
Symbol, Currency | [Sc] | 54 | $ |
Symbol, Modifier | [Sk] | 121 | ^ |
Symbol, Math | [Sm] | 948 | + |
Symbol, Other | [So] | 5855 | ¦ |
For more information, see [1].
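As a quick check of the categories above, the sketch below builds a made-up string that places the ASCII example characters from the table between letters. If the function behaves as described, only the letters should remain.

```matlab
% Requires Text Analytics Toolbox.
% Example characters taken from the table: _ (Pc), - (Pd), ( (Ps),
% ) (Pe), ! (Po), $ (Sc), ^ (Sk), + (Sm). The letters are filler.
str = "a_b-c(d)e!f$g^h+i";

newStr = erasePunctuation(str)

% Check that every remaining character is a letter.
onlyLetters = all(isletter(char(newStr)))
```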
Tips
- For string input, erasePunctuation removes punctuation characters from URLs and HTML tags. This behavior can prevent the functions eraseTags, eraseURLs, and decodeHTMLEntities from working as expected. If you want to use these functions to preprocess your text, then use these functions before using erasePunctuation.
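The ordering in this tip can be sketched as a short preprocessing pipeline. The HTML snippet is a made-up input. The resulting whitespace depends on how each function handles the text it removes, so no output is shown.

```matlab
% Requires Text Analytics Toolbox. The HTML snippet is a made-up input.
str = "<p>Visit https://www.mathworks.com &amp; read more!</p>";

% Run the markup-aware functions first, while tags, URLs, and HTML
% entities still contain the punctuation they rely on.
str = eraseTags(str);
str = eraseURLs(str);
str = decodeHTMLEntities(str);

% Erase punctuation last.
newStr = erasePunctuation(str)
```

Calling erasePunctuation first would strip characters such as "<", "/", and "&", leaving nothing for the other three functions to recognize.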
References
Version History
Introduced in R2017b
Starting in R2018b, for tokenizedDocument input, erasePunctuation, by default, erases punctuation and symbol characters from tokens with type 'punctuation' or 'other' only. This behavior prevents the function from affecting complex tokens such as URLs and email addresses.
In previous versions, erasePunctuation erases punctuation characters from all tokens. To reproduce the previous behavior, use the 'TokenTypes' name-value pair.
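One possible way to approximate the earlier behavior is to pass every token type that occurs in the documents to 'TokenTypes'. This is a sketch under that assumption, not a confirmed equivalent of the pre-R2018b implementation, and the input sentence is made up for this illustration.

```matlab
% Requires Text Analytics Toolbox. The sentence is a made-up example.
documents = tokenizedDocument( ...
    "Email someone@example.com or see https://www.mathworks.com!");

% Collect every token type that occurs in the documents.
tdetails = tokenDetails(documents);
types = string(unique(tdetails.Type));

% Erase punctuation and symbols from tokens of all collected types.
newDocuments = erasePunctuation(documents,'TokenTypes',types)
```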