pattern - Patterns to search and match text - MATLAB

Patterns to search and match text

Since R2020b

Description

A pattern defines rules for matching text with text-searching functions like contains, matches, and extract. You can build a pattern expression using pattern functions, operators, and literal text. For example, MATLAB® release names start with "R", followed by the four-digit year, and then either "a" or "b". Define a pattern to match the format of the release names:

pat = "R" + digitsPattern(4) + ("a"|"b");

Match that pattern in a string:

str = ["String was introduced in R2016b."
       "Pattern was added in R2020b."];
extract(str,pat)

ans = 2x1 string array
    "R2016b"
    "R2020b"
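As a minimal sketch of how the same pattern behaves with other search functions, the following compares matches, which requires the whole string to fit the pattern, with contains, which accepts a match anywhere in the string. The candidate names and variable names are illustrative assumptions, not part of the original example.

```matlab
% Pattern for release names: "R", four digits, then "a" or "b".
pat = "R" + digitsPattern(4) + ("a"|"b");

% Hypothetical candidate names (illustrative only).
names = ["R2020b" "R2021" "r2019a" "R2016bx"];

% matches is true only when the entire string fits the pattern.
isRelease = matches(names,pat);

% contains is true when the pattern occurs anywhere in the string.
hasRelease = contains(names,pat);
```

Only "R2020b" is a full match. "R2016bx" contains a release name but is not itself one, and "r2019a" fails both tests because the literal "R" is case sensitive.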

Creation

Patterns are composed of literal text and other patterns using the +, |, and ~ operators. You can also create common patterns using the functions listed under Object Functions, which use rules often associated with regular expressions.

The pattern function also creates a pattern with the syntax pat = pattern(txt), where txt is literal text that pat matches. This syntax is useful for specifying the pattern type in function argument validation. However, the pattern function is rarely needed for other cases because MATLAB text-matching functions accept text inputs.
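The following sketch illustrates the distinction between literal text and a pattern object. The variable names are assumptions made for illustration, and the arguments block appears only as a comment showing how the type might be declared in a function.

```matlab
% Literal text converted to a pattern object.
pat = pattern("abc");

% A converted literal and a composed expression both have class "pattern".
isLiteralPattern = isa(pat,"pattern");
isComposedPattern = isa("R" + digitsPattern(4),"pattern");

% A plain string is not a pattern until it is converted or combined.
isStringPattern = isa("abc","pattern");

% In a function, the type could be declared in an arguments block, e.g.
%     arguments
%         pat (1,1) pattern
%     end
```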

Object Functions


Search Text

contains Determine if pattern is in strings
matches Determine if pattern matches strings
count Count occurrences of pattern in strings
endsWith Determine if strings end with pattern
startsWith Determine if strings start with pattern

Edit Text

Character-Matching Patterns

Search Rule Patterns

Boundary Patterns

Regular Expression Patterns

Pattern Organization

Examples


Search Text Using Patterns

lettersPattern is a typical character-matching pattern that matches letter characters. Create a pattern that matches one or more letter characters.

txt = ["This" "is a" "1x6" "string" "array" "."];
pat = lettersPattern;

Use contains to determine if characters matched by pat are present in each string. The output logical array shows that the first five strings in txt contain letters, but the sixth string does not.

contains(txt,pat)

ans = 1x6 logical array

   1   1   1   1   1   0

Determine if text starts with the specified pattern. The output logical array shows that four of the strings in txt start with letters, but two strings do not.

startsWith(txt,pat)

ans = 1x6 logical array

   1   1   0   1   1   0

Determine if each string fully matches the specified pattern. The output logical array shows which of the strings in txt contain nothing but letters.

matches(txt,pat)

ans = 1x6 logical array

   1   0   0   1   1   0

Count the number of times a pattern matched. The output numeric array shows how many times lettersPattern matched in each element of txt. Note that lettersPattern matches one or more letters, so a group of consecutive letters is a single match.

count(txt,pat)

ans = 1x6

   1   2   1   1   1   0
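The logical and numeric outputs of these search functions can be used directly as indices. As a small sketch (the variable names onlyLetters and multiWord are assumptions made for illustration), the following filters the same string array:

```matlab
txt = ["This" "is a" "1x6" "string" "array" "."];
pat = lettersPattern;

% Keep only the elements that consist entirely of letters.
onlyLetters = txt(matches(txt,pat));

% Keep the elements that contain more than one run of letters.
multiWord = txt(count(txt,pat) > 1);
```

Here onlyLetters holds "This", "string", and "array", and multiWord holds "is a".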

Edit Text Using Patterns

digitsPattern is a typical character-matching pattern that matches digit characters. Create a pattern that matches digit characters.

txt = ["1 fish" "2 fish" "[1,0,0] fish" "[0,0,1] fish"];
pat = digitsPattern;

Use replace to edit pieces of text that match the pattern.

replace(txt,pat,"#")

ans = 1x4 string
    "# fish"    "# fish"    "[#,#,#] fish"    "[#,#,#] fish"

Create a new piece of text by inserting an "!" character after matched digits.

insertAfter(txt,pat,"!")

ans = 1x4 string
    "1! fish"    "2! fish"    "[1!,0!,0!] fish"    "[0!,0!,1!] fish"

Patterns can be created using the OR operator, |, with text. Erase text matched by the specified pattern.

txt = erase(txt,"," | "]" | "[")

txt = 1x4 string
    "1 fish"    "2 fish"    "100 fish"    "001 fish"

Extract pat from the new text.

extract(txt,pat)

ans = 1x4 string
    "1"    "2"    "100"    "001"
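With no argument, digitsPattern matches a whole run of digits as a single match, which is why "100" is extracted as one piece above. As a brief sketch of the difference, assuming the one-argument form digitsPattern(1) used for fixed-length matches elsewhere on this page:

```matlab
% digitsPattern with no argument matches a whole run of digits at once.
wholeRun = replace("100 fish",digitsPattern,"#");

% digitsPattern(1) matches exactly one digit per match.
perDigit = replace("100 fish",digitsPattern(1),"#");
```

The first call yields "# fish" and the second yields "### fish".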

Count Characters in Text

Use patterns to count the occurrences of individual characters in a piece of text.

txt = "She sells sea shells by the sea shore.";

Create pat as a pattern object that matches individual letters using alphanumericsPattern. Extract the pattern.

pat = alphanumericsPattern(1);
letters = extract(txt,pat);

Display a histogram of the number of occurrences of each letter.

letters = lower(letters);
letters = categorical(letters);
histogram(letters)

Figure: histogram of the number of occurrences of each letter.
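If the counts are needed as numbers rather than as a plot, one possible approach is to read them from the categorical array. This is a sketch that assumes the categories and countcats functions for categorical arrays. The variable names names, counts, and sCount are illustrative.

```matlab
txt = "She sells sea shells by the sea shore.";

% Extract single alphanumeric characters and normalize the case.
letters = lower(extract(txt,alphanumericsPattern(1)));
letters = categorical(letters);

% Category names (one per distinct character) and their counts.
names = categories(letters);
counts = countcats(letters);

% Look up the count for one specific letter.
sCount = counts(strcmp(names,"s"));
```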

Hide Details When Displaying Complicated Patterns

Use maskedPattern to display a variable in place of a complicated pattern expression.

Build a pattern that matches simple arithmetic expressions composed of numbers and arithmetic operators.

mathSymbols = asManyOfPattern(digitsPattern | characterListPattern("+-*/="),1)

mathSymbols = pattern
  Matching:

    asManyOfPattern(digitsPattern | characterListPattern("+-*/="),1)

Build a pattern that uses mathSymbols to match arithmetic expressions with whitespace between characters.

longExpressionPat = asManyOfPattern(mathSymbols + whitespacePattern) + mathSymbols

longExpressionPat = pattern
  Matching:

    asManyOfPattern(asManyOfPattern(digitsPattern | characterListPattern("+-*/="),1) + whitespacePattern) + asManyOfPattern(digitsPattern | characterListPattern("+-*/="),1)

The displayed pattern expression is long and difficult to read. Use maskedPattern to display the variable name, mathSymbols, in place of the pattern expression.

mathSymbols = maskedPattern(mathSymbols);
shortExpressionPat = asManyOfPattern(mathSymbols + whitespacePattern) + mathSymbols

shortExpressionPat = pattern
  Matching:

    asManyOfPattern(mathSymbols + whitespacePattern) + mathSymbols

  Use details to show more information

Create a string containing some arithmetic expressions, and then extract the pattern from the text.

txt = "What is the answer to 1 + 1? Oh, I know! 1 + 1 = 2!";
arithmetic = extract(txt,shortExpressionPat)

arithmetic = 2x1 string
    "1 + 1"
    "1 + 1 = 2"
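Masking is intended to change only how a pattern is displayed. As a quick sanity check under that assumption, the sketch below builds the expression pattern both ways and compares the extracted text. The names unmaskedPat, masked, maskedPat, and sameResult are hypothetical.

```matlab
% Unmasked building block and the expression pattern built from it.
mathSymbols = asManyOfPattern(digitsPattern | characterListPattern("+-*/="),1);
unmaskedPat = asManyOfPattern(mathSymbols + whitespacePattern) + mathSymbols;

% Same construction, using a masked copy of the building block.
masked = maskedPattern(mathSymbols);
maskedPat = asManyOfPattern(masked + whitespacePattern) + masked;

% Both patterns should extract identical text.
txt = "What is the answer to 1 + 1? Oh, I know! 1 + 1 = 2!";
sameResult = isequal(extract(txt,unmaskedPat),extract(txt,maskedPat));
```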

Specify Names and Descriptions for Complicated Patterns

Create a pattern from two named patterns. Naming patterns adds context to the display of the pattern.

Build two patterns: one that matches words that begin and end with the letter D, and one that matches words that begin and end with the letter R.

dWordsPat = letterBoundary + caseInsensitivePattern("d" + lettersPattern + "d") + letterBoundary;
rWordsPat = letterBoundary + caseInsensitivePattern("r" + lettersPattern + "r") + letterBoundary;
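The letterBoundary terms are what restrict these patterns to whole words. The following sketch contrasts the D-word rule with and without them. The sample text and the names core, bounded, looseMatches, and wordMatches are assumptions made for illustration.

```matlab
% Same word rule as above, with and without the boundary requirement.
core = caseInsensitivePattern("d" + lettersPattern + "d");
bounded = letterBoundary + core + letterBoundary;

txt = "an added dud.";   % hypothetical sample text

% Without boundaries, the rule can match inside a longer word.
looseMatches = extract(txt,core);

% With boundaries, only whole words that start and end with D match.
wordMatches = extract(txt,bounded);
```

Without the boundaries, the rule also picks up "dded" from inside "added". With them, only "dud" is returned.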

Build a pattern from these two patterns that finds a word that starts and ends with D followed by a word that starts and ends with R.

dAndRWordsPat = dWordsPat + whitespacePattern + rWordsPat

dAndRWordsPat = pattern
  Matching:

    letterBoundary + caseInsensitivePattern("d" + lettersPattern + "d") + letterBoundary + whitespacePattern + letterBoundary + caseInsensitivePattern("r" + lettersPattern + "r") + letterBoundary

This pattern is hard to read and does not convey much information about its purpose. Use namedPattern to designate the patterns as named patterns that display specified names and descriptions in place of the pattern expressions.

dWordsPat = namedPattern(dWordsPat,"dWords", "Words that start and end with D");
rWordsPat = namedPattern(rWordsPat,"rWords", "Words that start and end with R");
dAndRWordsPat = dWordsPat + whitespacePattern + rWordsPat

dAndRWordsPat = pattern
  Matching:

    dWords + whitespacePattern + rWords

  Using named patterns:

    dWords: Words that start and end with D
    rWords: Words that start and end with R

  Use details to show more information

Create a string and extract the text that matches the pattern.

txt = "Dad, look at the divided river!";
words = extract(txt,dAndRWordsPat)

words = "divided river"

Match Email Addresses

Build an easy-to-read pattern to match email addresses.

Email addresses follow the structure username@domain.TLD, where username and domain are made up of identifiers separated by periods. Build a pattern that matches identifiers composed of any combination of alphanumeric characters and "_" characters. Use maskedPattern to name this pattern identifier.

identifier = asManyOfPattern(alphanumericsPattern(1) | "_", 1);
identifier = maskedPattern(identifier);

Build patterns to match domains and subdomains composed of identifiers. Create a pattern that matches TLDs from a specified list.

subdomain = asManyOfPattern(identifier + ".") + identifier;
domainName = namedPattern(identifier,"domainName");
tld = "com" | "org" | "gov" | "net" | "edu";

Build a pattern for matching the local part of an email, which matches one or more identifiers separated by periods. Build a pattern for matching the domain, TLD, and any potential subdomains by combining the previously defined patterns. Use namedPattern to assign each of these patterns to a named pattern.

username = asManyOfPattern(identifier + ".") + identifier;
domain = optionalPattern(namedPattern(subdomain) + ".") + ...
            domainName + "." + ...
            namedPattern(tld);

Combine all of the patterns into a single pattern expression. Use namedPattern to assign username, domain, and emailPattern to named patterns.

emailAddress = namedPattern(username) + "@" + namedPattern(domain);
emailPattern = namedPattern(emailAddress)

emailPattern = pattern
  Matching emailAddress:

    username + "@" + domain

  Using named patterns:

    emailAddress  : username + "@" + domain
      username    : asManyOfPattern(identifier + ".") + identifier
      domain      : optionalPattern(subdomain + ".") + domainName + "." + tld
        subdomain : asManyOfPattern(identifier + ".") + identifier
        domainName: identifier
        tld       : "com" | "org" | "gov" | "net" | "edu"

  Use details to show more information

Create a string that contains an email address, and then extract the pattern from the text.

txt = "You can reach me by email at John.Smith@department.organization.org";
extract(txt,emailPattern)

ans = "John.Smith@department.organization.org"

Named patterns allow dot-indexing in order to access named subpatterns. Use dot-indexing to assign a specific value to the named pattern domain.

emailPattern.emailAddress.domain = "mathworks.com"

emailPattern = pattern
  Matching emailAddress:

    username + "@" + domain

  Using named patterns:

    emailAddress: username + "@" + domain
      username  : asManyOfPattern(identifier + ".") + identifier
      domain    : "mathworks.com"

  Use details to show more information
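With the domain fixed in this way, the pattern should match only addresses at that domain. The sketch below rebuilds the pattern from the steps above so that it runs on its own, then applies it to a hypothetical sentence. The sample addresses and the variable found are assumptions made for illustration.

```matlab
% Rebuild the email pattern as in the preceding steps.
identifier = maskedPattern(asManyOfPattern(alphanumericsPattern(1) | "_",1));
subdomain = asManyOfPattern(identifier + ".") + identifier;
domainName = namedPattern(identifier,"domainName");
tld = "com" | "org" | "gov" | "net" | "edu";
username = asManyOfPattern(identifier + ".") + identifier;
domain = optionalPattern(namedPattern(subdomain) + ".") + ...
    domainName + "." + ...
    namedPattern(tld);
emailAddress = namedPattern(username) + "@" + namedPattern(domain);
emailPattern = namedPattern(emailAddress);

% Restrict the named subpattern domain to a single literal value.
emailPattern.emailAddress.domain = "mathworks.com";

% Hypothetical text with one address inside and one outside that domain.
txt = "Write to jane.doe@mathworks.com or bob@example.org";
found = extract(txt,emailPattern);
```

Only the address at mathworks.com is extracted. The address at example.org no longer fits the pattern.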

Extended Capabilities

Thread-Based Environment

Run code in the background using MATLAB® backgroundPool or accelerate code with Parallel Computing Toolbox™ ThreadPool.
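As a rough sketch of background execution, the following schedules a pattern search with parfeval on backgroundPool. It assumes a MATLAB release that provides backgroundPool and that extract is supported in thread-based environments. The variable names f and release are illustrative.

```matlab
pat = "R" + digitsPattern(4) + ("a"|"b");
txt = "Pattern was added in R2020b.";

% Schedule extract on the background pool, requesting one output.
f = parfeval(backgroundPool,@extract,1,txt,pat);

% Block until the result is available, then retrieve it.
release = fetchOutputs(f);
```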

Version History

Introduced in R2020b