regexp - Match regular expression (case sensitive) - MATLAB
Match regular expression (case sensitive)
Syntax
Description
startIndex = regexp(str,expression)
returns the starting index of each substring of str that matches the character patterns specified by the regular expression. If there are no matches, startIndex is an empty array. If there are substrings that match overlapping pieces of text, only the index of the first match is returned.
[startIndex,endIndex] = regexp(str,expression)
returns the starting and ending indices of all matches.
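As a minimal sketch of the two-output syntax, the start and end indices can be used to slice the matching text directly out of the input. The variable name firstMatch is illustrative only.

```matlab
% Sample text and pattern reused from the first example on this page.
str = 'bat cat can car coat court CUT ct CAT-scan';
expression = 'c[aeiou]+t';

% Request both the start and end index of every match.
[startIndex,endIndex] = regexp(str,expression);

% Slice the first matching substring out of str using the index pair.
firstMatch = str(startIndex(1):endIndex(1));
```

Here startIndex is [5 17] and endIndex is [7 20], so firstMatch is 'cat'.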
out = regexp(str,expression,outkey)
returns the output specified by outkey. For example, if outkey is 'match', then regexp returns the substrings that match the expression rather than their starting indices.
[out1,...,outN] = regexp(str,expression,outkey1,...,outkeyN)
returns the outputs specified by multiple output keywords, in the specified order. For example, if you specify 'match','tokens', then regexp returns substrings that match the entire expression and tokens that match parts of the expression.
___ = regexp(___,option1,...,optionM)
modifies the search using the specified option flags. For example, specify 'ignorecase' to perform a case-insensitive match. You can include any of the inputs and request any of the outputs from previous syntaxes.
___ = regexp(___,'forceCellOutput')
returns each output argument as a scalar cell. The cells contain the numeric arrays or substrings that are described as the outputs of the previous syntaxes. You can include any of the inputs and request any of the outputs from previous syntaxes.
Examples
Find words that start with c, end with t, and contain one or more vowels between them.
str = 'bat cat can car coat court CUT ct CAT-scan';
expression = 'c[aeiou]+t';
startIndex = regexp(str,expression)
The regular expression 'c[aeiou]+t' specifies this pattern:
- c must be the first character.
- c must be followed by one of the characters inside the brackets, [aeiou].
- The bracketed pattern must occur one or more times, as indicated by the + operator.
- t must be the last character, with no characters between the bracketed pattern and the t.
Values in startIndex indicate the index of the first character of each word that matches the regular expression. The matching word cat starts at index 5, and coat starts at index 17. The words CUT and CAT do not match because they are uppercase.
Find the location of capital letters and spaces within character vectors in a cell array.
str = {'Madrid, Spain','Romeo and Juliet','MATLAB is great'};
capExpr = '[A-Z]';
spaceExpr = '\s';
capStartIndex = regexp(str,capExpr);
spaceStartIndex = regexp(str,spaceExpr);
capStartIndex and spaceStartIndex are cell arrays because the input str is a cell array.
View the indices for the capital letters.
celldisp(capStartIndex)
capStartIndex{1} =
1 9
capStartIndex{2} =
1 11
capStartIndex{3} =
1 2 3 4 5 6
View the indices for the spaces.
celldisp(spaceStartIndex)
spaceStartIndex{1} =
8
spaceStartIndex{2} =
6 10
spaceStartIndex{3} =
7 10
Capture words within a character vector that contain the letter x.
str = 'EXTRA! The regexp function helps you relax.';
expression = '\w*x\w*';
matchStr = regexp(str,expression,'match')
matchStr = 1×2 cell {'regexp'} {'relax'}
The regular expression '\w*x\w*' specifies that the character vector:
- Begins with any number of alphanumeric or underscore characters, \w*.
- Contains the lowercase letter x.
- Ends with any number of alphanumeric or underscore characters after the x, including none, as indicated by \w*.
Split a character vector into several substrings, where each substring is delimited by a ^ character.
str = ['Split ^this text into ^several pieces'];
expression = '\^';
splitStr = regexp(str,expression,'split')
splitStr = 1×3 cell {'Split '} {'this text into '} {'several pieces'}
Because the caret symbol has special meaning in regular expressions, precede it with the escape character, a backslash (\). To split a character vector at other delimiters, such as a semicolon, you do not need to include the backslash.
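The difference between an escaped and an unescaped delimiter can be sketched as follows. The sample strings and variable names are illustrative assumptions, not taken from the original example.

```matlab
% A semicolon has no special meaning in a regular expression,
% so it can be used as a delimiter without a backslash.
parts = regexp('alpha;beta;gamma',';','split');

% A caret is a metacharacter, so it must be escaped to be
% treated as a literal delimiter.
caretParts = regexp('a^b^c','\^','split');
```

Both calls return a 1×3 cell array: parts holds 'alpha', 'beta', and 'gamma', and caretParts holds 'a', 'b', and 'c'.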
Capture parts of a character vector that match a regular expression using the 'match' keyword, and the remaining parts that do not match using the 'split' keyword.
str = 'She sells sea shells by the seashore.';
expression = '[Ss]h.';
[match,noMatch] = regexp(str,expression,'match','split')
match = 1×3 cell {'She'} {'she'} {'sho'}
noMatch = 1×4 cell {0×0 char} {' sells sea '} {'lls by the sea'} {'re.'}
The regular expression '[Ss]h.' specifies that:
- S or s is the first character.
- h is the second character.
- The third character can be anything, including a space, as indicated by the dot (.).
When the first (or last) character in a character vector matches a regular expression, the first (or last) return value from the 'split' keyword is an empty character vector.
Optionally, reassemble the original character vector from the substrings.
combinedStr = strjoin(noMatch,match)
combinedStr = 'She sells sea shells by the seashore.'
Find the names of HTML tags by defining a token within a regular expression. Tokens are indicated with parentheses, ().
str = '<title>My Title</title><p>Here is some text.</p>';
expression = '<(\w+).*>.*</\1>';
[tokens,matches] = regexp(str,expression,'tokens','match');
The regular expression <(\w+).*>.*</\1> specifies this pattern:
- <(\w+) finds an opening angle bracket followed by one or more alphanumeric or underscore characters. Enclosing \w+ in parentheses captures the name of the HTML tag in a token.
- .*> finds any number of additional characters, such as HTML attributes, and a closing angle bracket.
- </\1> finds the end tag corresponding to the first token (indicated by \1). The end tag has the form </tagname>.
View the tokens and matching substrings.
tokens{1}{1} =
title
tokens{2}{1} =
p
matches{1} =
<title>My Title</title>
matches{2} =
<p>Here is some text.</p>
Parse dates that can appear with either the day or the month first, in these forms: mm/dd/yyyy or dd-mm-yyyy. Use named tokens to identify each part of the date.
str = '01/11/2000 20-02-2020 03/30/2000 16-04-2020';
expression = ['(?<month>\d+)/(?<day>\d+)/(?<year>\d+)|'...
              '(?<day>\d+)-(?<month>\d+)-(?<year>\d+)'];
tokenNames = regexp(str,expression,'names');
The regular expression specifies this pattern:
- (?<name>\d+) finds one or more numeric digits and assigns the result to the token indicated by name.
- | is the logical or operator, which indicates that there are two possible patterns for dates. In the first pattern, slashes (/) separate the tokens. In the second pattern, hyphens (-) separate the tokens.
View the named tokens.
for k = 1:length(tokenNames)
    disp(tokenNames(k))
end
month: '01'
day: '11'
year: '2000'
month: '02'
day: '20'
year: '2020'
month: '03'
day: '30'
year: '2000'
month: '04'
day: '16'
year: '2020'
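Because the 'names' output is a structure array with one element per match, individual fields can be read with dot indexing. The sketch below assumes the same input and expression as the example above. The variables firstMonth and years are illustrative names.

```matlab
str = '01/11/2000 20-02-2020 03/30/2000 16-04-2020';
expression = ['(?<month>\d+)/(?<day>\d+)/(?<year>\d+)|' ...
              '(?<day>\d+)-(?<month>\d+)-(?<year>\d+)'];
tokenNames = regexp(str,expression,'names');

% Read one field of one match as text.
firstMonth = tokenNames(1).month;

% Collect a field across all matches and convert it to numbers.
years = str2double({tokenNames.year});
```

With this input, firstMonth is '01' and years is [2000 2020 2000 2020].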
Find both uppercase and lowercase instances of a word.
By default, regexp performs case-sensitive matching.
str = 'A character vector with UPPERCASE and lowercase text.';
expression = '\w*case';
matchStr = regexp(str,expression,'match')
matchStr = 1×1 cell array {'lowercase'}
The regular expression specifies that the character vector:
- Begins with any number of alphanumeric or underscore characters, \w*.
- Ends with the literal text case.
The regexpi function uses the same syntax as regexp, but performs case-insensitive matching.
matchWithRegexpi = regexpi(str,expression,'match')
matchWithRegexpi = 1×2 cell {'UPPERCASE'} {'lowercase'}
Alternatively, disable case-sensitive matching for regexp using the 'ignorecase' option.
matchWithIgnorecase = regexp(str,expression,'match','ignorecase')
matchWithIgnorecase = 1×2 cell {'UPPERCASE'} {'lowercase'}
For multiple expressions, disable case-sensitive matching for selected expressions using the (?i) search flag.
expression = {'(?-i)\w*case';...
              '(?i)\w*case'};
matchStr = regexp(str,expression,'match');
celldisp(matchStr)
matchStr{1}{1} =
lowercase
matchStr{2}{1} =
UPPERCASE
matchStr{2}{2} =
lowercase
Create a character vector that contains a newline, \n, and parse it using a regular expression. Since regexp returns matchStr as a cell array containing text that has multiple lines, you can take the text out of the cell array to display all lines.
str = sprintf('abc\n de');
expression = '.*';
matchStr = regexp(str,expression,'match');
matchStr{:}
By default, the dot (.) matches every character, including the newline, and returns a single match that is equivalent to the original character vector.
Exclude newline characters from the match using the 'dotexceptnewline' option. This returns separate matches for each line of text.
matchStrNoNewline = regexp(str,expression,'match','dotexceptnewline')
matchStrNoNewline = 1×2 cell {'abc'} {' de'}
Find the first or last character of each line using the ^ or $ metacharacters and the 'lineanchors' option.
expression = '.$';
lastInLine = regexp(str,expression,'match','lineanchors')
lastInLine = 1×2 cell {'c'} {'e'}
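The ^ metacharacter works the same way for the start of each line. The following sketch uses a made-up two-line string and compares the result with and without 'lineanchors'. The variable names are illustrative.

```matlab
% Hypothetical two-line input, built with sprintf so \n becomes a newline.
str = sprintf('first line\nsecond line');

% With 'lineanchors', ^ matches at the start of every line.
firstInLine = regexp(str,'^.','match','lineanchors');

% With the default 'stringanchors', ^ matches only at the start of the text.
firstOverall = regexp(str,'^.','match');
```

Here firstInLine contains 'f' and 's', while firstOverall contains only 'f'.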
Find matches within a piece of text and return the output in a scalar cell.
Find words that start with c, end with t, and contain one or more vowels between them. Return the starting indices in a scalar cell.
str = 'bat cat can car coat court CUT ct CAT-scan';
expression = 'c[aeiou]+t';
startIndex = regexp(str,expression,'forceCellOutput')
startIndex = 1×1 cell array {[5 17]}
To access the starting indices as a numeric array, index into the cell, for example with startIndex{1}.
Return the matching and nonmatching substrings. Each output is in its own scalar cell.
[match,noMatch] = regexp(str,expression,'match','split','forceCellOutput')
match = 1×1 cell array {1×2 cell}
noMatch = 1×1 cell array {1×3 cell}
To access the array of matches, index into match.
match{1}
ans = 1×2 cell {'cat'} {'coat'}
To access the substrings that do not match, index into noMatch.
noMatch{1}
ans = 1×3 cell {'bat '} {' can car '} {' court CUT ct CAT-scan'}
Input Arguments
Data Types: string | char | cell
Data Types: char | cell | string
Data Types: char | string
Search option, specified as a character vector. Options come in pairs: one option that corresponds to the default behavior, and one option that allows you to override the default. Specify only one option from a pair. Options can appear in any order.
Default | Override | Description |
---|---|---|
'all' | 'once' | Match the expression as many times as possible (default), or only once. |
'nowarnings' | 'warnings' | Suppress warnings (default), or display them. |
'matchcase' | 'ignorecase' | Match letter case (default), or ignore case. |
'noemptymatch' | 'emptymatch' | Ignore zero length matches (default), or include them. |
'dotall' | 'dotexceptnewline' | Match dot with any character (default), or all except newline (\n). |
'stringanchors' | 'lineanchors' | Apply ^ and $ metacharacters to the beginning and end of a character vector (default), or to the beginning and end of a line. The newline character (\n) specifies the end of a line. The beginning of a line is specified as the first character, or any character that immediately follows a newline character. |
'literalspacing' | 'freespacing' | Include space characters and comments when matching (default), or ignore them. With freespacing, use '\ ' and '\#' to match space and # characters. |
Data Types: char | string
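As a small sketch of how an option changes the result, the following compares the default 'all' behavior with 'once', using the sample text from the first example. The variable names are illustrative.

```matlab
str = 'bat cat can car coat court CUT ct CAT-scan';
expression = 'c[aeiou]+t';

% Default ('all'): every match is returned, as a cell array.
allMatches = regexp(str,expression,'match');

% 'once': only the first match is returned, as a character vector.
firstMatch = regexp(str,expression,'match','once');

% 'once' also limits the index output to the first match.
firstIndex = regexp(str,expression,'once');
```

Here allMatches is a 1×2 cell containing 'cat' and 'coat', firstMatch is the character vector 'cat', and firstIndex is 5.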
Output Arguments
More About
Tokens are portions of the matched text that correspond to portions of the regular expression. To create tokens, enclose part of the regular expression in parentheses.
For example, this expression finds a date of the form dd-mmm-yyyy, including tokens for the day, month, and year.
str = 'Here is a date: 01-Apr-2020';
expression = '(\d+)-(\w+)-(\d+)';
mydate = regexp(str,expression,'tokens');
mydate{:}
ans =
1×3 cell array
{'01'} {'Apr'} {'2020'}
You can associate names with tokens so that they are more easily identifiable:
str = 'Here is a date: 01-Apr-2020';
expression = '(?<day>\d+)-(?<month>\w+)-(?<year>\d+)';
mydate = regexp(str,expression,'names')
mydate =
struct with fields:
day: '01'
month: 'Apr'
year: '2020'
For more information, see Tokens in Regular Expressions.
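The 'tokens' output is nested: the outer cell has one element per match, and each element is itself a cell array of token text. A minimal sketch of indexing into it, with illustrative variable names:

```matlab
str = 'Here is a date: 01-Apr-2020';
expression = '(\d+)-(\w+)-(\d+)';
mydate = regexp(str,expression,'tokens');

% First index selects the match, second index selects the token.
monthName = mydate{1}{2};

% Tokens are text, so convert them when a number is needed.
yearNum = str2double(mydate{1}{3});
```

For this input, monthName is 'Apr' and yearNum is 2020.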
Tips
- Use contains or strfind to find an exact character match within text. Use regexp to look for a pattern of characters.
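The distinction can be sketched with the sample text used earlier on this page. The variable names are illustrative.

```matlab
str = 'bat cat can car coat court CUT ct CAT-scan';

% Exact text: contains returns a logical, strfind returns start indices.
hasCat = contains(str,'cat');
catIndex = strfind(str,'cat');

% Pattern: regexp finds every substring that fits the expression.
patternIndex = regexp(str,'c[aeiou]+t');
```

Here hasCat is true, catIndex is 5, and patternIndex is [5 17], because the pattern also matches coat.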
Algorithms
MATLAB parses each input character vector or string from left to right, attempting to match the text in the character vector or string with the first element of the regular expression. During this process, MATLAB skips over any text that does not match.
When MATLAB finds the first match, it continues parsing to match the second piece of the expression, and so on.
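One consequence of this left-to-right scan is that matches do not overlap: after a match, the search resumes at the character following it. The sketch below illustrates this with a made-up input. The lookahead expression is one possible way to locate overlapping candidates and is offered as an illustration, not as a recommended method.

```matlab
str = 'aaaa';

% 'aa' matches at positions 1-2 and 3-4. The overlapping
% candidate starting at position 2 is not reported.
startIndex = regexp(str,'aa');

% A lookahead checks the next character without consuming it,
% so every 'a' that is followed by another 'a' is reported.
overlapStart = regexp(str,'a(?=a)');
```

Here startIndex is [1 3] and overlapStart is [1 2 3].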
Extended Capabilities
Version History
Introduced before R2006a