EmacsWiki: Regular Expression

A regular expression (abbreviated “regexp” or sometimes just “re”) is a search-string with wildcards – and more. It is a pattern that is matched against the text to be searched. See Regexps. Examples:

"alex"

A plain string is a regular expression that matches the string exactly. The above regular expression matches “alex”.

"alexa?"

Some characters have special meanings in a regular expression. The question mark, for example, says that the preceding expression (the character “a” in this case) may or may not be present. The above regular expression matches “alex” or “alexa”.
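One quick way to try patterns like these, assuming you evaluate Emacs Lisp (for instance with M-: or in the *scratch* buffer), is string-match-p, which returns the index where the first match starts, or nil if there is none. The sample strings below are made up for illustration:

```elisp
;; `string-match-p' returns the start index of the first match, or nil.
(string-match-p "alexa?" "alex")    ; 0
(string-match-p "alexa?" "alexa")   ; 0
(string-match-p "alexa?" "ale")     ; nil, the "x" is required

;; The optional "a" is consumed when it is present.
(let ((text "hello alexa!"))
  (when (string-match "alexa?" text)
    (match-string 0 text)))         ; "alexa"
```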

Regexps are important to Emacs users in many ways, including these:

Conventionally, Emacs accepts regular expressions only as strings, and that is still the case for interactive use, such as with ‘M-x query-replace-regexp’. However, when writing Lisp code, one can now use the easier-to-understand Rx Notation, which structures a regular expression as Lisp S-expressions. For example, you can use ‘rx’ like this:

(rx (or (and "/*" (*? anything) "*/") (and "//" (*? anything) eol)))

To produce this regexp in string format (which matches C-style multiline and single line comments):

/\\*\\(?:.\\|\n\\)*?\\*/\\|//\\(?:.\\|\n\\)*?$

When used in InteractiveLispMode, Rx can also be a handy tool to construct regexps. There’s also a package xr (available on GNU ELPA) that does the opposite conversion, from string syntax regexps to something more human-readable.
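As a rough sketch of how an rx form for C-style comments can be used from Lisp, the following collects every comment found in a string. The helper name my-c-comments and the sample text are hypothetical, and the sketch assumes that your Emacs's rx accepts the symbol anything.

```elisp
(require 'rx)

(defun my-c-comments (text)
  "Return a list of the C-style comments found in TEXT.
Hypothetical helper, for illustration only."
  (let ((re (rx (or (and "/*" (*? anything) "*/")
                    (and "//" (*? anything) eol))))
        (start 0)
        (found '()))
    ;; Scan TEXT left to right, resuming after each match.
    (while (string-match re text start)
      (push (match-string 0 text) found)
      (setq start (match-end 0)))
    (nreverse found)))

(my-c-comments "int x; /* first\n   part */ int y; // second\nint z;")
;; returns ("/* first\n   part */" "// second")
```

Because rx expands to an ordinary regexp string, the result can be passed to any function that expects one, such as string-match or re-search-forward.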

Regexp Syntax Basics

Any character matches itself, except for the list below.

The following characters are special : . * + ? ^ $ \ [

Between brackets [], the following are special : ] - ^

Many characters are special when they follow a backslash – see below.

.         any character (but newline)
*         previous character or group, repeated 0 or more times
+         previous character or group, repeated 1 or more times
?         previous character or group, repeated 0 or 1 time
^         start of line
$         end of line
[...]     any character between brackets
[^..]     any character not in the brackets
[a-z]     any character between a and z
\         prevents interpretation of following special char
\|        or
\w        word constituent
\b        word boundary
\sc       character with c syntax (e.g. \s- for whitespace char)
\( \)     start/end of group
\< \>     start/end of word
\_< \_>   start/end of symbol
\` \'     start/end of buffer/string
\1        string matched by the first group
\n        string matched by the nth group
\{3\}     previous character or group, repeated 3 times
\{3,\}    previous character or group, repeated 3 or more times
\{3,6\}   previous character or group, repeated 3 to 6 times
\=        match succeeds if it is located at point
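When one of these constructs is written inside a Lisp string rather than typed interactively, each backslash has to be doubled, because the Lisp reader also treats the backslash as an escape character. A small sketch, with made-up sample strings:

```elisp
;; Group and alternation: typed interactively as \(cat\|dog\)s
(string-match-p "\\(cat\\|dog\\)s" "hotdogs")   ; 3

;; \1 refers back to whatever the first group matched.
(string-match-p "\\(ab\\)\\1" "xxabab")         ; 2

;; \{3\} repeats the preceding item exactly three times.
(string-match-p "^[0-9]\\{3\\}$" "123")         ; 0
(string-match-p "^[0-9]\\{3\\}$" "1234")        ; nil

;; \` and \' anchor to the start and end of the whole string.
(string-match-p "\\`abc\\'" "abc")              ; 0
(string-match-p "\\`abc\\'" "abc\ndef")         ; nil
```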

*?, +?, and ?? are non-greedy versions of *, +, and ? – see NonGreedyRegexp. Also, \W and \Sc match any character that does not match \w and \sc, and \B matches at any position that is not a word boundary (the opposite of \b).
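The difference between a greedy and a non-greedy operator is easiest to see side by side. This is a minimal sketch; the helper name my-first-match and the sample text are hypothetical:

```elisp
(defun my-first-match (regexp text)
  "Return the text of the first match of REGEXP in TEXT, or nil.
Hypothetical helper, for illustration only."
  (when (string-match regexp text)
    (match-string 0 text)))

(let ((text "<b>bold</b> and <i>italic</i>"))
  (list
   ;; Greedy: .* extends as far as possible, up to the last ">".
   (my-first-match "<.*>" text)
   ;; Non-greedy: .*? stops at the first ">" that lets the match succeed.
   (my-first-match "<.*?>" text)))
;; returns ("<b>bold</b> and <i>italic</i>" "<b>")
```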

Character Groups

Characters are organized by category. Use C-u C-x = to display the category of the character under the cursor.

\ca       ascii character
\Ca       non-ascii character (newline included)
\cl       latin character
\cg       greek character
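A short sketch of category matching from Lisp follows. It assumes the default category table that ships with Emacs, and it builds the Greek letter from its code point so that the example does not depend on the encoding of the file it is pasted into:

```elisp
(let ((alpha (string #x3b1)))          ; GREEK SMALL LETTER ALPHA
  (list
   (string-match-p "\\ca" "a")         ; 0, "a" is an ASCII character
   (string-match-p "\\Ca" "a")         ; nil, "a" is not non-ASCII
   (string-match-p "\\cg" alpha)       ; 0, alpha is in the Greek category
   (string-match-p "\\Ca" alpha)))     ; 0, alpha is not ASCII

;; From Lisp, the categories of a character can be listed as a string of
;; one-letter mnemonics (the letters are the ones used after \c):
(category-set-mnemonics (char-category-set ?a))
```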

Syntax classes provide a shorthand for representing some groups of related characters, hence they are also known as character classes. Syntax classes can themselves be written in a square-bracket-delimited form, and that form must in turn be used between square brackets (more below). You can view all syntax classes here. Some of the more useful square-bracket-delimited syntax classes are:

[:digit:]     a digit, same as [0-9]
[:alpha:]     a letter (an alphabetic character)
[:alnum:]     a letter or a digit (an alphanumeric character)
[:upper:]     a letter in uppercase
[:lower:]     a letter in lowercase
[:graph:]     a visible character
[:print:]     a visible character plus the space character
[:space:]     a whitespace character, as defined by the syntax table, but typically [ \t\r\n\v\f ], which includes the newline character
[:blank:]     a space or tab character
[:xdigit:]    a hexadecimal digit
[:cntrl:]     a control character
[:ascii:]     an ascii character
[:nonascii:]  any non-ascii character

Syntax classes can also be represented using backslashes: e.g.,

\s-   whitespace character
\sw   word constituent
\s_   symbol constituent
\s.   punctuation character
\s(   open delimiter character
\s)   close delimiter character
\s"   string quote character
\s\   escape character
\s/   character quote character
\s$   paired delimiter
\s'   expression prefix
\s<   comment starter
\s>   comment ender
\s!   generic comment delimiter
\s|   generic string delimiter

You can see the current syntax table by typing C-h s. The syntax table depends on the current mode. As expected, letters a..z are listed as word constituents in text-mode. Other word constituents in this mode include A..Z, 0..9, $, %, currency units, accented letters, and kanji. See EmacsSyntaxTable for details.
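Because the syntax table decides what \sw (or \w) matches, the same regexp can give different results under different tables. The following sketch makes that visible; the helper name my-first-word is hypothetical, and the example relies on "_" being a symbol constituent, not a word constituent, in the standard syntax table:

```elisp
(defun my-first-word (text table)
  "Return the first run of word constituents in TEXT under syntax TABLE.
Hypothetical helper, for illustration only."
  (with-syntax-table table
    (when (string-match "\\sw+" text)
      (match-string 0 text))))

;; Under the standard syntax table the match stops before the underscore.
(my-first-word "foo_bar" (standard-syntax-table))      ; "foo"

;; In a copy where "_" is given word syntax, the same regexp matches more.
(let ((table (copy-syntax-table (standard-syntax-table))))
  (modify-syntax-entry ?_ "w" table)
  (my-first-word "foo_bar" table))                     ; "foo_bar"
```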

Syntax Class Usage

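One must remember that square-bracket-delimited syntax classes must be used within square brackets, just like any other set of characters to match or reject: e.g. [[:upper:][:digit:].] matches an uppercase letter, a digit, or a period. (Inside brackets, \| does not mean “or” and the period needs no backslash; members of the set are simply listed one after another.) This is easy to overlook when using square-bracket-delimited syntax classes, since they include their own square brackets! E.g., if searching for a single whitespace character in a regexp, this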

[:space:]

will fail, because it is read as an ordinary character set containing the characters :, s, p, a, c and e. Just remember to add a second, outer pair of square brackets: e.g., this

[[:space:]]

will succeed in finding a single whitespace character, this

[[:space:]]*

will succeed in finding zero or more whitespace characters, etc.
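The pitfall can be checked directly from Lisp. This is a small sketch with made-up sample strings; the first two lines show what a class written without the outer brackets actually matches:

```elisp
;; Without the outer brackets, "[:space:]" is just a character set
;; containing the characters ":", "s", "p", "a", "c" and "e".
(string-match-p "[:space:]" " ")         ; nil, a space is not in that set
(string-match-p "[:space:]" "cap")       ; 0, "c" is in that set

;; With the outer brackets it is a real whitespace class.
(string-match-p "[[:space:]]" "a b")     ; 1

;; Zero or more whitespace characters at the start of a string.
(let ((text "   x"))
  (string-match "\\`[[:space:]]*" text)
  (match-end 0))                         ; 3
```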

Idiosyncrasies of Emacs Regular Expressions

Some Regexp Examples

[-+[:digit:]]                     digit or + or - sign
\(\+\|-\)?[0-9]+\(\.[0-9]+\)?     decimal number (-2 or 1.5 but not .2 or 1.)
 +$                               trailing whitespace (note the starting SPC)
\w\{20,\}                         word with 20 letters or more
\w+phony\>                        word ending in phony
\(19\|20\)[0-9]\{2\}              year 1900-2099
^.\{6,\}                          at least 6 symbols
^[a-zA-Z0-9_]\{3,16\}$            decent string for a user name
<tag[^> C-q C-j ]*>\(.*?\)</tag>  html tag
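A few of these examples can be checked from Lisp. In the sketch below the patterns are written as Lisp strings, so every backslash of the Emacs regexp syntax is doubled. The helper name my-full-match-p is hypothetical; it wraps a pattern in \` and \' so that the whole string has to match:

```elisp
(defun my-full-match-p (regexp text)
  "Return t if REGEXP matches the whole of TEXT, nil otherwise.
Hypothetical helper, for illustration only."
  (and (string-match-p (concat "\\`\\(?:" regexp "\\)\\'") text) t))

(let ((decimal "\\(\\+\\|-\\)?[0-9]+\\(\\.[0-9]+\\)?")
      (year    "\\(19\\|20\\)[0-9]\\{2\\}"))
  (list (my-full-match-p decimal "-2")      ; t
        (my-full-match-p decimal "1.5")     ; t
        (my-full-match-p decimal ".2")      ; nil
        (my-full-match-p decimal "1.")      ; nil
        (my-full-match-p year "1999")       ; t
        (my-full-match-p year "2100")))     ; nil

;; The user-name pattern already carries its own ^ and $ anchors.
(string-match-p "^[a-zA-Z0-9_]\\{3,16\\}$" "emacs_user")   ; 0
(string-match-p "^[a-zA-Z0-9_]\\{3,16\\}$" "ab")           ; nil
```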

Some Emacs Commands that Use Regular Expressions

C-M-s                   incremental forward search matching regexp
C-M-r                   incremental backward search matching regexp
replace-regexp          replace string matching regexp
query-replace-regexp    same, but query before each replacement
align-regexp            align, using strings matching regexp as delimiters
highlight-regexp        highlight strings matching regexp
occur                   show lines containing a match
multi-occur             show lines in all buffers containing a match
how-many                count the number of strings matching regexp
keep-lines              delete all lines except those containing matches
flush-lines             delete lines containing matches
grep                    call unix grep command and put result in a buffer
lgrep                   user-friendly interface to the grep command
rgrep                   recursive grep
dired-do-copy-regexp    copy files with names matching regexp
dired-do-rename-regexp  rename files matching regexp
find-grep-dired         display files containing matches for regexp with Dired
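Several of these commands can also be called from Lisp, where they act on the current buffer from point onward. The following is a minimal sketch on a temporary buffer with made-up contents; the replacement step uses re-search-forward with replace-match, which is the usual Lisp-level counterpart of replace-regexp:

```elisp
(with-temp-buffer
  (insert "# comment\nfoo = 1\n# another\nbar = 22\n")
  (goto-char (point-min))
  ;; `how-many' counts the matches between point and the end of the buffer.
  (let ((comments (how-many "^#")))
    ;; `flush-lines' deletes every line that contains a match.
    (flush-lines "^#")
    (goto-char (point-min))
    ;; In the replacement text, \1 stands for the first group.
    (while (re-search-forward "^\\(\\w+\\) = " nil t)
      (replace-match "\\1: "))
    (list comments (buffer-string))))
;; returns (2 "foo: 1\nbar: 22\n")
```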

Note that list-matching-lines is an alias for occur and delete-matching-lines is an alias for flush-lines. The command highlight-regexp is bound to C-x w h. Also query-replace-regexp is bound by default to C-M-%, although some people prefer using an alias, like M-x qrr. Put the following in your InitFile to create such an alias.

(defalias 'qrr 'query-replace-regexp)

See also: IncrementalSearch, ReplaceRegexp, AlignCommands, OccurBuffer, DiredPower

Tools for Constructing Regexps

Study and Practice

Use Icicles to Learn about Regexps

Icicles provides these interactive ways to learn about regexps:

Questions

Does Emacs support lookahead/lookbehind?

No, Emacs does not support Perl-style lookahead/lookbehind expressions.

Note: If you have Python installed, you can add the package visual regexp steroids to use Python-based regular expressions. Interactive regular expressions (such as isearch-forward-regexp and replace-regexp) written using this package do support lookahead and lookbehind expressions.

Does Emacs support possessive quantifiers such as ?+, *+, ++ ?

No, Emacs does not support possessive quantifiers.

The escape sequence \cC represents any character of category “C”, and according to Emacs documentation invoked by “M-x describe-categories”, \c6 ought to match any digit. Yet the ASCII digits 0-9 are not matched by \c6. Is this an error, or just something on the to-do list?


CategoryRegexp CategoryGlossary