GNU BRE/ERE cheatsheet and differences between grep, sed and awk

Poster: GNU BRE/ERE cheatsheet (created using Canva)

This post covers Basic Regular Expressions (BRE) and Extended Regular Expressions (ERE) syntax supported by GNU grep, sed and awk. You'll also learn the differences between these tools — for example, awk doesn't support backreferences within regexp definition (i.e. the search portion).

From the GNU grep manual:

In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and \).

grep and sed support BRE by default and enable ERE when the -E option is used. awk supports only ERE. Assume ERE for descriptions in this post unless otherwise mentioned.
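As a quick illustration of the BRE/ERE split, here's a minimal sketch assuming GNU grep and sed. The sample input and the variable names are made up for this example.

```shell
# Hypothetical sample input: three lines, one of them containing a literal +
input=$(printf 'ab\naab\na+b\n')

# BRE (default): the + quantifier needs a backslash
bre_count=$(printf '%s\n' "$input" | grep -c 'a\+b')      # 2 (ab, aab)

# ERE (-E): + is a quantifier without the backslash
ere_count=$(printf '%s\n' "$input" | grep -cE 'a+b')      # 2 (ab, aab)

# In BRE, an unescaped + is a literal character
bre_literal=$(printf '%s\n' "$input" | grep -c 'a+b')     # 1 (a+b)

# sed follows the same rule: BRE by default, ERE with -E
sed_bre=$(printf 'aab\n' | sed 's/a\+/X/')                # Xb
sed_ere=$(printf 'aab\n' | sed -E 's/a+/X/')              # Xb

printf '%s\n' "$bre_count" "$ere_count" "$bre_literal" "$sed_bre" "$sed_ere"
```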

This post is intended as a reference for BRE/ERE flavor of regular expressions. For a more detailed explanation with examples and exercises, see these chapters from my ebooks:

Anchors

Pattern Description
^ restricts the match to the start of the string
$ restricts the match to the end of the string
\< restricts the match to the start of word
\> restricts the match to the end of word

The -x CLI option in grep matches whole lines only, which is equivalent to ^(pattern)$.
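The anchors above can be tried out with a small sketch like the one below, assuming GNU grep and sed. The sample strings and variable names are invented for illustration.

```shell
# Hypothetical sample lines
lines=$(printf 'cat\nconcat\ncat dog\n')

# ^ and $ anchor to the start and end of the line
at_start=$(printf '%s\n' "$lines" | grep -c '^cat')      # 2 (cat, cat dog)
at_end=$(printf '%s\n' "$lines" | grep -c 'cat$')        # 2 (cat, concat)

# -x matches whole lines only, same result as anchoring both ends
whole_x=$(printf '%s\n' "$lines" | grep -x 'cat')        # cat
whole_anchored=$(printf '%s\n' "$lines" | grep '^cat$')  # cat

# \< and \> anchor to the start and end of a word
word_start=$(printf 'concat cat scatter\n' | sed 's/\<cat/X/g')   # concat X scatter
word_end=$(printf 'concat cat scatter\n' | sed 's/cat\>/X/g')     # conX X scatter

printf '%s\n' "$at_start" "$at_end" "$whole_x" "$whole_anchored" "$word_start" "$word_end"
```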

Word characters include letters, digits and the underscore. Here are some alternative ways to specify word anchors:

Pattern Description
\b restricts the match to the start/end of words, applicable for grep and sed
\y restricts the match to the start/end of words, applicable for awk (\b means backspace)
\B matches wherever \b (or \y) doesn't match

grep also supports the -w CLI option. It is equivalent to (?<!\w)pattern(?!\w), written here in PCRE-style lookaround notation. The three different ways to specify word anchors are not exactly equivalent though; see the Word boundary differences section from my book for details and examples.
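Here's a short sketch of \b, \B and -w in action, assuming GNU grep and sed. The variable names are only for illustration.

```shell
line='par spar apparent spare part'

# \b on both sides: only the standalone word is replaced
whole_word=$(printf '%s\n' "$line" | sed 's/\bpar\b/X/g')   # X spar apparent spare part

# \B on both sides: only occurrences surrounded by word characters
inside=$(printf '%s\n' "$line" | sed 's/\Bpar\B/X/g')       # par spar apXent sXe part

# grep -w, combined with -o to print only the matched portion
grep_w=$(printf '%s\n' "$line" | grep -ow 'par')            # par

printf '%s\n' "$whole_word" "$inside" "$grep_w"
```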

Alternation and Grouping

Pattern Description
pat1|pat2|pat3 match any one of the alternative patterns (use \| in BRE mode)
() group pattern(s), a(b|c)d is same as abd|acd (use \(\) in BRE mode)

The alternative patterns can have their own independent anchors. The alternative that matches earliest in the input gets precedence. If multiple alternatives start from the same location, the longest matching portion wins (irrespective of the order of alternatives). In case of a tie with the same length, the leftmost alternative wins (see stackoverflow: Non greedy matching in sed for a practical use case).
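The precedence rules can be checked with a sketch along these lines, assuming GNU grep and sed. The inputs and variable names are made up for illustration.

```shell
# Earliest match in the input wins, regardless of the order of alternatives
earliest=$(printf 'cats dog bee parrot foxed\n' | sed -E 's/bee|parrot|at/--/')   # c--s dog bee parrot foxed

# Same starting location: the longest alternative wins, whichever way they are ordered
longest=$(printf 'spared party parent\n' | sed -E 's/spa|spared/**/g')            # ** party parent
longest_swapped=$(printf 'spared party parent\n' | sed -E 's/spared|spa/**/g')    # ** party parent

# BRE mode needs \| for alternation
bre_alt=$(printf 'cats dog\n' | sed 's/dog\|cat/X/g')                             # Xs X

# Each alternative can carry its own anchor
anchored=$(printf 'catalog\nbulldog\nconcatenate\n' | grep -cE '^cat|dog$')       # 2

printf '%s\n' "$earliest" "$longest" "$longest_swapped" "$bre_alt" "$anchored"
```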

Pattern Description
\ prefix metacharacters with \ to match them literally
\\ to match \ literally
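As a small sketch of escaping metacharacters (assuming GNU sed; the inputs are invented for this example):

```shell
# Literal parentheses and * in ERE mode
parens=$(printf '%s\n' '(a*b) + c' | sed -E 's/\(a\*b\)/X/')   # X + c

# Literal dot, so that axb is left alone
dot=$(printf '%s\n' 'a.b axb' | sed 's/a\.b/X/g')              # X axb

# Literal backslash
backslash=$(printf '%s\n' 'a\b' | sed 's/\\/-/')               # a-b

printf '%s\n' "$parens" "$dot" "$backslash"
```

printf with %s is used for the input so that the shell doesn't interpret the backslash in the sample text.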
Dot metacharacter and Quantifiers

Pattern Description
. match any character, including the newline character
? match 0 or 1 times (use \? in BRE mode)
* match 0 or more times
+ match 1 or more times (use \+ in BRE mode)
{m,n} match m to n times
{m,} match at least m times
{,n} match up to n times (including 0 times)
{n} match exactly n times
use \{\} in BRE mode for all of the above {} forms
pat1.*pat2 any number of characters between pat1 and pat2
pat1.*pat2|pat2.*pat1 match both pat1 and pat2 in any order
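The quantifiers in the table can be compared side by side with a sketch like this, assuming GNU grep and sed. The sample text and variable names are hypothetical.

```shell
s='ac abc abbc abbbc'

optional=$(printf '%s\n' "$s" | sed -E 's/ab?c/X/g')          # X X abbc abbbc
one_or_more=$(printf '%s\n' "$s" | sed -E 's/ab+c/X/g')       # ac X X X
range=$(printf '%s\n' "$s" | sed -E 's/ab{1,2}c/X/g')         # ac X X abbbc

# BRE mode: the braces need backslashes
at_least_bre=$(printf '%s\n' "$s" | sed 's/ab\{2,\}c/X/g')    # ac abc X X

# Both patterns present, in any order
both=$(printf 'pat1 then pat2\npat2 then pat1\nonly pat1\n' | grep -cE 'pat1.*pat2|pat2.*pat1')   # 2

printf '%s\n' "$optional" "$one_or_more" "$range" "$at_least_bre" "$both"
```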

The precedence rule is that the longest match wins, which is similar to, but not exactly the same as, greedy quantifiers. For example, with foo123312baz as input string, o[123]+(12baz)? will match o123312baz with these tools, whereas it will match o123312 with greedy quantifiers.
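The foo123312baz example can be reproduced with the sketch below, assuming GNU grep and sed.

```shell
# Longest match wins: the optional group is included in the match
matched=$(printf 'foo123312baz\n' | grep -oE 'o[123]+(12baz)?')        # o123312baz
replaced=$(printf 'foo123312baz\n' | sed -E 's/o[123]+(12baz)?/X/')    # foX

# For contrast, a backtracking engine with greedy quantifiers (perl, for example)
# lets [123]+ consume 123312 first, so the same substitution would give foXbaz

printf '%s\n' "$matched" "$replaced"
```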

Character class

Pattern Description
[set123] match any of these characters once
[^set123] match except any of these characters once
[3-7AM-X] range of characters from 3 to 7, A, another range from M to X
[. open collating symbol
.] close collating symbol
[= open equivalence class
=] close equivalence class
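Here's a minimal sketch of sets, negated sets and ranges, assuming GNU sed. The inputs are made up for illustration.

```shell
# Any one of the listed characters
set_match=$(printf 'cat cot cut c9t\n' | sed 's/c[aou]t/X/g')     # X X X c9t

# Any one character other than those in the range
negated=$(printf 'cat cot cut c9t\n' | sed 's/c[^a-z]t/X/g')      # cat cot cut X

# Mix of ranges and individual characters
ranges=$(printf '2 5 8 A N Z\n' | sed 's/[3-7AM-X]/_/g')          # 2 _ 8 _ _ Z

printf '%s\n' "$set_match" "$negated" "$ranges"
```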

Specific placement will help to match character class metacharacters literally.

Pattern Description
[a-z-] - should be first/last character to match literally
[+^] ^ shouldn't be first character
[]=] ] should be first character (second if ^ is used to invert the set)
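The placement rules can be verified with a sketch such as the following, assuming GNU sed. The sample strings are hypothetical.

```shell
# - placed last is literal
dash=$(printf '%s\n' 'ab-12' | sed 's/[a-z-]//g')             # 12

# ^ placed anywhere other than first is literal
caret=$(printf '%s\n' 'a^b+c' | sed 's/[+^]/_/g')             # a_b_c

# ] placed first is literal
bracket=$(printf '%s\n' 'x]=y' | sed 's/[]=]/_/g')            # x__y

# All three rules combined: ] first, ^ in the middle, - last
combined=$(printf '%s\n' 'a-b+c^d]e' | sed 's/[]+^-]/_/g')    # a_b_c_d_e

printf '%s\n' "$dash" "$caret" "$bracket" "$combined"
```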

Some commonly used character sets have predefined escape sequences:

Pattern Description
\w similar to [a-zA-Z0-9_] for matching word characters
\s similar to [ \t\n\r\f\v] for matching whitespace characters
\W match non-word characters
\S match non-whitespace characters
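A short sketch of these escape sequences, assuming GNU grep and sed (they are GNU extensions, so other implementations may differ):

```shell
# \W matches the comma, the space and the exclamation mark
non_word=$(printf 'hello, world_1!\n' | sed 's/\W/-/g')       # hello--world_1-

# \s+ collapses the run of space, tab, space
spaces=$(printf 'a \t b\n' | sed -E 's/\s+/_/g')              # a_b

# \w+ extracts the words, one per line
words=$(printf 'foo:bar baz\n' | grep -oE '\w+')

printf '%s\n' "$non_word" "$spaces" "$words"
```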

Escape sequences

The escape sequences in this section apply only to sed and awk unless otherwise specified, and they can be used within character classes too. See also ASCII Codes Table Standard characters.

Escape sequence Description
\a alert
\b backspace in awk, word boundary in grep and sed
\b inside a character class in sed will act as a backspace
\f formfeed
\n newline
\r carriage return
\t horizontal tab
\v vertical tab
\cx CONTROL-x in sed

You can also represent ASCII characters using their codepoint values.

Escape sequence Description
\xNN hexadecimal digits
\NNN octal digits in awk
\oNNN octal digits in sed
\dNNN decimal digits in sed
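The sketch below tries a few of these escape sequences, assuming GNU sed for the \x, \o and \d forms. The inputs and variable names are invented for illustration.

```shell
# \t in sed and in awk
sed_tab=$(printf 'a\tb\n' | sed 's/\t/:/')                  # a:b
awk_tab=$(printf 'a\tb\n' | awk '{gsub(/\t/, ":")} 1')      # a:b

# Codepoint forms in GNU sed: hex 2d is -, octal 101 is A, decimal 066 is B
hex=$(printf 'A-B\n' | sed 's/\x2d/+/')                     # A+B
octal=$(printf 'ABC\n' | sed 's/\o101/x/')                  # xBC
decimal=$(printf 'ABC\n' | sed 's/\d066/x/')                # AxC

printf '%s\n' "$sed_tab" "$awk_tab" "$hex" "$octal" "$decimal"
```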

Ways to use escape sequences with grep:
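One possible approach, sketched here under the assumption of a POSIX shell, is to let printf produce the literal character and pass that to grep. This may not be the exact set of methods the original post listed. Other options include the $'\t' quoting form in bash and grep -P '\t' where grep is built with PCRE support.

```shell
# Build a literal tab character with printf (hypothetical variable name)
tab=$(printf '\t')

# Only the first input line contains a tab
tab_lines=$(printf 'a\tb\na b\n' | grep -c "$tab")    # 1

printf '%s\n' "$tab_lines"
```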

Named character sets

The table below lists named sets and their equivalent character classes in ASCII encoding. These can be used inside character classes only. For example, [[:digit:]] is the same as [0-9] and [[:alnum:]_] is equivalent to \w.

Named set Description
[:digit:] [0-9]
[:lower:] [a-z]
[:upper:] [A-Z]
[:alpha:] [a-zA-Z]
[:alnum:] [0-9a-zA-Z]
[:xdigit:] [0-9a-fA-F]
[:cntrl:] control characters — first 32 ASCII characters and 127th (DEL)
[:punct:] all the punctuation characters
[:graph:] [:alnum:] and [:punct:]
[:print:] [:alnum:], [:punct:] and space
[:blank:] space and tab characters
[:space:] whitespace characters, same as \s
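Here's a minimal sketch using a few of the named sets, assuming GNU grep and sed with ASCII input. The sample strings are made up.

```shell
digits=$(printf 'abc-123_XYZ\n' | sed 's/[[:digit:]]/#/g')        # abc-###_XYZ
no_punct=$(printf 'hi, there!\n' | sed 's/[[:punct:]]//g')        # hi there

# [[:alnum:]_] behaves like \w, so - splits the input into two matches
word_like=$(printf 'abc-123_XYZ\n' | grep -oE '[[:alnum:]_]+')

printf '%s\n' "$digits" "$no_punct" "$word_like"
```

Note the double brackets: the named set goes inside a character class, which is why [[:alnum:]_] can add the underscore alongside it.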

From the grep manual:

Their interpretation depends on the LC_CTYPE locale; for example, [[:alnum:]] means the character class of numbers and letters in the current locale.

Backreferences

Pattern Description
\N backreference, gives the matched portion of the Nth capture group (possible values: \1, \2 up to \9)
& represents entire matched string in the replacement section
\0 equivalent to & in sed
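The backreference forms can be sketched as follows, assuming GNU grep and sed. The inputs and variable names are hypothetical.

```shell
# \1 and \2 in the replacement section swap the two words
swapped=$(printf 'hello world\n' | sed -E 's/(\w+) (\w+)/\2 \1/')         # world hello

# Same substitution in BRE mode
swapped_bre=$(printf 'hello world\n' | sed 's/\(\w\+\) \(\w\+\)/\2 \1/')  # world hello

# \1 within the regexp itself: a letter followed by the same letter
doubled=$(printf 'aa bb cd\n' | grep -oE '([a-z])\1')

# & and \0 both stand for the entire matched string in sed
amp=$(printf 'hello\n' | sed 's/.*/[&]/')                                 # [hello]
zero=$(printf 'hello\n' | sed 's/.*/[\0]/')                               # [hello]

printf '%s\n' "$swapped" "$swapped_bre" "$doubled" "$amp" "$zero"
```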

Notes for awk:

sed flags

This section discusses flags (also known as modifiers) that change the regexp behavior. When used with regexp addressing:

Flag Description
I match case insensitively

When used with substitution command:

Flag Description
i or I match case insensitively
g replace all occurrences instead of just the first match
N a number will cause only the Nth match to be replaced
Ng replace from the Nth match to the end
m or M multiline mode: . will not match the newline character; ^ and $ will match every line's start and end locations (line separator is \n by default and NUL when the -z option is used)
\` always match the start of string irrespective of multiline mode
\' always match the end of string irrespective of multiline mode
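The flags above can be tried with a sketch like this one, assuming GNU sed. The sample inputs and variable names are invented.

```shell
# I with regexp addressing: print lines matching cat in any case
addr=$(printf 'Cat\ndog\ncat\n' | sed -n '/cat/Ip')

# i and g combined in a substitution
ci=$(printf 'Hello hello\n' | sed 's/hello/X/gi')          # X X

# Only the 2nd match, then the 2nd match onwards
second=$(printf 'a,b,c,d\n' | sed 's/,/-/2')               # a,b-c,d
from_second=$(printf 'a,b,c,d\n' | sed 's/,/-/2g')         # a,b-c-d

# N joins two lines in the pattern space; with M, ^ also matches after the newline
multi=$(printf 'a\nb\n' | sed 'N; s/^b/X/M')

printf '%s\n' "$addr" "$ci" "$second" "$from_second" "$multi"
```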

Flags are not supported by grep or awk. But these equivalent/alternative options can be used:
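For instance, case-insensitive matching is available through grep's -i option, and in awk one portable way is to lowercase the input with tolower() (GNU awk also offers the IGNORECASE variable). The sketch below uses made-up input and is not necessarily the set of alternatives the original post had in mind.

```shell
# grep: -i for case-insensitive matching
grep_ci=$(printf 'Cat\ndog\nCAT\n' | grep -ci 'cat')                                  # 2

# awk: compare against a lowercased copy of the line
awk_ci=$(printf 'Cat\ndog\nCAT\n' | awk 'tolower($0) ~ /cat/ {n++} END {print n}')    # 2

# grep -o prints every match on its own line, loosely comparable to the g flag
all_matches=$(printf 'a,b,c\n' | grep -o '[a-z]')

printf '%s\n' "$grep_ci" "$awk_ci" "$all_matches"
```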

The behavior of sed and awk differs for the Nth match if the pattern can match an empty string:

$ echo 'a,,c,d,,f' | sed 's/[^,]*/b/2'
a,b,c,d,,f
$ echo 'a,,c,d,,f' | sed 's/[^,]*/e/5'
a,,c,d,e,f

$ echo 'a,,c,d,,f' | awk '{print gensub(/[^,]*/, "b", 2)}'
ab,,c,d,,f
$ echo 'a,,c,d,,f' | awk '{print gensub(/[^,]*/, "e", 5)}'
a,,ce,d,,f

sed case conversion

Escape sequence Description
\E indicates end of case conversion in replacement section
\l convert next character to lowercase
\u convert next character to uppercase
\L convert following characters to lowercase, stops if \U or \E is found
\U convert following characters to uppercase, stops if \L or \E is found
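A minimal sketch of the case conversion escapes, assuming GNU sed (these are GNU extensions). The inputs are hypothetical.

```shell
# \u: uppercase the first character of every word
title=$(printf 'hello world\n' | sed -E 's/\b(\w)/\u\1/g')                 # Hello World

# \U: uppercase everything that follows
upper=$(printf 'Hello World\n' | sed 's/.*/\U&/')                          # HELLO WORLD

# \E stops the conversion started by \U
partial=$(printf 'hello world\n' | sed -E 's/(\w+) (\w+)/\U\1\E \2/')      # HELLO world

# \l: lowercase only the next character
lower_first=$(printf 'HELLO\n' | sed 's/./\l&/')                           # hELLO

printf '%s\n' "$title" "$upper" "$partial" "$lower_first"
```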

sed delimiters