GitHub - digitalwave/msc_retest at pcre2support-jit-without-stack (original) (raw)

msc_retest

Welcome to the msc_retest documentation.

Description

This tool compiles two binaries: pcre4msc2 and pcre4msc3. The binaries emulates the behaviors of regex engine (PCRE - the old version) in mod_security2 (Apache module) and the libmodsecurity3. With this programs, you can check the evaluation time and result of every regular expressions with any random (including very extreme long) input. Both of them (regex pattern, input subject) needs to exists in two separated files, and you can pass them as argument. Subject can be passed through stdin, if you give '-' for subject file, eg:

echo "arg=../../../etc/passwd&foo=var" | src/pcre4msc2 regexes/930110_1.txt - regexes/930110_1.txt - time elapsed: 0.000011753, match value: SUBJECT MATCHED 1 TIME

or just simple leave it:

echo "arg=../../../etc/passwd&foo=var" | src/pcre4msc3 regexes/930110_1.txt regexes/930110_1.txt - time elapsed: 0.000006, match value: SUBJECT MATCHED 1 TIME

The source tree contains some extra directories: under the data/ you can find all of the regular expressions, what CRS uses. The files contains the id of the rule, and a suffix (there are some chained rules, where more parts uses @rx).

The utils/ directory contains a python tool, which helps to re-generate the regex's above. It uses msc_pyparser - for more information, see the README.md in that directory.

Useful information

There are so many difference between mod_security2 and libmodsecurity3 implementations:

Parsing:

This mean if the rule has been read from the external file (like CRS), that will not be stripped, but if there is an inline rule (see this), then it will.

This strip function is implemented in pcre4msc2 code itself.

pcre_study(), JIT

limit match, limit match recursion

Here you can find more information about pcre_study, and pcrejit. Also you can find some useful information about recursion limits here.

Capture patterns with OVECTOR

If pcre_exec() matches a regular expression against a given subject string, then it collects the sub-patterns. The offsets values will stored in a vector of integers - this is the OVECTOR. It's very important, that the length if this vector must be divide by 3 (the correct values could be multiple of 3). It's important how long of this vector, this value determines the number of possible sub-patterns.

Both versions uses a precompiled value, mod_security2 uses 30, libmodsecurity3 uses 900. You can change these values only if you modify the source and recompile it again.

These information are very important to understand, why and how works the tools.

Other notes - outdated

The libmodsecurity3 @rx implementation had some design errors (which are now fixed): the searchAll() method (which is same here as the original code) always collected the captured sub-strings from subject - this could be make slower the operator. The another thing was there was no limit: if the subject contained the pattern (for example) 100 times, then it collected all of them, so this could be lead to high memory usage (of course it depends on pattern and subject - but the collection will stores only the first 10 matches: from TX.0 to TX.9). Another problem was that this method was used for many other places, not just the @rx operator, eg. variables, transformations... pcre4msc3 contains the fixed version, but you can allow the old method with -f argument. This argument works only with pcre4msc3 tool.

In case of mod_security2, the engine doesn't collects all matches, only the first one.

Here is a small summary, how can you check:

SecRule ARGS|ARGS_NAMES "@rx (?:is)" \  
  "id:800003,\  
  phase:2,\  
  t:none,\  
  pass,\  
  capture,\  
  msg:'%{TX.0}, %{TX.1}, %{TX.2}, %{TX.3}, %{TX.4}, %{TX.5}, %{TX.6}, %{TX.7}, %{TX.8}, %{TX.9}'"  
curl -v 'http://localhost/?foo=this%20is%20what%20is%20this  

note, the argument foo contains the value 'this is what is this', and the pattern in the rule is 'is'. The number of expected values is 4.

Let's see the logs:

ModSecurity: Warning. Pattern match "(?:is)" at ARGS:foo. [file "/usr/share/modsecurity-crs/rules/REQUEST-901-INITIALIZATION.conf"] [line "456"] [id "800003"] [msg "is, , , , , , , , , "] [hostname "localhost"] [uri "/"] [unique_id "XonUavaOEga8L3onA0ZYBAAAAAA"]  
ModSecurity: Warning. Matched "Operator `Rx' with parameter `(?:is)' against variable `ARGS:foo' (Value: `this is what is this' ) [file "/usr/share/modsecurity-crs/rules/REQUEST-901-INITIALIZATION.conf"] [line "447"] [id "800003"] [rev ""] [msg "is, is, is, is, , , , , , "] [data ""] [severity "0"] [ver ""] [maturity "0"] [accuracy "0"] [hostname "0.0.0.0"] [uri "/"] [unique_id "158609110136.080640"] [ref "o2,2o5,2o13,2o18,2v10,20"], client: ::1, server: _, request: "GET /?foo=this%20is%20what%20is%20this HTTP/1.1", host: "localhost"  

As you can see, the libmodsecurity3 produced the expected result, mod_security2 doesn't. Try to increase the number if "is" patterns in your query, eg:

curl -v 'http://localhost/?foo=this%20is%20what%20is%20this%20is%20is%20is...%20is'

and check the modsec_debug.log. As you can see the result is what I described above, libmodsecurity3 old version collects all occurrence of matches:

Added regex subexpression TX.0: is
Added regex subexpression TX.1: is
Added regex subexpression TX.2: is
Added regex subexpression TX.3: is
Added regex subexpression TX.4: is
Added regex subexpression TX.5: is
Added regex subexpression TX.6: is
Added regex subexpression TX.7: is
Added regex subexpression TX.8: is
Added regex subexpression TX.9: is
Added regex subexpression TX.10: is
Added regex subexpression TX.11: is
Added regex subexpression TX.12: is

but only the first 10 has logged. In case of mod_security2, you will see only one line:

Added regex subexpression to TX.0: is

Requirements

To compile the code you need the pcre library (eg. libpcre3-dev package on Debian - this is the old version), and the autotools. libmodsecurity3 supports pcre2 too, so pcre4msc3 supports pcre2 engine too, if libpcre2-dev is available.

Build

Download and unpack the source, and

autoreconf --install
./configure
make

To disable pcre2 support for pcre4msc3, pass the argument --without-pcre2 to ./configure script.

Use

A quick how to:

echo "this is what is this" > subject.txt echo "(?:is)" > pattern.txt src/pcre4msc2 pattern.txt subject.txt pattern.txt - time elapsed: 0.000017264, match value: SUBJECT MATCHED 1 TIME

This means the subject matched with the regex, and the time taken 0.000017264 sec (17.264 usec = 17 264 nanosecs).

Note: all measured time are in nanoseconds.

You can check the exit status of program:

Now check the elapsed time in the output - that's a very important value. That shows you how long the regex engine runs. You can pass an argument to check that this time below your limit or not. For this, try to run:

src/pcre4msc2 -t 0.000001 pattern.txt subject.txt pattern.txt - time elapsed: 0.000016613, match value: SUBJECT MATCHED 1 TIME echo $? 1

As you can see, the time limit (argument of -t) is 0.000001 second (1 ms), it's less than the running time. If this is greater than your limit, the exit codes will 1. Try it again with a greater value:

src/pcre4msc2 -t 0.0001 pattern.txt subject.txt pattern.txt - time elapsed: 0.000017574, match value: SUBJECT MATCHED 1 TIME echo $? 0

src/pcre4msc3 works as same way.

Extra options

Based on the information above, now let's see the possible options for the tools.

option parameter needed meaning pcre4msc2 pcre4msc3
-j no use jit supported not supported
-s no ignore study supported not supported
-n yes number of runs supported supported
-m yes match limit supported not supported
-r yes match lim. rec. supported not supported
-t yes exec. time lim. supported supported
-d no show details supported supported
-f no force alt. meth. not supported supported
-2 no use pcre2 if av. not supported supported

and -h of course.

Examples

Here are some examples:

(Note, that I assumed you have a file with the content for subject.)

Run a tests with all used regular expression with a pattern.

for d in ls -1 regexes/9*.txt; do src/pcre4msc2 ${d} /path/to/subject.txt; done

Same test but filter the rules with extra long times.

for d in ls -1 regexes/9*.txt; do src/pcre4msc2 ${d} /path/to/subject.txt; done | grep "([1-9][0-9]|[1-9]).[0-9]"

Same tests but run it for libmodsecurity3 'engine'.

for d in ls -1 regexes/9*.txt; do src/pcre4msc3 ${d} /path/to/subject.txt; done

Use the filter here too.

for d in ls -1 regexes/9*.txt; do src/pcre4msc3 ${d} /path/to/subject.txt; done | grep "([1-9][0-9]|[1-9]).[0-9]"

Collect regex's where the run time over your limit (0.00001 sec)

for d in ls -1 regexes/9*.txt; do src/pcre4msc2 -t 0.00001 -j d/path/to/subject.txt;if[{d} /path/to/subject.txt; if [ d/path/to/subject.txt;if[? -ne 0 ]; then echo "Time limit exceeded: ${d}"; fi; done | grep "Time limit"

Same tests but use JIT.

for d in ls -1 regexes/9*.txt; do src/pcre4msc2 -j ${d} /path/to/subject.txt; done

Check again and compare the long times with 2nd example.

for d in ls -1 regexes/9*.txt; do src/pcre4msc2 -j ${d} /path/to/subject.txt; done | grep "([1-9][0-9]|[1-9]).[0-9]"

Note, see the difference in case of some patterns - it's awesome performance increase.

Run a test N times

src/pcre4msc3 -n 5 path/to/pattern path/to/subject path/to/pattern - time elapsed: 0.000016934, match value: SUBJECT MATCHED 1 TIME path/to/pattern - time elapsed: 0.000002035, match value: SUBJECT MATCHED 1 TIME path/to/pattern - time elapsed: 0.000001125, match value: SUBJECT MATCHED 1 TIME path/to/pattern - time elapsed: 0.000001013, match value: SUBJECT MATCHED 1 TIME path/to/pattern - time elapsed: 0.000001029, match value: SUBJECT MATCHED 1 TIME Num of values: 5 Mean: 000.000004427 Median: 000.000002035 Min: 000.000001013 Max: 000.000016934 Range: 000.000015921 Std deviation: 000.000006265

Statistics about run time if you run the tools with -n

Take a look to this example:

$ src/pcre4msc2 -j -n 10 pattern.txt subject.txt pattern.txt - time elapsed: 0.000003900, match value: SUBJECT MATCHED 1 TIME pattern.txt - time elapsed: 0.000000348, match value: SUBJECT MATCHED 1 TIME pattern.txt - time elapsed: 0.000000215, match value: SUBJECT MATCHED 1 TIME pattern.txt - time elapsed: 0.000000188, match value: SUBJECT MATCHED 1 TIME pattern.txt - time elapsed: 0.000000196, match value: SUBJECT MATCHED 1 TIME pattern.txt - time elapsed: 0.000000190, match value: SUBJECT MATCHED 1 TIME pattern.txt - time elapsed: 0.000000191, match value: SUBJECT MATCHED 1 TIME pattern.txt - time elapsed: 0.000000191, match value: SUBJECT MATCHED 1 TIME pattern.txt - time elapsed: 0.000000192, match value: SUBJECT MATCHED 1 TIME pattern.txt - time elapsed: 0.000000190, match value: SUBJECT MATCHED 1 TIME Num of values: 10 Mean: 000.000000580 Median: 000.000000194 Min: 000.000000188 Max: 000.000003900 Range: 000.000003712 Std deviation: 000.000001108

All time values are in nanoseconds. Below the last line you can see the quick summary:

Note: the summary will displayed only if you run the tool with -n N, where N > 1.

Change the match and/or recursion limit

src/pcre4msc2 -m 4000 -r 4000 path/to/pattern path/to/subject

See the difference between the modsec_v2 without JIT, modsec v2 with JIT, modsec v3, and old modsec v3 matches algorithm

echo "this is what is this" > subject.txt echo "(?:is)" > pattern.txt

v2

src/pcre4msc2 pattern.txt subject.txt pattern.txt - time elapsed: 0.000015, match value: SUBJECT MATCHED 1 TIME

v2 with JIT

src/pcre4msc2 -j pattern.txt subject.txt pattern.txt - time elapsed: 0.000007, match value: SUBJECT MATCHED 1 TIME

v3 original

src/pcre4msc3 pattern.txt subject.txt pattern.txt - time elapsed: 0.000014, match value: SUBJECT MATCHED 4 TIMES

v3 modified

src/pcre4msc3 -f pattern.txt subject.txt pattern.txt - time elapsed: 0.000010, match value: SUBJECT MATCHED 1 TIME

Of course, for a more precision result you have to run the tests with -n 10, or with more sophisticated pattern and subject.

Show the details with created files above (pattern.txt, subject.txt)*

src/pcre4msc2 -d pattern.txt subject.txt

RAW pattern:

is

ESCAPED pattern:

is

SUBJECT:

this is what is this

JIT:

available but disabled

STUDY:

enabled

MATCH LIMIT:

1000

MATCH LIMIT RECURSION:

1000

RESULT:

pattern.txt - time elapsed: 0.000019, match value: SUBJECT MATCHED 1 TIME

CAPTURES:

this is what is this

OVECTOR:

[2, 4]

where the firs is printed with green color.

Show the details with pattern what contains escape*

echo ""//onerror="" > subject2.txt src/pcre4msc2 -d regexes/941120_1.txt subject2.txt

RAW pattern:

(?i)[\s"'`;/0-9=\x0B\x09\x0C\x3B\x2C\x28\x3B]+on[a-zA-Z]+[\s\x0B\x09\x0C\x3B\x2C\x28\x3B]*?=

ESCAPED pattern:

(?i)[\s"'`;/0-9=\x0B\x09\x0C\x3B\x2C\x28\x3B]+on[a-zA-Z]+[\s\x0B\x09\x0C\x3B\x2C\x28\x3B]*?=

SUBJECT:

"//onerror="

JIT:

available but disabled

STUDY:

enabled

MATCH LIMIT:

1000

MATCH LIMIT RECURSION:

1000

RESULT:

regexes/941120_1.txt - time elapsed: 0.000015, match value: SUBJECT MATCHED 1 TIME

CAPTURES:

"//onerror="

OVECTOR:

[0, 11]

As you can see, the escaped pattern is shorter than the original.

Note, the sub-string between the two asterisks is also highlighted.

See the detailed output from the previous example with pcre4msc3

src/pcre4msc3 -d pattern.txt subject.txt

PATTERN:

is

SUBJECT:

this is what is this

JIT:

avaliable and used

RESULT:

pattern.txt - time elapsed: 0.000015, match value: SUBJECT MATCHED 4 TIMES

CAPTURES:

this is what is this this is what is this this is what is this this is what is this

OVECTOR:

[2, 4, 5, 7, 13, 15, 18, 20]

As you can see, the ModSecurity3 pattern matching code founds the all occurrence of sub-string. Note, that this isn't the escaped pattern part, because this engine doesn't do that. Match limit and match limit recursion also doesn't exists, because they aren't used.

See the screenshot:

Details