Avoid unnecessary ToLower calls in RegexCompiler generated code by stephentoub · Pull Request #35185 · dotnet/runtime

What am I missing?

That * is greedy.

With the expression hello.*world and the input abc hello world world world 123, it will match hello world world world, not hello world. So .* doesn't mean "skip over everything until you see the first 'world'"; it actually means "consume as many .s as possible, and then if the rest of the expression can't be matched at that point, backtrack until it can".
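The greedy behaviour is easy to confirm. Here is a minimal sketch that uses Python's re module purely as a stand-in (its greedy and lazy quantifiers behave the same way as .NET's for this simple pattern); the variable names are made up for the illustration:

```python
import re

text = "abc hello world world world 123"

# Greedy: .* consumes as much as it can, then backtracks until "world"
# matches, so the match ends at the LAST "world".
greedy = re.search(r"hello.*world", text).group()

# Lazy: .*? consumes as little as it can, so the match ends at the
# FIRST "world". This is the behaviour the question above assumed.
lazy = re.search(r"hello.*?world", text).group()

print(greedy)
print(lazy)
```

The greedy form has to run to the end of the line before it can start backtracking, which is why the cost of scanning for \n matters so much.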

By default, . means "anything other than \n" (it's the same as the character class [^\n]), so .* means "skip over everything that's not a \n". We could implement that as "are you \n? are you \n? are you \n?" as we walk along the characters, and that loop does benefit from this change because we no longer need to ToLower each such char. More importantly, though, the change lets us use the existing (new to .NET 5) optimization that employs IndexOf('\n').
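To make the two strategies concrete, here is a rough sketch of both ways to consume a .* run. It is not the code RegexCompiler emits: Python's str.find stands in for IndexOf('\n'), the function names are hypothetical, and case-insensitivity is left out entirely.

```python
def skip_dot_star_loop(text: str, start: int) -> int:
    # Per-character version: ask "are you \n?" of each char in turn.
    i = start
    while i < len(text) and text[i] != "\n":
        i += 1
    return i


def skip_dot_star_find(text: str, start: int) -> int:
    # Search version: a single library call locates the next \n,
    # falling back to the end of the input when there is none.
    i = text.find("\n", start)
    return len(text) if i == -1 else i


sample = "hello world\nnext line"
print(skip_dot_star_loop(sample, 0), skip_dot_star_find(sample, 0))
```

Both functions return the same position; the second simply hands the scan to a search routine that can be heavily optimized, which is only valid when no per-character transformation (such as ToLower) has to happen first.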

This is why I was asking offline (thanks, btw) about what could lowercase to \n. It shows up a lot because of the . wildcard.