regexp: confusing behavior on invalid utf-8 sequences
The following program:
package main

import "regexp"

func main() {
	re := regexp.MustCompile(".")
	println(re.MatchString("\xd1"))
	println(re.MatchString("\xd1\x84"))
	println(re.MatchString("\xd1\xd1"))
	re = regexp.MustCompile("..")
	println(re.MatchString("\xd1"))
	println(re.MatchString("\xd1\x84"))
	println(re.MatchString("\xd1\xd1"))
}
prints:
true
true
true
false
false
true
While the following C++ program:
#include <stdio.h>
#include <re2/re2.h>

int main() {
	RE2 re1(".");
	printf("%d\n", RE2::PartialMatch("\xd1", re1));
	printf("%d\n", RE2::PartialMatch("\xd1\x84", re1));
	printf("%d\n", RE2::PartialMatch("\xd1\xd1", re1));
	RE2 re2("..");
	printf("%d\n", RE2::PartialMatch("\xd1", re2));
	printf("%d\n", RE2::PartialMatch("\xd1\x84", re2));
	printf("%d\n", RE2::PartialMatch("\xd1\xd1", re2));
}
prints:
0
1
0
0
0
0
This raises two questions:
- Why does the behavior differ between regexp and RE2 (RE2 seems to be more consistent)?
- Why does "\xd1\xd1" match both "." and ".."? I could understand it matching one or the other, but not both: is it one character or two? (See the decoding sketch below.)
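To make the second question concrete, here is a minimal sketch, assuming regexp decodes its input with the same rules as unicode/utf8: each of the two 0xd1 bytes is an invalid sequence on its own and decodes to utf8.RuneError (U+FFFD) with width 1, so the string looks like two runes, which would let both "." and ".." find a match.

package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	// "\xd1" is a lead byte of a 2-byte sequence, but the following byte
	// is not a continuation byte, so each byte decodes independently to
	// the replacement rune U+FFFD with width 1.
	s := "\xd1\xd1"
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		fmt.Printf("offset %d: rune %U, width %d\n", i, r, size)
		i += size
	}
}

This prints two U+FFFD runes, one per byte.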
go version devel +b0532a9 Mon Jun 8 05:13:15 2015 +0000 linux/amd64