regexp: confusing behavior on invalid utf-8 sequences

The following program:

package main

import "regexp"

func main() { re := regexp.MustCompile(".") println(re.MatchString("\xd1")) println(re.MatchString("\xd1\x84")) println(re.MatchString("\xd1\xd1")) re = regexp.MustCompile("..") println(re.MatchString("\xd1")) println(re.MatchString("\xd1\x84")) println(re.MatchString("\xd1\xd1")) }

prints:

true
true
true
false
false
true

While the following C++ program:

#include <stdio.h>
#include <re2/re2.h>

int main() {
	RE2 re1(".");
	printf("%d\n", RE2::PartialMatch("\xd1", re1));
	printf("%d\n", RE2::PartialMatch("\xd1\x84", re1));
	printf("%d\n", RE2::PartialMatch("\xd1\xd1", re1));

	RE2 re2("..");
	printf("%d\n", RE2::PartialMatch("\xd1", re2));
	printf("%d\n", RE2::PartialMatch("\xd1\x84", re2));
	printf("%d\n", RE2::PartialMatch("\xd1\xd1", re2));
}

prints:

This raises two questions:

  1. Why does the behavior differ between regexp and re2 (re2 seems to be more consistent)?
  2. Why is "\xd1\xd1" matched against both "." and ".."? I can understand it being matched against one or the other, but not both; is it one character or two? (A decoding sketch follows after this list.)
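For reference, the rune counts on the Go side can be checked directly with the unicode/utf8 package. The following is a minimal, illustrative sketch (not part of the original programs): "\xd1\x84" is the valid two-byte encoding of 'ф' and decodes to a single rune, whereas a lone 0xd1 byte decodes to utf8.RuneError with a width of 1, so "\xd1\xd1" decodes to two runes.

package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	for _, s := range []string{"\xd1", "\xd1\x84", "\xd1\xd1"} {
		// RuneCountInString counts each invalid byte as one (error) rune.
		n := utf8.RuneCountInString(s)
		// DecodeRuneInString returns (utf8.RuneError, 1) for a stray invalid byte.
		r, size := utf8.DecodeRuneInString(s)
		fmt.Printf("%q: runes=%d first=%U width=%d\n", s, n, r, size)
	}
}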

go version devel +b0532a9 Mon Jun 8 05:13:15 2015 +0000 linux/amd64