<regex>: Equivalence classes have unexpected behavior with std::wregex · Issue #5435 · microsoft/STL (original) (raw)

Repros with VS 2022 17.14 Preview 4 with microsoft/STL main, including #5392. Tracked by internal VSO-127463 / AB#127463 , originally reported by an external user through the defunct Microsoft Connect site on 2015-06-02.

#include #include #include #include using namespace std;

[[nodiscard]] string escape_wide(const wstring& wstr) { string ret{R"(L")"}; for (const auto& wch : wstr) { ret += format(R"(\x{:x})", static_cast(wch)); } ret += R"(")"; return ret; }

void display_result(const wstring& wstr, const wstring& pattern) { const locale loc{"fr-FR"};

wregex rgx;
rgx.imbue(loc);
rgx.assign(pattern, regex_constants::icase | regex_constants::collate);

const bool result = regex_match(wstr, rgx);

regex_traits<wchar_t> tr;
tr.imbue(loc);
const wstring primary_sort_key = tr.transform_primary(wstr.begin(), wstr.end());

println("wstr: {}; result: {:>5}; primary_sort_key: {}", escape_wide(wstr), result, escape_wide(primary_sort_key));

}

int main() { display_result(L"E", L"[[=e=]]"); display_result(L"\u00C8", L"[[=e=]]"); // LATIN CAPITAL LETTER E WITH GRAVE display_result(L"\u00C9", L"[[=e=]]"); // LATIN CAPITAL LETTER E WITH ACUTE display_result(L"\u00CA", L"[[=e=]]"); // LATIN CAPITAL LETTER E WITH CIRCUMFLEX

display_result(L"e", L"[[=e=]]");
display_result(L"\u00E8", L"[[=e=]]"); // LATIN SMALL LETTER E WITH GRAVE
display_result(L"\u00E9", L"[[=e=]]"); // LATIN SMALL LETTER E WITH ACUTE
display_result(L"\u00EA", L"[[=e=]]"); // LATIN SMALL LETTER E WITH CIRCUMFLEX

}

C:\Temp>cl /EHsc /nologo /W4 /std:c++latest /MTd /Od meow.cpp && meow
meow.cpp
wstr: L"\x45"; result:  true; primary_sort_key: L"\xe\x21\x1\x1\x1\x1\x0"
wstr: L"\xc8"; result: false; primary_sort_key: L"\xe\x21\x1\xf\x1\x1\x1\x0"
wstr: L"\xc9"; result: false; primary_sort_key: L"\xe\x21\x1\xe\x1\x1\x1\x0"
wstr: L"\xca"; result: false; primary_sort_key: L"\xe\x21\x1\x12\x1\x1\x1\x0"
wstr: L"\x65"; result:  true; primary_sort_key: L"\xe\x21\x1\x1\x1\x1\x0"
wstr: L"\xe8"; result: false; primary_sort_key: L"\xe\x21\x1\xf\x1\x1\x1\x0"
wstr: L"\xe9"; result: false; primary_sort_key: L"\xe\x21\x1\xe\x1\x1\x1\x0"
wstr: L"\xea"; result: false; primary_sort_key: L"\xe\x21\x1\x12\x1\x1\x1\x0"

The user expects regex_match to always return true here.

I don't understand why LCMapStringEx with LCMAP_SORTKEY is producing these primary sort keys. Are we supposed to be passing extra flags to ignore diacritics?