<regex>: mishandles locale-based character classes outside of the char range · Issue #992 · microsoft/STL
Describe the bug
Regex does not handle non-ASCII characters.
@BillyONeal comments:
This is a longstanding bug in our regex engine -- when we form negated character classes (like \S), we negate the bitmap used for encoding units in the range [0-255], but don't have correct handling for encoding units outside that. We've known about this problem since at least October 5 of 2016, but it's ABI breaking to fix :(.
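The failure mode described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the actual STL implementation: the names `Bitmap`, `whitespace_bitmap`, `not_space_buggy`, and `not_space_fixed` are hypothetical, and the model assumes that only the six ASCII whitespace characters belong to \s.

```cpp
#include <bitset>
#include <initializer_list>

// Toy model only: membership bitmap for code units in [0, 255].
using Bitmap = std::bitset<256>;

// \s restricted to ASCII whitespace (a simplifying assumption).
Bitmap whitespace_bitmap() {
    Bitmap b;
    for (unsigned char c : {' ', '\t', '\n', '\v', '\f', '\r'}) {
        b.set(c);
    }
    return b;
}

// Buggy \S: the bitmap is flipped, but any code unit above 255
// falls outside the bitmap and is reported as "not in the class".
bool not_space_buggy(wchar_t ch) {
    static const Bitmap flipped = ~whitespace_bitmap();
    const auto u = static_cast<unsigned long>(ch);
    return u < 256 && flipped[u];
}

// Fixed \S: evaluate \s first, then negate the result, so code
// units above 255 (assumed non-whitespace here) are matched.
bool not_space_fixed(wchar_t ch) {
    static const Bitmap spaces = whitespace_bitmap();
    const auto u = static_cast<unsigned long>(ch);
    const bool is_space = u < 256 && spaces[u];
    return !is_space;
}
```

In this model, `not_space_buggy(L'x')` is true but `not_space_buggy` returns false for the code unit 0xCF87, which mirrors the 1 / 0 output of the repro below; `not_space_fixed` returns true for both.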
Command-line test case
d:\Temp2>type repro.cpp
#include <regex>
#include <iostream>
bool test(std::wstring line, std::wstring query)
{
    std::wregex regex(query);
    std::wsmatch res;
    return std::regex_search(line, res, regex);
}
int main()
{
    std::cout << test(L"xxxYxx\x0078xxxZxxx", L"Y\\S*Z") << std::endl; // 0x0078 is small Latin x, inside [0, 255]
    std::cout << test(L"xxxYxx\xCF87xxxZxxx", L"Y\\S*Z") << std::endl; // 0xCF87 is outside [0, 255] (CF 87 is the UTF-8 encoding of small Greek chi)
}
d:\Temp2>cl /EHsc /W4 /WX .\repro.cpp
Microsoft (R) C/C++ Optimizing Compiler Version 19.27.29009.1 for x86
Copyright (C) Microsoft Corporation. All rights reserved.
repro.cpp
Microsoft (R) Incremental Linker Version 14.27.29009.1
Copyright (C) Microsoft Corporation. All rights reserved.
/out:repro.exe
repro.obj
d:\Temp2>.\repro.exe
1
0
Expected behavior
Both calls in the given example should match; the correct output is:
1
1
STL version
Microsoft Visual Studio Professional 2019 Preview
Version 16.7.0 Preview 3.1
Additional context
Original repro:
#include <regex>
#include <iostream>
int main()
{
    std::wstring line = L"受注、製造、購買オーダのスケジューリングと自動補充生産機能";
    std::wstring query = L"造\\S*オ";
    std::wregex regex(query);
    std::wsmatch res;
    bool found = std::regex_search(line, res, regex);
    std::cout << found << std::endl;
}
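For patterns of this exact shape (one character, then \S*, then another character), a hand-rolled scan avoids the regex engine entirely. The following is a minimal sketch, not a general replacement for std::regex_search: `contains_nonspace_run` is a hypothetical helper, it assumes `last` is not itself whitespace, and it relies on std::iswspace, whose results depend on the current C locale.

```cpp
#include <cstddef>
#include <cwctype>
#include <string>

// Hypothetical helper: returns true if `line` contains `first`, followed by
// zero or more non-whitespace characters, followed by `last`.
// This mirrors the semantics of searching for the pattern first\S*last.
bool contains_nonspace_run(const std::wstring& line, wchar_t first, wchar_t last) {
    for (std::size_t i = 0; i < line.size(); ++i) {
        if (line[i] != first) {
            continue;
        }
        // Scan forward from the character after `first`.
        for (std::size_t j = i + 1; j < line.size(); ++j) {
            if (line[j] == last) {
                return true;
            }
            if (std::iswspace(static_cast<std::wint_t>(line[j]))) {
                break; // whitespace ends the \S* run; try the next `first`
            }
        }
    }
    return false;
}
```

With the strings from the repros above, the call would be `contains_nonspace_run(line, L'造', L'オ')` or `contains_nonspace_run(line, L'Y', L'Z')`.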
This item is also tracked on Developer Community as DevCom-984204 and by Microsoft-internal VSO-273702 / AB#273702.
See also #405
vNext note: Resolving this issue will require breaking binary compatibility. We won't be able to accept pull requests for this issue until the vNext branch is available. See #169 for more information.