mishandles locale-based character classes outside of the char range · Issue #992 · microsoft/STL

Describe the bug
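std::wregex does not handle non-ASCII characters correctly: the negated character class \S fails to match characters outside the range [0-255].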

@BillyONeal comments:

This is a longstanding bug in our regex engine -- when we form negated character classes (like \S), we negate the bitmap used for encoding units in the range [0-255], but don't have correct handling for encoding units outside that. We've known about this problem since at least October 5 of 2016, but it's ABI breaking to fix :(.
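The mechanism described in the comment above can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyCharClass`, `negate_buggy`, `negate_fixed`, and `make_space_class` are hypothetical names invented for illustration and do not reflect the actual data structures inside the STL's regex engine. The model assumes a class stores a 256-entry bitmap for small code units plus a separate list for larger ones, and shows how flipping only the bitmap leaves large code units unmatched.

```cpp
#include <bitset>
#include <set>

// Hypothetical toy model of a character class; NOT the real STL layout.
struct ToyCharClass {
    std::bitset<256> low_bits;     // membership for code units 0..255
    std::set<wchar_t> high_units;  // explicitly listed code units above 255
    bool negated = false;          // extra state a correct negation has to keep

    // Buggy negation: only the bitmap is flipped, so code units above 255
    // keep their old (non-negated) answer.
    void negate_buggy() { low_bits.flip(); }

    // Correct negation in this model: flip the bitmap and remember that
    // the answer for code units above 255 is inverted as well.
    void negate_fixed() {
        low_bits.flip();
        negated = !negated;
    }

    bool matches(wchar_t ch) const {
        const unsigned long code = static_cast<unsigned long>(ch);
        if (code < 256) {
            return low_bits[code];
        }
        const bool listed = high_units.count(ch) != 0;
        return listed != negated;
    }
};

// Builds a toy \s class (ASCII whitespace only, to keep the sketch short).
inline ToyCharClass make_space_class() {
    ToyCharClass c;
    const unsigned char ascii_spaces[] = {' ', '\t', '\n', '\v', '\f', '\r'};
    for (unsigned char ws : ascii_spaces) {
        c.low_bits.set(ws);
    }
    return c;
}
```

In this model, calling `negate_buggy()` on the result of `make_space_class()` yields a class that accepts `L'x'` but rejects `L'\xCF87'`, which mirrors the symptom in the report; `negate_fixed()` accepts both. Note that the fixed variant needs the additional `negated` member.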

Command-line test case

d:\Temp2>type repro.cpp
#include <regex>
#include <iostream>

bool test(std::wstring line, std::wstring query)
{
    std::wregex regex(query);
    std::wsmatch res;
    return std::regex_search(line, res, regex);
}

int main()
{
    std::cout << test(L"xxxYxx\x0078xxxZxxx", L"Y\\S*Z") << std::endl; // 0078 is small Latin X
    std::cout << test(L"xxxYxx\xCF87xxxZxxx", L"Y\\S*Z") << std::endl; // U+CF87 is a Hangul syllable; CF 87 is the UTF-8 encoding of small Greek Chi (U+03C7)
}

d:\Temp2>cl /EHsc /W4 /WX .\repro.cpp
Microsoft (R) C/C++ Optimizing Compiler Version 19.27.29009.1 for x86
Copyright (C) Microsoft Corporation.  All rights reserved.

repro.cpp
Microsoft (R) Incremental Linker Version 14.27.29009.1
Copyright (C) Microsoft Corporation.  All rights reserved.

/out:repro.exe
repro.obj

d:\Temp2>.\repro.exe
1
0

Expected behavior
Both test cases should match, so the correct output is:

1
1

STL version

Microsoft Visual Studio Professional 2019 Preview
Version 16.7.0 Preview 3.1

Additional context
Original repro:

#include <regex>
#include <iostream>

int main()
{
    std::wstring line = L"受注、製造、購買オーダのスケジューリングと自動補充生産機能";
    std::wstring query = L"造\\S*オ";
    std::wregex regex(query);
    std::wsmatch res;
    bool found = std::regex_search(line, res, regex);

    std::cout << found << std::endl;
}
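The repros above print 0 or 1 and rely on a reader to compare the output. A self-checking variant may be more convenient as a regression test. The following is a minimal sketch: the helper names `matches` and `count_failures` are hypothetical, and the third sample (a space between Y and Z) is an added assumption that guards against a "fix" that makes \S match everything.

```cpp
#include <regex>
#include <string>

// Returns true when `query` matches somewhere in `line`
// (the same check as test() in the command-line repro).
inline bool matches(const std::wstring& line, const std::wstring& query) {
    const std::wregex re(query);
    return std::regex_search(line, re);
}

// Counts failed expectations; 0 means \S behaved as expected on every sample.
inline int count_failures() {
    int failures = 0;
    // A code unit below 256 between Y and Z: expected to match.
    if (!matches(L"xxxYxx\x0078xxxZxxx", L"Y\\S*Z")) {
        ++failures;
    }
    // A code unit above 255 between Y and Z: expected to match,
    // but reported above as failing with the affected STL version.
    if (!matches(L"xxxYxx\xCF87xxxZxxx", L"Y\\S*Z")) {
        ++failures;
    }
    // A space between Y and Z: \S* must not cross it, so no match is expected.
    if (matches(L"xxxYxx xxxZxxx", L"Y\\S*Z")) {
        ++failures;
    }
    return failures;
}
```

Returning `count_failures()` from `main` would give a nonzero exit code on an implementation that shows the bug, so the check can run unattended.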

This item is also tracked on Developer Community as DevCom-984204 and by Microsoft-internal VSO-273702 / AB#273702.

See also #405

vNext note: Resolving this issue will require breaking binary compatibility. We won't be able to accept pull requests for this issue until the vNext branch is available. See #169 for more information.