<locale>: make collate::hash return the same hash for strings that collate the same by muellerj2 · Pull Request #5469 · microsoft/STL

Fixes #5212.

At first glance, calling LCMapStringEx with LCMAP_HASH seems to be the obvious solution, but the docs state (emphasis mine):

Strings that appear equivalent typically return the same hash (for example, "hello" and "HELLO" with LCMAP_IGNORECASE). However, some complex cases, such as East Asian languages, can have similar strings with identical weights that compare as equal but do not return the same hash.

I briefly tested this and the warning seems to be true, as the following program finds some single-character strings that collate the same but yield different hashes:

#include <iostream>
#include <windows.h>

using namespace std;

int main() {
    const wchar_t* locale = L"ja-JP";
    // Try every pair of distinct single-character strings (w, x) with w < x.
    for (wchar_t w = 1; w != 0; ++w) {
        int hashw;
        // With LCMAP_HASH, the destination buffer receives an int-sized hash.
        if (LCMapStringEx(locale, LCMAP_HASH, &w, 1, reinterpret_cast<LPWSTR>(&hashw), sizeof(int), nullptr, nullptr, 0) == 0) {
            continue;
        }

        for (wchar_t x = w + 1; x != 0; ++x) {
            int hashx;
            if (LCMapStringEx(locale, LCMAP_HASH, &x, 1, reinterpret_cast<LPWSTR>(&hashx), sizeof(int), nullptr, nullptr, 0) == 0) {
                continue;
            }
            // Report pairs that compare equal but hash differently.
            if (hashw != hashx && CompareStringEx(locale, 0, &w, 1, &x, 1, nullptr, nullptr, 0) == CSTR_EQUAL) {
                cout << "found different hashes for characters collating the same at: " << int(w) << " : " << int(x) << '\n';
            }
        }
    }
}

Output:

found different hashes for characters collating the same at: 1556 : 64606
found different hashes for characters collating the same at: 1611 : 65137
found different hashes for characters collating the same at: 1614 : 65143
[...]

This means that LCMapStringEx with LCMAP_HASH is not good enough to meet the requirement in [locale.collate.virtuals]/3, which says that hash must return the same value for any two strings for which do_compare returns 0.

For this reason, this PR computes the sort key first (by calling do_transform) and then hashes the sort key.
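The idea of hashing the sort key can be sketched in portable C++ as follows. This is only an illustration, not the code in this PR: the helper name hash_via_sort_key is hypothetical, it goes through the public transform member rather than do_transform, and FNV-1a is an assumed stand-in for whatever hash function the implementation actually applies to the key.

```cpp
#include <cstdint>
#include <locale>
#include <string>

// Hypothetical helper (not the PR's code): hash a string through its sort key.
template <class CharT>
std::uint32_t hash_via_sort_key(const std::collate<CharT>& coll, const std::basic_string<CharT>& s) {
    // transform() yields a sort key. Strings for which compare() returns 0
    // produce identical keys, so any deterministic hash of the key gives
    // them identical hash values.
    const std::basic_string<CharT> key = coll.transform(s.data(), s.data() + s.size());

    // 32-bit FNV-1a over the key, feeding each element as four bytes
    // (assumed hash function, chosen only for brevity).
    std::uint32_t h = 2166136261u;
    for (const CharT c : key) {
        std::uint32_t v = static_cast<std::uint32_t>(c);
        for (int i = 0; i < 4; ++i) {
            h ^= (v & 0xFFu);
            h *= 16777619u;
            v >>= 8;
        }
    }
    return h;
}
```

A caller would obtain the facet with std::use_facet<std::collate<wchar_t>>(loc) and pass it together with the string. The cost of this approach is that the whole sort key is computed for every hash call, which is the price for the guarantee that equal-collating strings get equal hashes.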