[Python-Dev] Reject characters bigger than U+10FFFF and Solaris issues

Stefan Krah stefan at bytereef.org
Thu Dec 8 10:17:52 CET 2011


Victor Stinner <victor.stinner at haypocalc.com> wrote:

For localeconv(), it is the b'\xA0' byte string decoded from an encoding looking like ISO-8859-?? (b'\xA0' is not decodable from UTF-8). It looks like a bug in the decoder. It also looks like OpenIndiana doesn't use ISO-8859 locale anymore, only UTF-8 locales (which is much better!). I'm unable to reproduce the issue on my OpenIndiana VM.

I think that b'\xA0' is a valid thousands separator. The 'fi_FI' locale also uses that. Decimal.__format__() has to handle the 'n' specifier, which takes the thousands separator directly from localeconv(). Currently I have this horrible function to deal with the problem:

/* Convert decimal_point or thousands_sep, which may be multibyte or in
   the range [128, 255], to a UTF8 string. */
static PyObject *
dotsep_as_utf8(const char *s)
{
    PyObject *utf8;
    PyObject *tmp;
    wchar_t buf[2];
    size_t n;

    n = mbstowcs(buf, s, 2);
    if (n != 1) { /* Issue #7442 */
        PyErr_SetString(PyExc_ValueError,
            "invalid decimal point or unsupported "
            "combination of LC_CTYPE and LC_NUMERIC");
        return NULL;
    }
    tmp = PyUnicode_FromWideChar(buf, n);
    if (tmp == NULL) {
        return NULL;
    }
    utf8 = PyUnicode_AsUTF8String(tmp);
    Py_DECREF(tmp);
    return utf8;
}

The main issue is that there is no portable function mbst_to_utf8() that uses the current locale. If possible, it would be great to have such a thing in the C-API.
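The conversion such a helper would perform can be sketched in plain C, without the Python C-API. This is only an illustration under stated assumptions: `locale_char_to_utf8` and `encode_utf8` are hypothetical names, not existing functions, and the sketch assumes that the `wchar_t` values produced by `mbstowcs()` are Unicode code points. The C standard guarantees that only where `__STDC_ISO_10646__` is defined, so it may not hold on every platform or locale.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Encode one Unicode code point as UTF-8 into out (room for 4 bytes).
   Returns the number of bytes written (1-4), or 0 for surrogates and
   for values above U+10FFFF. */
size_t
encode_utf8(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return 0; /* surrogates are not valid scalar values */
        }
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Hypothetical helper: decode s, which must be exactly one character in
   the current LC_CTYPE encoding, and write it to out as a NUL-terminated
   UTF-8 string. Returns the UTF-8 length, or -1 on failure.
   Assumption: wchar_t values are Unicode code points. */
int
locale_char_to_utf8(const char *s, char *out, size_t outsize)
{
    wchar_t buf[2];
    unsigned char tmp[4];
    size_t n, len;

    n = mbstowcs(buf, s, 2);
    if (n != 1) {
        return -1; /* empty, undecodable, or more than one character */
    }
    len = encode_utf8((unsigned long)buf[0], tmp);
    if (len == 0 || len + 1 > outsize) {
        return -1;
    }
    memcpy(out, tmp, len);
    out[len] = '\0';
    return (int)len;
}
```

As with the function above, the result depends on LC_CTYPE matching the encoding of the LC_NUMERIC strings. If the two disagree, `mbstowcs()` may fail or decode the separator to the wrong character.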

I'm not sure why the b'\xA0' problem only occurs on Solaris. Many systems have this thousands separator.
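To check which bytes a given system reports for these separators, a small C helper along the following lines could be used. It is a rough sketch: `bytes_as_hex` and `print_numeric_separators` are hypothetical names introduced here, and the locale names available (and the bytes they return) vary between platforms.

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Write the bytes of s to out as uppercase hex pairs separated by
   spaces, e.g. "C2 A0". Returns the number of bytes dumped, or
   (size_t)-1 if out is too small. */
size_t
bytes_as_hex(const char *s, char *out, size_t outsize)
{
    size_t len = strlen(s);
    size_t i;
    char *p = out;

    if (len * 3 + 1 > outsize) {
        return (size_t)-1;
    }
    for (i = 0; i < len; i++) {
        if (i > 0) {
            *p++ = ' ';
        }
        snprintf(p, 3, "%02X", (unsigned int)(unsigned char)s[i]);
        p += 2;
    }
    *p = '\0';
    return len;
}

/* Switch to the named locale and print the raw bytes that localeconv()
   reports for decimal_point and thousands_sep. Returns 0 on success,
   -1 if the locale is not available or a string is too long to dump. */
int
print_numeric_separators(const char *locale_name)
{
    char dec[64];
    char sep[64];
    const struct lconv *lc;

    if (setlocale(LC_ALL, locale_name) == NULL) {
        return -1;
    }
    lc = localeconv();
    if (bytes_as_hex(lc->decimal_point, dec, sizeof dec) == (size_t)-1 ||
        bytes_as_hex(lc->thousands_sep, sep, sizeof sep) == (size_t)-1) {
        return -1;
    }
    printf("%s: decimal_point=[%s] thousands_sep=[%s]\n",
           locale_name, dec, sep);
    return 0;
}
```

Calling `print_numeric_separators("fi_FI")` and the same locale with a UTF-8 suffix, where both are installed, would show whether the separator arrives as a single byte or as a multibyte sequence.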

Stefan Krah


