[Python-Dev] Re: Be Honest about LC_NUMERIC [REPOST]

Tim Peters tim.one at comcast.net
Mon Sep 1 22:20:55 EDT 2003


[Tim]

In short, I can't be enthusiastic about the patch because it doesn't solve the only relevant locale problem I've actually run into. I understand that it may well solve many I haven't run into.

[Guido]

At this point in your life, Tim, is there any patch you could be truly enthusiastic about? :-)

Yes, but I can't be enthusiastic about a hack, and especially not about a hack that (as I said) doesn't solve the real-life problem spambayes has.

I'm asking because I'd like to see the specific problem that started this thread solved,

At this point, can you state what that specific problem was?

if necessary using a compromise that means the solution isn't perfect. I'm even willing to take a step back in the status quo, given that the status quo isn't perfect anyway, and that compromises mean something has to give.

Maybe the right solution is that we have to accept a hard-to-understand overcomplicated piece of code that we don't know how to maintain (but for which the author asserts that we won't have to do much maintenance in the foreseeable future).

I'm finding it hard to believe that anyone other than me and the author has actually read the patch! It's easy to understand. It's over-complicated for what Python needs, and would be dead easy to understand if the fluff got chopped. The fear of this code expressed in this thread is baffling to me, but I suspect it's due to initial shell-shock from the sheer bulk of the unnecessary code in the patch.

But maybe there's a simpler solution.

OTOH, the specific problem I'm acutely worried about would be better addressed by changing the way Python marshals float values.

So solve it.

Sorry, I don't foresee making time to do that.

The approach used by binary pickles seems entirely reasonable.

It's the best binary format we've got. It has problems with 754's special values (as recorded in PEP 42), and loses precision for VAX D format doubles (any double format with greater dynamic range or precision than IEEE-754 double). A decimal string is actually better on all those counts (dynamic range is no problem then; and some platforms can preserve IEEE special values via to-string-and-back conversion (Windows cannot)). Decimal strings lose on correctness only because of locale variations; depending on platform, they may also lose on speed, but I don't give much weight to speed here.
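As a rough illustration of the two representations being compared, the Python sketch below round-trips a float through an 8-byte big-endian IEEE-754 double (the layout binary pickles use for floats) and through a 17-significant-digit decimal string. The helper names are made up for this example, and it says nothing about locale effects or about non-IEEE platforms such as the VAX.

```python
import struct


def roundtrip_binary(x):
    """Pack x as an 8-byte big-endian IEEE-754 double and unpack it again."""
    return struct.unpack(">d", struct.pack(">d", x))[0]


def roundtrip_decimal(x):
    """Format x with 17 significant digits and parse the string back."""
    return float(format(x, ".17g"))


for value in (0.1, 1e308, 5e-324, 123456.0):
    # On an IEEE-754 platform both round trips give back the same double.
    print(value, roundtrip_binary(value) == value, roundtrip_decimal(value) == value)
```

Seventeen significant digits are enough to identify any IEEE-754 double uniquely, which is why the decimal route loses nothing here as long as the string->float conversion is correctly rounded.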

All we need to do is change the .pyc magic number. (There's undoubtedly user code in the world that would break because it requires interoperability between Python versions. So let the marshal module grow a way to specify the format.)

... Fair enough. So if we decide to use the donated conversion code, we should start by using it unconditionally. I predict that at some point in the future we'll find a platform whose quirks are not handled by the donated code, and where it's simpler to use a correct native equivalent than to try to fix the donated code; but I expect that point to be pretty far in the future, or the platform to be pretty far from the main stream.

Do read the patch. It amounts to

if decimal_point != '.':
    s/./decimal_point/

in one direction and

if decimal_point != '.':
    s/decimal_point/./

in the other. It gets its idea of decimal_point from the platform localeconv(), so if that doesn't lie it's hard to get wrong. In the double->string direction, though, the substitution code appears inadequate to me, since it doesn't try to strip out thousand-separation characters, which some locales produce. For example, on Windows,

>>> locale.setlocale(locale.LC_ALL, "german")
'German_Germany.1252'
>>> locale.format("%g", 123456.0, 1)
'123.456'

AFAICT, the patch will leave that output as "123.456". The string->double direction is much easier to be confident about for this reason.
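To make the two substitutions concrete, here is a minimal Python sketch. It is not the patch's actual C code: the function names are invented, and decimal_point and thousands_sep are passed in explicitly rather than read from localeconv(). The optional thousands_sep argument is a hypothetical extension that shows what stripping the grouping characters would look like.

```python
def c_to_locale(s, decimal_point):
    """string->double direction: rewrite a C-locale string for a locale-aware strtod()."""
    if decimal_point != ".":
        s = s.replace(".", decimal_point)
    return s


def locale_to_c(s, decimal_point, thousands_sep=""):
    """double->string direction: rewrite locale-formatted output into C-locale form.

    With thousands_sep left empty this is the bare substitution described above;
    passing it strips grouping characters first (a hypothetical extension).
    """
    if thousands_sep:
        s = s.replace(thousands_sep, "")
    if decimal_point != ".":
        s = s.replace(decimal_point, ".")
    return s


print(c_to_locale("3.14", ","))          # 3,14
print(locale_to_c("3,14", ","))          # 3.14
print(locale_to_c("123.456", ","))       # 123.456 (the grouping '.' survives)
print(locale_to_c("123.456", ",", "."))  # 123456
```

The third call is the German case: the '.' in the output is a grouping character, but after the bare substitution it is indistinguishable from a decimal point.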

... Here's yet another idea (which probably has flaws as well): instead of substituting the locale's decimal separator, rewrite strings like "3.14" as "314e-2" and rewrite strings like "3.14e5" as "314e3", then pass to strtod(), which assigns the same meaning to such strings in all locales.
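A minimal Python sketch of that rewriting, assuming the input is already a well-formed C-locale decimal literal (the function name is made up, and inputs without a '.' are passed through untouched):

```python
def remove_decimal_point(s):
    """Rewrite a decimal float literal so it contains no '.', adjusting the exponent."""
    mantissa, has_exp, exp = s.lower().partition("e")
    exponent = int(exp) if has_exp else 0
    intpart, dot, frac = mantissa.partition(".")
    if not dot:
        return s  # nothing to rewrite
    # Moving the point right by len(frac) digits lowers the exponent by the same amount.
    return f"{intpart}{frac}e{exponent - len(frac)}"


print(remove_decimal_point("3.14"))    # 314e-2
print(remove_decimal_point("3.14e5"))  # 314e3
```

The digits themselves are left unchanged; only the position of the point moves into the exponent, so the resulting string can be handed to strtod() without any locale-specific character in it.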

[long example]

I fail to see the relevance of the example to my proposed hack, except as a proof that the world isn't perfect -- but we already know that.

The point is that only perfect-rounding string->float routines can guarantee to produce identical doubles from mathematically equivalent decimal string representations. Finding counterexamples for non-perfect-rounding libraries is extremely difficult, and/or time-consuming, without studying the source code of a specific library intensely (almost certainly with more intensity than its author gave to writing it!), and I don't have time for that. It's a potential vulnerability. Answering whether it's an actual vulnerability in practice is much more work than I can give to it now.

Under my proposal, the number of digits converted would never change, so any sensitivity of the algorithm used to the number of digits converted would be irrelevant. I note that the strtod.c code that's currently in the Python source tree uses a similar (though opposite) trick: it converts the number to the form 0.<fraction>E<exponent> before handing it off to atof(). So my proposal still stands. I'm happy to entertain a proof that it's flawed but not one where the flawed input has over 5000 digits and depends on a flaw in the platform routines.

As hacks go, it's probably OK. I don't think it can fail on glibc-based platforms because I think they do perfect-rounding conversions; the Windows conversion routines aren't perfect-rounding, but we don't have their source code so it's impossible for me to give examples offhand where different results could be delivered, or even to swear that there are (or aren't) such cases. I give it a lot of credit for being truly threadsafe.

Note that it doesn't address the other half of the locale conversion problem (double->string), which, as I noted above, is the harder half (due to thousands_sep becoming an additional issue).


