[spambayes-dev] RE: [Python-Dev] RE: [Spambayes] Question (or possibly a bug report)

Tim Peters tim.one@comcast.net
Thu, 24 Jul 2003 23:08:34 -0400


[Skip Montanaro]

> Jeez, this locale crap makes Unicode look positively delightful...

Yes, it does! locale is what you get when someone complains they like to use ampersands instead of commas to separate thousands, and a committee thinks "hey! we've got all these great functions already, so why change them? instead we'll add mounds of hidden global state that affects lots of ancient functions in radical ways!". Make sure it's as hostile to threads as possible, decline to define any standard locale names beyond "C" and the empty string, and decline to define what anything except the "C" locale name means, and you're almost there. The finishing touches come in the function definitions, like this in strtod():

In other than the "C" locale, additional locale-specific subject sequence forms may be accepted.

What those may be aren't constrained in any way, of course.

locale can be cool in a monolithic, single-threaded, one-platform program, provided the platform C made up rules you can live with for the locales you care about. It's more of an API framework than a solution, and portable programs really can't use it except via forcing locale back to "C" every chance they get.
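For concreteness, here is a minimal sketch of "forcing locale back to C" around a piece of number-handling code. The helper name numeric_locale_forced_to_c is made up for illustration; only locale.setlocale(), locale.localeconv() and locale.atof() are real library calls. It inherits every problem described above: the setting is process-wide, so this is not safe if another thread can change the locale in the meantime.

```python
import locale
from contextlib import contextmanager


@contextmanager
def numeric_locale_forced_to_c():
    """Hypothetical helper: pin LC_NUMERIC to "C" for the duration of a block."""
    # Calling setlocale() with no second argument only queries the current setting.
    saved = locale.setlocale(locale.LC_NUMERIC)
    locale.setlocale(locale.LC_NUMERIC, "C")
    try:
        yield
    finally:
        # Put back whatever the host application had chosen.
        locale.setlocale(locale.LC_NUMERIC, saved)


with numeric_locale_forced_to_c():
    # Inside the block the decimal point is the "C" one, whatever the host set.
    decimal_point = locale.localeconv()["decimal_point"]
    value = locale.atof("0.12")
```

Restoring the saved setting on exit is a politeness toward the host application; a program that owns the whole process could just set "C" once and never look back.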

> The SB Windows triumvirate (Mark, Tim, Tony) seem to have narrowed down the problem quite a bit. Is there some way to worm around it? I take it with the unmarshalling problem it's not sufficient to specify floating point values without decimal points (e.g., 0.12 == 1e-1+2e-2).

When true division becomes the default, things like

    12/100

should work reliably regardless of locale -- i.e., don't use any float literals, and you can't get screwed by locale float-literal quirks. Today, absurd spellings like

    float(12)/100

can accomplish the same.

Changing Python is a better solution. The rule that an embedded Python requires that LC_NUMERIC be "C" isn't livable -- embedded Python is a fly trying to stare down an elephant, in Outlook's case. I dragged python-dev into this to illustrate that it's a very real problem in a very popular kick-ass Python app. Note that this same problem was discussed in more abstract terms by others here within the last few weeks, and I hope that making it more concrete helps get the point across.

The float-literal-in-.pyc problem could be addressed in several ways. Binary pickles, and the struct module, use a portable binary float format that isn't subject to locale quirks. I think marshal should be changed to use that too, by adding an additional marshal float format (so old marshals would continue to be readable, but new marshals may not be readable under older Pythons). Note that text-mode pickles of floats are vulnerable to locale nightmares too.
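To illustrate why a binary float format sidesteps the problem, here is a small sketch using the struct module. The function names dump_float and load_float are invented for this example, and the big-endian ">d" layout is just one reasonable choice, not a statement about what marshal does or should do. The point is that no decimal-point character ever appears in the stored bytes, so there is nothing for a locale to reinterpret.

```python
import struct


def dump_float(x):
    """Hypothetical helper: store a float as 8 bytes, big-endian IEEE-754 double."""
    return struct.pack(">d", x)


def load_float(data):
    """Hypothetical helper: read back a float written by dump_float()."""
    return struct.unpack(">d", data)[0]


encoded = dump_float(0.12)
decoded = load_float(encoded)

# The text form, by contrast, contains a "." that a locale-sensitive
# parser (strtod() outside the "C" locale) is free to treat differently.
text_form = repr(0.12)
```

The round trip is exact because the bytes are the double itself; the text form has to go back through a string-to-float conversion, which is exactly where locale gets its hooks in.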

> Is the proposed early specification of a locale in the config file sufficient to make things work?

I doubt it, as Outlook can switch locale any time it feels like it. We can't control that. I think we should set a line-tracing hook, and force locale back to "C" on every callback.
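Taken literally, the tracing-hook idea would look something like the sketch below. The names force_c_numeric and call_with_c_numeric are made up for illustration; sys.settrace(), sys.gettrace() and locale.setlocale() are the real calls. This is a sketch of the idea, not a recommendation: it slows every traced line, only covers Python code running in the thread that installed the hook, and does nothing about the host switching locale between two callbacks.

```python
import locale
import sys


def force_c_numeric(frame, event, arg):
    """Trace function: reset LC_NUMERIC to "C" on every trace event."""
    locale.setlocale(locale.LC_NUMERIC, "C")
    # Returning the function itself keeps line-level tracing on in this frame.
    return force_c_numeric


def call_with_c_numeric(func, *args):
    """Hypothetical wrapper: run func with the locale-resetting hook installed."""
    previous = sys.gettrace()
    sys.settrace(force_c_numeric)
    try:
        return func(*args)
    finally:
        # Reinstall whatever hook (debugger, coverage tool) was there before.
        sys.settrace(previous)


def current_decimal_point():
    return locale.localeconv()["decimal_point"]


point = call_with_c_numeric(current_decimal_point)
```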

> A foreign user of the nascent CSV module beat us up a bit during development about not supporting different locales (I guess in Brazil the default separator is a semicolon, which makes sense if your decimal "point" is a comma). Thank God we ignored him! ;-)

Ya, foreigners are no damn good.