[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Stephen J. Turnbull stephen at xemacs.org
Sun Apr 26 15:47:44 CEST 2009
Paul Moore writes:
> 2009/4/24 Stephen J. Turnbull <stephen at xemacs.org>:
> > Paul Moore writes:
> > > The pros for Martin's proposal are a uniform cross-platform interface,
> > > and a user-friendly API for the common case.
> >
> > A more accurate phrasing would be "... a user-friendly API for those who feel very lucky today." Which is the common case, of course, but spins a little differently.
>
> Sorry, but I think you're misrepresenting things. I'd have probably let you off if you'd missed out the "very" - but I do think that it's the common case. Consider:
If you need reliability, then you can't get it this way. The reason "very" is (somewhat) justified is that this kind of issue is a little like unemployment. You hardly ever meet someone who's 7.2% unemployed, but you probably know several who are 100% unemployed. If you see a broken encoding once, you're likely to see it a million times (spammers have the most broken software) or maybe have it raise an unhandled Exception a dozen times (in rate of using busted software, the spammers are closely followed by bosses---which would be very bad, eh, if 2/3 of the mail from your boss ends up in an undeliverables queue due to encoding errors that are unhandled by some filter in your mail pipeline).
> - Windows systems where broken Unicode (lone surrogates or whatever) isn't involved
> - Unix systems where the user's stated filesystem encoding is correct
>
> Can you honestly say that this isn't the vast majority of real-world environments?
Again, that's not the point. The point is that six-sigma reliability world-wide is not going to be very comforting to the poor souls who happen to have broken software in their environment sending broken encodings regularly, because they're going to be dealing with one or two sigmas, and that's just not good enough in a production environment.
> > If you didn't start with a valid string in a known encoding, you shouldn't treat it as characters because it's not.
>
> Again, that's the purist argument. If you have a string (of bytes, I guess) and a 99% certain guess as to the correct encoding, then I'd argue that, as long as (a) it's not mission-critical (lives or backups depend on it)
Assurance that you can even determine (a) is not provided by the PEP. There is no way to contain a problem if it should occur, because it's "just a string" and could go anywhere, and get converted back or otherwise manipulated in a context that doesn't know how to handle it (which might not even be Python if a C-level extension is involved). Given that Python has no internal mechanism for saying "in this area only valid Unicode will be accepted", it seems likely that mission-critical software will interact with this feature, if only indirectly (or perhaps only in software originally intended for use in the U.S. only, but then it gets exported, etc).
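
To make the containment worry concrete, here is a minimal sketch. It assumes the `surrogateescape` error-handler name under which the PEP's mechanism later shipped in Python 3.1; the file name is invented for illustration, and nothing here is taken from the PEP text itself.

```python
# Sketch only: "surrogateescape" is the handler name used in Python 3.1+;
# the file name below is made up.
raw = b"caf\xe9.txt"  # Latin-1 bytes, not valid UTF-8

# Decoding never fails: the undecodable byte 0xE9 becomes the lone
# surrogate U+DCE9, and the result looks like "just a string".
name = raw.decode("utf-8", "surrogateescape")

# Code that knows about the escape can recover the original bytes exactly.
round_trip = name.encode("utf-8", "surrogateescape")

# Code that does not know (a strict encoder in a logger, a database layer,
# a network library) fails later, possibly far from where the name was read.
try:
    name.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

print(repr(name), round_trip == raw, strict_failed)
```

The string carries no marker saying where it came from, so the strict encoder that finally trips over it may live in entirely unrelated code.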
> and (b) you have a means of failing relatively gracefully, you have every reason to make the assumption about encoding.
(b) is not provided in the PEP, either. We have no idea what the failure mode will be.
> After all, what's the alternative?
The alternative is to refuse to provide a simple standard way to decode unreliably, and in that way make the user responsible for an explicit choice about what level and kinds of unreliability they will accept.
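
One way such an explicit choice might look, as a rough sketch: the `Policy` enum and `decode_name` helper are hypothetical names, not part of any proposal in this thread. The only point is that there is no default, so the caller has to say which kind of unreliability is acceptable.

```python
from enum import Enum


class Policy(Enum):
    """Hypothetical set of choices; values are standard codec error handlers."""
    STRICT = "strict"               # refuse anything undecodable
    REPLACE = "replace"             # lossy: undecodable bytes become U+FFFD
    ROUND_TRIP = "surrogateescape"  # reversible, but yields lone surrogates


def decode_name(raw: bytes, encoding: str, policy: Policy) -> str:
    """Decode a byte string; the policy argument is deliberately mandatory."""
    return raw.decode(encoding, policy.value)


bad = b"caf\xe9.txt"  # invented example: Latin-1 bytes, not valid UTF-8

lossy = decode_name(bad, "utf-8", Policy.REPLACE)
reversible = decode_name(bad, "utf-8", Policy.ROUND_TRIP)

try:
    decode_name(bad, "utf-8", Policy.STRICT)
    strict_rejected = False
except UnicodeDecodeError:
    strict_rejected = True
```

Each call site then documents its own tolerance: a backup tool would presumably pick `STRICT` or `ROUND_TRIP`, while a directory listing shown to a user might settle for `REPLACE`.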
I realize that's unpalatable to most people who use Python to develop software, and so I'm unwilling to go even -0 on the PEP. However, to give one example, I've been following Mailman development for about 10 years, and it is a dismal story despite a group of developers very sympathetic to encoding and multicultural issues. As recently as Mailman 2.10 (IIRC) there were still bugs in encoding handling that could stop the show (ie, not only did the buggy post not get processed, but the exception propagated high enough to cause everything behind it in the queue to fail, too). I think it would be sad if ten years from now there was software using this technique and failing occasionally.
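
The show-stopping failure mode described above can be sketched as follows. This is not Mailman code; `process_queue`, `strict_filter`, and the sample messages are all hypothetical. It only illustrates the difference between an encoding error that is contained per message and one that propagates and stalls everything queued behind it.

```python
def strict_filter(raw: bytes) -> str:
    """Stand-in for a pipeline filter that assumes valid UTF-8."""
    return raw.decode("utf-8")


def process_queue(messages, handler):
    """Apply handler to each message, setting failures aside instead of stopping."""
    delivered, set_aside = [], []
    for raw in messages:
        try:
            delivered.append(handler(raw))
        except UnicodeError as exc:
            # Contain the failure to this one message; without this handler
            # the exception would abort the loop and block later messages.
            set_aside.append((raw, exc))
    return delivered, set_aside


queue = [b"hello", b"caf\xe9", b"world"]  # the middle entry is not valid UTF-8
delivered, set_aside = process_queue(queue, strict_filter)
```

Containment of this kind only works if every layer that might touch the text remembers to do it, which is the hard part in practice.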