[Python-Dev] Encoding detection in the standard library?
Jim Jewett jimjjewett at gmail.com
Tue Apr 22 05:30:18 CEST 2008
David Wolever wrote:
> IMO, encoding estimation is something that many web programs will have to deal with, so it might as well be built in; I would prefer the option to run
>     text=input.encode('guess')
> (or something similar) than relying on an external dependency or worse yet using a hand-rolled algorithm.
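For concreteness, here is a minimal sketch of how a codec named 'guess' could be wired up with today's codecs module. Everything in it is an assumption made for illustration: the candidate list, its order, and the latin-1 last resort are not from this thread, and trying decoders in turn is a much cruder technique than real statistical detection. It is written for bytes input, so the call is spelled decode rather than encode.

```python
import codecs

# Hypothetical candidate list; the choice and order are assumptions.
CANDIDATES = ("utf-8", "cp1252")


def guess_decode(data, errors="strict"):
    """Try each candidate in turn and return (text, bytes_consumed)."""
    raw = bytes(data)
    for name in CANDIDATES:
        try:
            return raw.decode(name), len(raw)
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this never fails.
    return raw.decode("latin-1"), len(raw)


def _search(name):
    # Codec search function: only answer for the hypothetical 'guess' name.
    if name == "guess":
        return codecs.CodecInfo(
            name="guess",
            encode=codecs.lookup("utf-8").encode,  # encoding side is arbitrary here
            decode=guess_decode,
        )
    return None


codecs.register(_search)
```

With that registered, data.decode('guess') works on any bytes object. The trial order matters a great deal: UTF-8 goes first because random legacy-encoded text rarely happens to be valid UTF-8, while almost anything is valid cp1252.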
The (still draft) HTML5 spec is trying to standardize error correction, so it includes all sorts of "if this fails, do X" rules. Encoding detection will be standardized as part of that, so there will be an external standard that we can reference.
http://dev.w3.org/html5/spec/Overview.html#determining
Note that this portion of the spec is probably not stable yet, as there has been some new analysis of which "wrong" answers give better results on real-world web pages.
See, e.g.,
http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2008-March/014127.html
http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2008-March/014190.html
There was also a recent analysis of how many characters it takes to sniff the encoding successfully X% of the time on today's web, though I can't find it at the moment.
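To make the "if this fails, do X" style concrete, here is a rough sketch of a fallback chain for picking an encoding. It is only loosely in the spirit of the draft and is not the spec's algorithm: the order of the steps, the regular expression, the 1024-byte sniff_limit, and the cp1252 fallback are all assumptions, and the names sniff_encoding, declared, sniff_limit and fallback are hypothetical.

```python
import codecs
import re

# Byte-order marks to check (UTF-32 is deliberately left out of this sketch).
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)

# Very rough <meta ... charset=...> matcher; a real prescan is more careful.
_META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?([A-Za-z0-9_-]+)", re.IGNORECASE
)


def _normalize(label):
    """Return Python's canonical codec name, or None if the label is unknown."""
    try:
        return codecs.lookup(label).name
    except LookupError:
        return None


def sniff_encoding(data, declared=None, sniff_limit=1024, fallback="cp1252"):
    # 1. Trust a usable transport-level declaration (e.g. an HTTP header).
    if declared:
        name = _normalize(declared)
        if name:
            return name
    # 2. If that fails, look for a byte-order mark.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # 3. If that fails, scan only the first sniff_limit bytes for a meta tag.
    match = _META_CHARSET.search(data[:sniff_limit])
    if match:
        name = _normalize(match.group(1).decode("ascii"))
        if name:
            return name
    # 4. If everything fails, fall back to a fixed default.
    return fallback
```

The sniff_limit parameter is where a "how many characters does it take" analysis would feed in: a larger limit catches late declarations at the cost of buffering more of the page before decoding can start.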
-jJ