[Python-Dev] Encoding detection in the standard library?

Jim Jewett jimjjewett at gmail.com
Tue Apr 22 05:30:18 CEST 2008

David Wolever wrote:

IMO, encoding estimation is something that many web programs will have to deal with, so it might as well be built in; I would prefer the option to run text=input.encode('guess') (or something similar) over relying on an external dependency or, worse yet, using a hand-rolled algorithm.

The (still draft) HTML5 spec is trying to get error correction standardized, so it includes all sorts of "if this fails, do X" rules. Encoding detection will be standardized too, so there will be an external standard that we can reference.

http://dev.w3.org/html5/spec/Overview.html#determining
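
To make the "if this fails, do X" idea concrete, here is a minimal sketch of a fallback chain in Python. It is not the algorithm from the draft spec; the function name guess_encoding, the helper _meta_charset, the 1024-byte prescan window, and the windows-1252 fallback are all assumptions chosen for illustration.

```python
import codecs
import re

# Byte-order marks, longest first so UTF-32-LE is not mistaken for UTF-16-LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Very rough match for <meta ... charset=...>; a real parser is stricter.
_META_CHARSET = re.compile(
    rb"""<meta[^>]*charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.IGNORECASE
)


def _meta_charset(head):
    """Return the charset named in a <meta> tag within head, or None."""
    match = _META_CHARSET.search(head)
    return match.group(1).decode("ascii") if match else None


def guess_encoding(data, declared=None, fallback="windows-1252", prescan=1024):
    """Guess the encoding of data (bytes) with a simple fallback chain.

    declared is an optional charset from the transport layer,
    for example an HTTP Content-Type header.
    """
    # Step 1: a byte-order mark wins outright.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # Step 2: trust an explicit declaration, but only if Python knows the codec.
    for candidate in (declared, _meta_charset(data[:prescan])):
        if candidate:
            try:
                return codecs.lookup(candidate).name
            except LookupError:
                pass  # unknown label: fall through to the next rule
    # Step 3: accept UTF-8 if the bytes decode cleanly, else use the fallback.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return fallback
```

A caller would then decode with something like data.decode(guess_encoding(data), errors="replace"). Note that declared labels are returned in Python's normalized codec spelling, which can differ from the label that appeared in the page.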

Note that this portion of the spec is probably not stable yet, as there was some new analysis on which "wrong" answers provide better results on real-world web pages.

e.g.,

http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2008-March/014127.html

http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2008-March/014190.html

There was also a recent analysis of how many characters it takes to sniff successfully X% of the time on today's web, though I'm not finding it at the moment.

-jJ
