[Python-Dev] Encoding detection in the standard library?

Stephen J. Turnbull stephen at xemacs.org
Wed Apr 23 06:56:47 CEST 2008


"Martin v. Löwis" writes:

> In any case, I'm very skeptical that a general "guess encoding" module would do a meaningful thing when applied to incorrectly encoded HTML pages.

That depends on whether you can get meaningful information about the language from the fact that you're looking at the page. In the browser context, for one, 99.44% of users are monolingual, so you only have to distinguish among the encodings for their language. In this context, a two-stage process of determining a category of encoding (e.g., ISO 8859, ISO 2022 7-bit, ISO 2022 8-bit multibyte, UTF-8, etc.) and then picking an encoding from the category according to a user-specified configuration has served Emacs/MULE users very well for about 20 years.

It does not work in a context where multiple encodings from the same category are in use (e.g., the email folder of a Polish Gastarbeiter in Berlin).

Nonetheless it is pretty useful for user agents like mail clients, web browsers, and editors.
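
To make the two-stage idea concrete, here is a rough Python sketch. It is not the Emacs/MULE algorithm; the function names, the category labels, and the preference table are all made up for illustration, and the ISO 2022 8-bit multibyte category is left out to keep it short. Stage one only looks at byte patterns; stage two is a plain lookup in the user's configuration.

```python
# Hypothetical sketch of two-stage encoding selection.
# Category labels and the preference table are illustrative assumptions,
# not taken from Emacs/MULE or from any existing library.


def detect_category(data: bytes) -> str:
    """Stage 1: classify the bytes into a broad category of encodings."""
    if all(b < 0x80 for b in data):
        # Only 7-bit bytes: an ESC byte suggests ISO 2022 7-bit escape
        # sequences, otherwise treat the data as plain ASCII.
        return "iso-2022-7bit" if b"\x1b" in data else "ascii"
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # 8-bit bytes that are not valid UTF-8: assume a single-byte
        # ISO 8859 encoding (this sketch ignores 8-bit multibyte ones).
        return "iso-8859"


# Stage 2 input: one preferred encoding per category, chosen by the user.
# This table assumes a user who mostly reads Polish and Japanese text.
USER_PREFERENCES = {
    "ascii": "ascii",
    "utf-8": "utf-8",
    "iso-2022-7bit": "iso2022_jp",
    "iso-8859": "iso8859_2",
}


def guess_encoding(data: bytes, preferences=USER_PREFERENCES) -> str:
    """Stage 2: map the detected category to the user's configured codec."""
    return preferences[detect_category(data)]


if __name__ == "__main__":
    sample = "żółć".encode("iso8859_2")
    codec = guess_encoding(sample)
    print(codec, sample.decode(codec))
```

The limitation is visible in the table: a German message in ISO 8859-1 lands in the same "iso-8859" category as Polish ISO 8859-2 text, so it gets decoded with whatever single encoding the user configured for that category.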


