[Python-Dev] Encoding detection in the standard library?

David Wolever wolever at cs.toronto.edu
Tue Apr 22 17:48:07 CEST 2008


On 22-Apr-08, at 12:30 AM, Martin v. Löwis wrote:

>> IMO, encoding estimation is something that many web programs will
>> have to deal with
>
> Can you please explain why that is? Web programs should not normally
> have the need to detect the encoding; instead, it should be specified
> always - unless you are talking about browsers specifically, which
> need to support web pages that specify the encoding incorrectly.

Two cases come immediately to mind: email and web forms.
When a web browser POSTs data, there is no standard way of
communicating which encoding it's using. There are some hints which
make it easier (accept-charset attributes, the encoding used to send
the page to the browser), but no guarantees.
Email is a smaller problem, because it usually has a helpful
content-type header, but that's no guarantee.
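
The situation described above (a declared charset that may be missing or wrong) can be sketched as follows. This is only an illustration under stated assumptions, not a proposed standard-library API: the helper names `declared_charset` and `decode_body`, and the fallback order `utf-8` then `cp1252`, are hypothetical choices made for the example.

```python
from email.message import Message


def declared_charset(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def decode_body(data, content_type=None, fallbacks=("utf-8", "cp1252")):
    """Decode bytes, trusting the declared charset first (hypothetical helper).

    Returns a (text, encoding_used) pair. The fallback list is an
    assumption for this sketch; a real application would pick
    candidates that match its users' locales.
    """
    candidates = []
    if content_type:
        declared = declared_charset(content_type)
        if declared:
            candidates.append(declared)
    candidates.extend(fallbacks)

    for name in candidates:
        try:
            return data.decode(name), name
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset: try the next candidate.
            continue

    # latin-1 maps every byte to a character, so this cannot fail,
    # although the result may be wrong for the actual data.
    return data.decode("latin-1"), "latin-1"
```

For a form POST with no charset in its content type, `decode_body(body, "application/x-www-form-urlencoded")` falls straight through to the fallback list; for an email part with a wrong declaration, the declared charset is tried first and then abandoned if decoding fails.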

Now, at the moment, the only data I have to support this claim is my
experience with DrProject in non-English locations. If I'm the only one who has had these sorts of problems, I'll go back
to "Unicode for Dummies".

>> so it might as well be built in; I would prefer the option to run
>> text=input.encode('guess') (or something similar) than relying on an
>> external dependency or worse yet using a hand-rolled algorithm.
>
> Ok, let me try differently then. Please feel free to post a patch to
> bugs.python.org, and let other people rip it apart.
> For example, I don't think it should be a codec, as I can't imagine it
> working on streams.
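
One way to see the stream concern raised above: a guess made from the first chunk of data can be invalidated by later bytes. The sketch below uses a hypothetical `guess_encoding` helper that simply tries a fixed list of candidates in order. It is a stand-in for illustration, not a real detection algorithm and not an existing codec.

```python
def guess_encoding(data, candidates=("ascii", "utf-8", "latin-1")):
    """Return the first candidate that decodes data without error.

    Hypothetical helper for illustration only; the candidate list
    and its order are assumptions of this sketch.
    """
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# The same stream, seen at two points in time.
first_chunk = b"caf"
whole_stream = b"caf\xc3\xa9"

early_guess = guess_encoding(first_chunk)
late_guess = guess_encoding(whole_stream)

# The answer changes once the non-ASCII bytes arrive, so a codec that
# had to commit to an encoding after the first chunk would be stuck.
print(early_guess, late_guess)
```

A plain function that receives the complete byte string avoids this, at the cost of buffering the whole input before decoding.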

As things frequently are, it seems like this is a much larger problem
than I originally believed.

I'll go back and take another look at the problem, then come back if
new revelations appear.


