[Python-Dev] Encoding detection in the standard library?

Mike Klaas mike.klaas at gmail.com
Wed Apr 23 01:10:17 CEST 2008


On 22-Apr-08, at 2:16 PM, Martin v. Löwis wrote:

>> Any program that needs to examine the contents of
>> documents/feeds/whatever on the web needs to deal with
>> incorrectly-specified encodings
>
> That's not true. Most programs that need to examine the contents of a
> web page don't need to guess the encoding. In most such programs, the
> encoding can be hard-coded if the declared encoding is not correct.
> Most such programs know what page they are webscraping, or else they
> couldn't extract the information out of it that they want to get at.

I certainly agree that if the target set of documents is small enough,
it is possible to hand-code the encoding. There are many applications,
however, that need to examine the content of an arbitrary, or at least
non-small, set of web documents. To name a few such applications:

> As for feeds - can you give examples of incorrectly encoded ones? (I
> don't ever use feeds, so I honestly don't know whether they are
> typically encoded incorrectly. I've heard they are often XML, in
> which case I strongly doubt they are incorrectly encoded.)

I also don't have much experience with feeds. My statement is based
on the fact that chardet, the tool that has been cited most in this
thread, was written specifically for use with the author's feed
parsing package.
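
For concreteness, here is a minimal sketch of the kind of guessing
chardet does (chardet is a third-party package, not part of the
stdlib; the file name and the 0.5 confidence threshold below are just
illustrative):

    # Minimal sketch using the third-party chardet package (not stdlib).
    # chardet.detect() returns a best-guess encoding plus a confidence
    # score; the file name and threshold are illustrative only.
    import chardet

    with open("feed.xml", "rb") as f:      # hypothetical feed fetched earlier
        raw = f.read()

    guess = chardet.detect(raw)
    # e.g. {'encoding': 'windows-1251', 'confidence': 0.87, ...}

    if guess["encoding"] and guess["confidence"] > 0.5:
        text = raw.decode(guess["encoding"], errors="replace")
    else:
        text = raw.decode("utf-8", errors="replace")  # last-resort fallback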

> As for "whatever" - can you give specific examples?

Not that I can substantiate. Documents and feeds cover a lot of what
is on the web--I was only trying to make the point that, on the web,
whenever an encoding can be specified, it will be specified
incorrectly for a significant fraction of documents.

>> (which, sadly, is rather common). The set of programs that need this
>> functionality is probably the same set that needs BeautifulSoup--I
>> think that set is larger than just browsers
>
> Again, can you give specific examples that are not web browsers?
> Programs needing BeautifulSoup may still not need encoding guessing,
> since they might still be able to hard-code the encoding of the web
> page they want to process.

Indeed, if it is only one site, it is pretty easy to work around. My
main use of Python is processing and analyzing hundreds of millions of
web documents, so it is pretty easy to see applications (which I have
listed above). I think that libraries like Mark Pilgrim's FeedParser
and BeautifulSoup are possible consumers of encoding guessing as well.
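
To illustrate how such a library consumes guessing: BeautifulSoup
ships a helper, UnicodeDammit, that tries any declared or suggested
encodings first and falls back to detection heuristics. A small
sketch, using the modern bs4 API (which may differ between versions;
the file name and candidate encodings are assumptions):

    # Sketch of consuming encoding guessing via bs4's UnicodeDammit.
    # It tries the suggested encodings first, then falls back to
    # detection heuristics; file name and candidates are illustrative.
    from bs4 import UnicodeDammit

    with open("page.html", "rb") as f:     # hypothetical document
        data = f.read()

    dammit = UnicodeDammit(data, ["utf-8", "windows-1252"])
    print(dammit.original_encoding)        # what it decided the bytes were
    text = dammit.unicode_markup           # decoded document, or None on failure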

> In any case, I'm very skeptical that a general "guess encoding"
> module would do a meaningful thing when applied to incorrectly
> encoded HTML pages.

Well, it does. I wish I could easily provide data on how often it is
necessary over the whole web, but that would be difficult to generate.
I can say that it is much more important to be able to parse all the
different kinds of encoding specification found on the web (the
Content-Type header, Content-Encoding, <meta http-equiv> tags, etc.),
including the malformed cases of these.
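
As a hedged sketch of what parsing those declarations looks like in
practice (the header value, HTML snippet, and helper names below are
made up; real-world declarations are considerably messier):

    # Sketch: pull a declared charset out of (a) an HTTP Content-Type
    # header value and (b) a <meta http-equiv> tag, tolerating some
    # common sloppiness (odd casing, stray quotes).  Illustrative only.
    import re

    def charset_from_content_type(value):
        """e.g. 'text/html; charset="ISO-8859-1"' -> 'iso-8859-1'"""
        m = re.search(r'''charset\s*=\s*["']?([\w.-]+)''', value, re.IGNORECASE)
        return m.group(1).lower() if m else None

    def charset_from_meta(html_bytes):
        """Scan the first few KB (usually ASCII-compatible enough)
        for a <meta ... charset=...> declaration."""
        head = html_bytes[:4096].decode("ascii", errors="replace")
        m = re.search(r'''<meta[^>]+charset\s*=\s*["']?([\w.-]+)''',
                      head, re.IGNORECASE)
        return m.group(1).lower() if m else None

    print(charset_from_content_type('text/html; charset=UTF-8'))
    # -> utf-8
    print(charset_from_meta(b'<meta http-equiv=Content-Type '
                            b'content="text/html; charset=windows-1252">'))
    # -> windows-1252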

I can also think of good arguments for excluding encoding detection
for maintenance reasons: is every case where the algorithm guesses
wrong a bug that needs to be fixed in the stdlib? That is an unbounded
commitment.

-Mike


