Message 146002 - Python tracker

Christian Heimes wrote: There is no generic and simple way to detect the encoding of a remote site. Sometimes the encoding is mentioned in the HTTP header, sometimes it's embedded in the <head> section of the HTML document.

FWIW for HTML pages the encoding can be specified in at least 3 places:

the HTTP headers: e.g. "content-type: text/html; charset=utf-8";
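the XML declaration: e.g. "<?xml version="1.0" encoding="utf-8" ?>";
the <meta> tag: e.g. "<meta http-equiv="Content-Type" content="text/html; charset=utf-8">".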

Browsers usually follow this order when searching for the encoding, meaning that HTTP headers have the highest priority. The XML declaration is sometimes (mis)used in (X)HTML pages.
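To make the lookup order concrete, here is a rough sketch of what a parser-side helper could look like. The name sniff_encoding, the two regular expressions, and the 1024-byte scan window are all assumptions made for illustration; real browsers and HTML parsers handle many more cases (BOMs, malformed markup, aliases), so treat this as a simplification rather than the actual algorithm.

```python
import re
from email.message import Message

# Deliberately simplified patterns (assumptions for this sketch only).
_XML_DECL = re.compile(rb'^\s*<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']')
_META = re.compile(rb'<meta[^>]+charset=["\']?([A-Za-z0-9._-]+)', re.IGNORECASE)


def sniff_encoding(content_type, body):
    """Return the declared encoding, or None if nothing is declared.

    content_type: value of the Content-Type HTTP header (str or None).
    body: the raw bytes of the document.
    """
    # 1. The HTTP header has the highest priority.
    if content_type:
        msg = Message()
        msg["Content-Type"] = content_type
        charset = msg.get_content_charset()
        if charset:
            return charset

    # 2. Then the XML declaration, 3. then the <meta> tag.
    # Only the start of the document is scanned (window size is arbitrary).
    head = body[:1024]
    for pattern in (_XML_DECL, _META):
        match = pattern.search(head)
        if match:
            return match.group(1).decode("ascii").lower()

    return None
```

When the header and the markup disagree, the header wins, e.g. sniff_encoding("text/html; charset=ISO-8859-1", b'<meta charset="utf-8">') gives the header's charset.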

Anyway, since urlopen() is a generic function that can download anything, it shouldn't look at XML declarations and meta tags -- that's something parsers should take care of.

Regarding the implementation, wouldn't having a new method on the file-like object returned by urlopen() be better? Maybe something like:

    >>> page = urlopen(some_url)
    >>> page.encoding  # get the encoding from the HTTP headers
    'utf-8'
    >>> page.decode()  # same as page.read().decode(page.encoding)
    '...'

The advantage of having these as new methods/attributes is that you can pass the 'page' around and other functions can get back the decoded content if/when they need to. OTOH other file-like objects don't have similar methods, so it might get a bit confusing.
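Until something like this exists, the idea can be approximated with a small wrapper. This is only a sketch: DecodedPage and FakeResponse are hypothetical names, not part of urllib, and the 'utf-8' fallback is an assumption for when the header names no charset. It relies on the fact that in Python 3 the HTTP response returned by urlopen() exposes its headers as an email.message.Message subclass, which provides get_content_charset().

```python
import io
from email.message import Message


class DecodedPage:
    """Hypothetical wrapper adding .encoding and .decode() to a response."""

    def __init__(self, response, default_encoding="utf-8"):
        self._response = response
        self._default = default_encoding  # assumed fallback, not from the proposal

    @property
    def encoding(self):
        # Only the HTTP headers are consulted, never the document body.
        return self._response.headers.get_content_charset(self._default)

    def decode(self, errors="strict"):
        # Same as response.read().decode(encoding).
        return self._response.read().decode(self.encoding, errors)


class FakeResponse:
    """Offline stand-in for the object returned by urlopen()."""

    def __init__(self, body, content_type):
        self._body = io.BytesIO(body)
        self.headers = Message()
        self.headers["Content-Type"] = content_type

    def read(self):
        return self._body.read()


page = DecodedPage(
    FakeResponse("caf\u00e9".encode("latin-1"), "text/html; charset=ISO-8859-1")
)
page_encoding = page.encoding
page_text = page.decode()
```

With a real connection, DecodedPage(urlopen(some_url)) would be used the same way. A wrapper keeps the response object itself untouched, which sidesteps the concern about file-like objects growing unfamiliar methods.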