[Python-Dev] teaching the new urllib

python-3000 at udmvt.ru
Wed Feb 4 10:14:16 CET 2009


On Tue, Feb 03, 2009 at 06:50:44PM -0500, Tres Seaver wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> The encoding information is available in the response headers, e.g.:
>
> - ---------------------- %< ---------------------------------
> $ wget -S --spider http://knuth.luther.edu/test.html
> - --18:46:24--  http://knuth.luther.edu/test.html
>            => `test.html'
> Resolving knuth.luther.edu... 192.203.196.71
> Connecting to knuth.luther.edu|192.203.196.71|:80... connected.
> HTTP request sent, awaiting response...
>   HTTP/1.1 200 OK
>   Date: Tue, 03 Feb 2009 23:46:28 GMT
>   Server: Apache/2.0.50 (Linux/SUSE)
>   Last-Modified: Mon, 17 Sep 2007 23:35:49 GMT
>   ETag: "2fcd8-1d8-43b2bf40"
>   Accept-Ranges: bytes
>   Content-Length: 472
>   Keep-Alive: timeout=15, max=100
>   Connection: Keep-Alive
>   Content-Type: text/html; charset=ISO-8859-1
> Length: 472 [text/html]
> 200 OK
> - ---------------------- %< ---------------------------------
>
> So, the OP's use case could be satisfied, assuming that the Py3K
> version of urllib sprouted a means of leveraging that header. In this
> sense, fetching the resource over HTTP is better than loading it
> from a file: information about the character set is explicit, and
> highly likely to be correct, at least for any resource people expect
> to render cleanly in a browser.

First of all, as was noted, Content-Type may have no charset parameter, or may be omitted altogether. But most important, and worst of all, the charset in Content-Type may have no relation to the charset declared in the document. Even worse, the charset declared in the document may have no relation to the charset actually used to encode the document. :(
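To make the point above concrete, here is a minimal sketch of what "leveraging that header" could look like while still allowing for a missing or wrong charset. The function names, the `override` parameter, and the latin-1 fallback are my own assumptions for illustration, not anything urllib provides. Note that a decode that succeeds without an error is still no proof that the header told the truth.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None.

    None is returned when the header is absent or has no charset parameter.
    """
    if not header_value:
        return None
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


def decode_body(raw, header_value=None, override=None, fallback="latin-1"):
    """Decode bytes, trying an explicit override first, then the header.

    `override` and `fallback` are hypothetical knobs for this sketch.
    If both candidates are missing, unknown, or fail, the fallback is used.
    """
    for name in (override, charset_from_content_type(header_value)):
        if name:
            try:
                return raw.decode(name)
            except (LookupError, UnicodeDecodeError):
                continue
    return raw.decode(fallback, errors="replace")
```

The explicit override is tried before the header on purpose: when the server's hint is wrong, only the caller (or the user) can correct it.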

Remember that the headers are supplied by the HTTP server, and the server itself has to read the document from a plain file, so there is no difference: there is no magic in being an HTTP server. Of course it is correct to give the web server some hints about the charset of byte-encoded text documents, but the web server will not stop working if the charset is unspecified or incorrect.

This use case is really important for those international segments of the Internet that have two or more conflicting character sets for their (single) alphabet. As an example, every Russian Internet user can tell you that a browser with no menu option for explicitly selecting the encoding of the current document is completely unusable.
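A small illustration of why the manual choice matters, assuming KOI8-R and Windows-1251 as the two conflicting Cyrillic encodings (the helper name and the candidate list are made up for this sketch). The same bytes decode without any error under several encodings, so the program cannot tell which result is the readable one; a human has to pick.

```python
def decode_candidates(raw, encodings=("koi8-r", "cp1251", "cp866", "iso8859-5")):
    """Decode the same bytes under several candidate encodings.

    Returns a dict mapping encoding name to decoded text, or to None
    when the bytes are not valid in that encoding.
    """
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


text = "Привет, мир"
raw = text.encode("koi8-r")  # what the server actually has on disk

for name, decoded in decode_candidates(raw).items():
    # Only the koi8-r line is readable Russian; the others are valid
    # Cyrillic characters in the wrong order, not decoding errors.
    print(name, decoded)
```

This is roughly what the encoding menu in a browser does: it shows the alternatives and lets the reader decide.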

-- Alexey Shpagin
