msg77603 - (view) |
Author: Adeodato Simó (dato) |
Date: 2008-12-11 11:23 |
This is very odd, but it was reproduced by people in #python as well. Compare, in python 2.5: >>> urllib.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').readline() 'From mechanix@lucretia.debian.net Tue Dec 11 11:32:47 2001\n' To the equivalent in python 3.0: >>> urllib.request.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').readline() b'f65\r\n' |
|
|
msg77605 - (view) |
Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *  |
Date: 2008-12-11 12:12 |
I don't reproduce the problem: >>> urllib.request.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').readline() b'From mechanix@lucretia.debian.net Tue Dec 11 11:32:47 2001\n' I connect through a http proxy. |
|
|
msg77611 - (view) |
Author: Daniel Diniz (ajaksu2) *  |
Date: 2008-12-11 13:13 |
Confirmed: Python 3.1a0 (py3k:67702, Dec 11 2008, 11:09:14) [GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu3)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> import urllib.request >>> urllib.request.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').readlines() [b'f65\r\n', b'From mechanix@lucretia.debian.net Tue Dec 11 11:32:47 2001\n', ... Perhaps it's related to the \r at read boundaries bug? |
|
|
msg77612 - (view) |
Author: Jean-Paul Calderone (exarkun) *  |
Date: 2008-12-11 13:53 |
The "f65" is the chunk length for the first chunk returned when requesting that URL. A proxy could easily hide this by switching to a different transfer encoding. |
|
|
msg77775 - (view) |
Author: Jeremy Hylton (jhylton)  |
Date: 2008-12-14 02:54 |
Does the same thing happen with 2.6? Jeremy On Thu, Dec 11, 2008 at 8:53 AM, Jean-Paul Calderone <report@bugs.python.org> wrote: > > Jean-Paul Calderone <exarkun@divmod.com> added the comment: > > The "f65" is the chunk length for the first chunk returned when > requesting that URL. A proxy could easily hide this by switching to a > different transfer encoding. > > ---------- > nosy: +exarkun > > _______________________________________ > Python tracker <report@bugs.python.org> > <http://bugs.python.org/issue4631> > _______________________________________ > _______________________________________________ > Python-bugs-list mailing list > Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/jeremy%40alum.mit.edu > > |
|
|
msg77778 - (view) |
Author: Adeodato Simó (dato) |
Date: 2008-12-14 10:02 |
> Does the same thing happen with 2.6? No, I can't reproduce with 2.6.1. |
|
|
msg77779 - (view) |
Author: Daniel Diniz (ajaksu2) *  |
Date: 2008-12-14 10:08 |
Jeremy: no, it doesn't. Python 2.6.1+ (release26-maint:67716M, Dec 13 2008, 10:30:52) [GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu3)] on linux2 ~/release26-maint$ ./python -c "import urllib; print urllib.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').readlines()[0]" From mechanix@lucretia.debian.net Tue Dec 11 11:32:47 2001 ~/release26-maint$ ./python -c "from __future__ import unicode_literals; import urllib; print urllib.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').readlines()[0]" From mechanix@lucretia.debian.net Tue Dec 11 11:32:47 2001 FWIW, there are trailing spurious bytes too (note read() gives bytes, while readlines() both bytes and strings in 3.0): >>> import urllib.request; content = urllib.request.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').read() Python 3.1a0 (py3k:67702, Dec 11 2008, 11:09:14) [GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu3)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> import urllib.request >>> content = urllib.request.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').read() >>> content[-30:] b'PGP SIGNATURE-----\n\n\n\n\n\r\n0\r\n\r\n' >>> content[:10] b'f65\r\nFrom ' While in 2.6: >>> import urllib >>> content = urllib.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').read() >>> content[-30:] '---END PGP SIGNATURE-----\n\n\n\n\n' |
|
|
msg77780 - (view) |
Author: Adeodato Simó (dato) |
Date: 2008-12-14 10:11 |
> FWIW, there are trailing spurious bytes too And in the middle of the document as well. Each time there's a chunk, I guess? |
|
|
msg77781 - (view) |
Author: Resul Cetin (ResulCetin) |
Date: 2008-12-14 10:20 |
I have the same problem with that code: (exchange USERNAME with your delicious username and PASSWORD with your delicious password): import urllib.request auth_handler = urllib.request.HTTPBasicAuthHandler() auth_handler.add_password('del.icio.us API', 'api.del.icio.us', USERNAME, PASSWORD) opener = urllib.request.build_opener(auth_handler) print(str(opener.open('https://api.del.icio.us/v1/posts/all').read(20), "utf-8")) And I don't use a proxy or anything like that. This makes python 3 completely unusable for me. And python 2.6 gives me what I want (the content of that virtual file) without any extra data in front or in the middle of the content. |
|
|
msg77789 - (view) |
Author: Daniel Diniz (ajaksu2) *  |
Date: 2008-12-14 11:31 |
Took me a bit of Wiresharking to find this out, the problem is that we are asking for the page using HTTP 1.1 in 3.0. Here's a workaround patch for those who need it quick, I have yet to look at urllib to see what can be fixed there. --- Index: Lib/http/client.py =================================================================== --- Lib/http/client.py (revision 67716) +++ Lib/http/client.py (working copy) @@ -600,7 +600,7 @@ class HTTPConnection: _http_vsn = 11 - _http_vsn_str = 'HTTP/1.1' + _http_vsn_str = 'HTTP/1.0' response_class = HTTPResponse default_port = HTTP_PORT --- This is what we send in 2.5 and 3.0: GET /cgi-bin/bugreport.cgi?mbox=yes;bug=123456 HTTP/1.0 User-Agent: Python-urllib/1.17 GET /cgi-bin/bugreport.cgi?mbox=yes;bug=123456 HTTP/1.1 Accept-Encoding: identity User-Agent: Python-urllib/3.1 |
|
|
msg77794 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2008-12-14 12:12 |
IMO we should downgrade urlopen to HTTP 1.0 in 3.0.1. Implementing chunked encoding will come later if people are interested in doing it. |
|
|
msg77796 - (view) |
Author: Daniel Diniz (ajaksu2) *  |
Date: 2008-12-14 12:17 |
Clarifying the diagnosis, the offending spurious bytes are only present when we use 3.0's GET above. That's because urllib.request.HTTPHandler asks for a vanilla http.client.HTTPConnection, which uses HTTP 1.1. IIUC, either we change the request version back to 1.0 (attached patch) or correct the way the response is processed (is it at all?). I think HTTPSHandler will also suffer from this, perhaps [Fancy]URLopener too. [Antoine: cool, an edit conflict that agrees with what I was about to post :D] |
|
|
msg77846 - (view) |
Author: Jeremy Hylton (jhylton)  |
Date: 2008-12-15 04:01 |
Brief update: The Python 2.x code works because readline() is provided by socket._fileobject. The Python 3.x code fails because it grabs the HTTPResponse.fp instance variable at the end of AbstractHTTPHandler.do_open. That method needs to pass the response to addinfourl(), but needs to have support for readline / readlines before it can do that. |
|
|
msg77879 - (view) |
Author: Jeremy Hylton (jhylton)  |
Date: 2008-12-15 19:15 |
I have a patch here that seems to work for the specific url and that passes all the tests. Can anyone check whether it works for a larger set of cases? I'm a little concerned because I don't understand the new io library in much detail. There's an override for _checkClosed() in the HTTPResponse that seems a little dodgy. I'll try to get someone to review that specifically. |
|
|
msg77889 - (view) |
Author: Daniel Diniz (ajaksu2) *  |
Date: 2008-12-15 23:00 |
I think your patch is good, but there may be another bug around: I wrote a script to check results of 3.x against 2.x, but many pages (http://groups.google.com/, http://en.wikipedia.org/) give 403: Forbidden for 3.x... but work with 2.x! If you think of this as a bug in 3.x, it could retry the request identifying as 2.x on 403. Other than that, your patch gives me identical results to 2.5/2.6 for 128 sites I tested (only a read(100) for each). Interestingly, my patched version gives a file closer to the buggy version in size, at 12700 bytes versus 12707. Your version agrees with 2.x and simple maths (128 x 100) in giving a 12799 bytes result. I have no idea why. HTH, Daniel |
|
|
msg78144 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2008-12-21 13:24 |
The patch should have at least a test so that we don't have a regression on this one. |
|
|
msg81360 - (view) |
Author: Daniel Diniz (ajaksu2) *  |
Date: 2009-02-08 01:17 |
Here's a test (in test_urllib2_localnet) that fails before the patch and passes after, mostly lifted from test_httplib: def test_chunked(self): expected_response = b"hello world" chunked_start = ( b'a\r\n' b'hello worl\r\n' b'1\r\n' b'd\r\n' ) response = [(200, [("Transfer-Encoding", "chunked")], chunked_start)] handler = self.start_server(response) data = self.urlopen("http://localhost:%s/" % handler.port) self.assertEquals(data, expected_response) Output: test test_urllib2_localnet failed -- Traceback (most recent call last): File "~/py3k/Lib/test/test_urllib2_localnet.py", line 390, in test_chunked self.assertEquals(data, expected_response) AssertionError: b'a\r\nhello worl\r\n1\r\nd\r\n' != b'hello world' To allow this test to work, the attached patch also touches FakeHTTPRequestHandler and TestUrlopen.urlopen. |
|
|
msg81385 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2009-02-08 15:58 |
On the principle, the test looks good. If you want to avoid the 'if "%" in value' hack, you can use the named-parameter form of string formatting: >>> "localhost:%(port)s" % dict(port=8080) 'localhost:8080' >>> "localhost" % dict(port=8080) 'localhost' |
|
|
msg81428 - (view) |
Author: Daniel Diniz (ajaksu2) *  |
Date: 2009-02-09 00:24 |
Antoine, Thanks for reviewing, here's an updated version. |
|
|
msg81433 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2009-02-09 02:23 |
The test looks good to me. I can't comment on the bugfix patch, but if it's ok to you, you can go ahead :) |
|
|
msg81610 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2009-02-10 22:31 |
I took a look at the patch and it looks ok, apart from the _checkClosed() hack (but I don't think there's any immediate solution). It should be noted that HTTPResponse.readline() will be awfully slow since, as HTTPResponse doesn't have peek(), readline() will call read() one byte at a time... (slow I/O is nothing new in py3k, however :-)) |
|
|
msg81611 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2009-02-10 23:03 |
Here is a patch without the _checkClosed() hack. The solution is simply to remove redundant _checkClosed() calls in IOBase (for example, readline() doesn't need to do an explicit `closed` check as it calls read()). |
|
|
msg81617 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2009-02-11 01:02 |
Committed in r69513, r69514. Thanks everyone! |
|
|