Issue 4631: urlopen returns extra, spurious bytes (original) (raw)

Issue4631

Created on 2008-12-11 11:23 by dato, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
urllib_bytes.diff ajaksu2,2008-12-14 12:17 Makes urlib.request.urlopen.HTTPHandler use HTTP 1.0 again
urllib-chunked.diff jhylton,2008-12-15 19:15
test_urllib_chunked2.diff ajaksu2,2009-02-09 00:24 Tests "Transfer-Encoding: chunked" handling in urllib
urllib-chunked2.diff pitrou,2009-02-10 23:03
Messages (23)
msg77603 - (view) Author: Adeodato Simó (dato) Date: 2008-12-11 11:23
This is very odd, but it was reproduced by people in #python as well. Compare, in python 2.5: >>> urllib.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').readline() 'From mechanix@lucretia.debian.net Tue Dec 11 11:32:47 2001\n' To the equivalent in python 3.0: >>> urllib.request.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').readline() b'f65\r\n'
msg77605 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-12-11 12:12
I don't reproduce the problem: >>> urllib.request.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').readline() b'From mechanix@lucretia.debian.net Tue Dec 11 11:32:47 2001\n' I connect through a http proxy.
msg77611 - (view) Author: Daniel Diniz (ajaksu2) * (Python triager) Date: 2008-12-11 13:13
Confirmed: Python 3.1a0 (py3k:67702, Dec 11 2008, 11:09:14) [GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu3)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> import urllib.request >>> urllib.request.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').readlines() [b'f65\r\n', b'From mechanix@lucretia.debian.net Tue Dec 11 11:32:47 2001\n', ... Perhaps it's related to the \r at read boundaries bug?
msg77612 - (view) Author: Jean-Paul Calderone (exarkun) * (Python committer) Date: 2008-12-11 13:53
The "f65" is the chunk length for the first chunk returned when requesting that URL. A proxy could easily hide this by switching to a different transfer encoding.
msg77775 - (view) Author: Jeremy Hylton (jhylton) (Python triager) Date: 2008-12-14 02:54
Does the same thing happen with 2.6? Jeremy On Thu, Dec 11, 2008 at 8:53 AM, Jean-Paul Calderone <report@bugs.python.org> wrote: > > Jean-Paul Calderone <exarkun@divmod.com> added the comment: > > The "f65" is the chunk length for the first chunk returned when > requesting that URL. A proxy could easily hide this by switching to a > different transfer encoding. > > ---------- > nosy: +exarkun > > _______________________________________ > Python tracker <report@bugs.python.org> > <http://bugs.python.org/issue4631> > _______________________________________ > _______________________________________________ > Python-bugs-list mailing list > Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/jeremy%40alum.mit.edu > >
msg77778 - (view) Author: Adeodato Simó (dato) Date: 2008-12-14 10:02
> Does the same thing happen with 2.6? No, I can't reproduce with 2.6.1.
msg77779 - (view) Author: Daniel Diniz (ajaksu2) * (Python triager) Date: 2008-12-14 10:08
Jeremy: no, it doesn't. Python 2.6.1+ (release26-maint:67716M, Dec 13 2008, 10:30:52) [GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu3)] on linux2 ~/release26-maint$ ./python -c "import urllib; print urllib.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').readlines()[0]" From mechanix@lucretia.debian.net Tue Dec 11 11:32:47 2001 ~/release26-maint$ ./python -c "from __future__ import unicode_literals; import urllib; print urllib.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').readlines()[0]" From mechanix@lucretia.debian.net Tue Dec 11 11:32:47 2001 FWIW, there are trailing spurious bytes too (note read() gives bytes, while readlines() both bytes and strings in 3.0): >>> import urllib.request; content = urllib.request.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').read() Python 3.1a0 (py3k:67702, Dec 11 2008, 11:09:14) [GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu3)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> import urllib.request >>> content = urllib.request.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').read() >>> content[-30:] b'PGP SIGNATURE-----\n\n\n\n\n\r\n0\r\n\r\n' >>> content[:10] b'f65\r\nFrom ' While in 2.6: >>> import urllib >>> content = urllib.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').read() >>> content[-30:] '---END PGP SIGNATURE-----\n\n\n\n\n'
msg77780 - (view) Author: Adeodato Simó (dato) Date: 2008-12-14 10:11
> FWIW, there are trailing spurious bytes too And in the middle of the document as well. Each time there's a chunk, I guess?
msg77781 - (view) Author: Resul Cetin (ResulCetin) Date: 2008-12-14 10:20
I have the same problem with that code: (exchange USERNAME with your delicious username and PASSWORD with your delicious password): import urllib.request auth_handler = urllib.request.HTTPBasicAuthHandler() auth_handler.add_password('del.icio.us API', 'api.del.icio.us', USERNAME, PASSWORD) opener = urllib.request.build_opener(auth_handler) print(str(opener.open('https://api.del.icio.us/v1/posts/all').read(20), "utf-8")) And I don't use a proxy or anything like that. This makes python 3 completely unusable for me. And python 2.6 gives me what I want (the content of that virtual file) without any extra data in front or in the middle of the content.
msg77789 - (view) Author: Daniel Diniz (ajaksu2) * (Python triager) Date: 2008-12-14 11:31
Took me a bit of Wiresharking to find this out, the problem is that we are asking for the page using HTTP 1.1 in 3.0. Here's a workaround patch for those who need it quick, I have yet to look at urllib to see what can be fixed there. --- Index: Lib/http/client.py =================================================================== --- Lib/http/client.py (revision 67716) +++ Lib/http/client.py (working copy) @@ -600,7 +600,7 @@ class HTTPConnection: _http_vsn = 11 - _http_vsn_str = 'HTTP/1.1' + _http_vsn_str = 'HTTP/1.0' response_class = HTTPResponse default_port = HTTP_PORT --- This is what we send in 2.5 and 3.0: GET /cgi-bin/bugreport.cgi?mbox=yes;bug=123456 HTTP/1.0 User-Agent: Python-urllib/1.17 GET /cgi-bin/bugreport.cgi?mbox=yes;bug=123456 HTTP/1.1 Accept-Encoding: identity User-Agent: Python-urllib/3.1
msg77794 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-12-14 12:12
IMO we should downgrade urlopen to HTTP 1.0 in 3.0.1. Implementing chunked encoding will come later if people are interested in doing it.
msg77796 - (view) Author: Daniel Diniz (ajaksu2) * (Python triager) Date: 2008-12-14 12:17
Clarifying the diagnosis, the offending spurious bytes are only present when we use 3.0's GET above. That's because urllib.request.HTTPHandler asks for a vanilla http.client.HTTPConnection, which uses HTTP 1.1. IIUC, either we change the request version back to 1.0 (attached patch) or correct the way the response is processed (is it at all?). I think HTTPSHandler will also suffer from this, perhaps [Fancy]URLopener too. [Antoine: cool, an edit conflict that agrees with what I was about to post :D]
msg77846 - (view) Author: Jeremy Hylton (jhylton) (Python triager) Date: 2008-12-15 04:01
Brief update: The Python 2.x code works because readline() is provided by socket._fileobject. The Python 3.x code fails because it grabs the HTTPResponse.fp instance variable at the end of AbstractHTTPHandler.do_open. That method needs to pass the response to addinfourl(), but needs to have support for readline / readlines before it can do that.
msg77879 - (view) Author: Jeremy Hylton (jhylton) (Python triager) Date: 2008-12-15 19:15
I have a patch here that seems to work for the specific url and that passes all the tests. Can anyone check whether it works for a larger set of cases? I'm a little concerned because I don't understand the new io library in much detail. There's an override for _checkClosed() in the HTTPResponse that seems a little dodgy. I'll try to get someone to review that specifically.
msg77889 - (view) Author: Daniel Diniz (ajaksu2) * (Python triager) Date: 2008-12-15 23:00
I think your patch is good, but there may be another bug around: I wrote a script to check results of 3.x against 2.x, but many pages (http://groups.google.com/, http://en.wikipedia.org/) give 403: Forbidden for 3.x... but work with 2.x! If you think of this as a bug in 3.x, it could retry the request identifying as 2.x on 403. Other than that, your patch gives me identical results to 2.5/2.6 for 128 sites I tested (only a read(100) for each). Interestingly, my patched version gives a file closer to the buggy version in size, at 12700 bytes versus 12707. Your version agrees with 2.x and simple maths (128 x 100) in giving a 12799 bytes result. I have no idea why. HTH, Daniel
msg78144 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-12-21 13:24
The patch should have at least a test so that we don't have a regression on this one.
msg81360 - (view) Author: Daniel Diniz (ajaksu2) * (Python triager) Date: 2009-02-08 01:17
Here's a test (in test_urllib2_localnet) that fails before the patch and passes after, mostly lifted from test_httplib: def test_chunked(self): expected_response = b"hello world" chunked_start = ( b'a\r\n' b'hello worl\r\n' b'1\r\n' b'd\r\n' ) response = [(200, [("Transfer-Encoding", "chunked")], chunked_start)] handler = self.start_server(response) data = self.urlopen("http://localhost:%s/" % handler.port) self.assertEquals(data, expected_response) Output: test test_urllib2_localnet failed -- Traceback (most recent call last): File "~/py3k/Lib/test/test_urllib2_localnet.py", line 390, in test_chunked self.assertEquals(data, expected_response) AssertionError: b'a\r\nhello worl\r\n1\r\nd\r\n' != b'hello world' To allow this test to work, the attached patch also touches FakeHTTPRequestHandler and TestUrlopen.urlopen.
msg81385 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-02-08 15:58
On the principle, the test looks good. If you want to avoid the 'if "%" in value' hack, you can use the named-parameter form of string formatting: >>> "localhost:%(port)s" % dict(port=8080) 'localhost:8080' >>> "localhost" % dict(port=8080) 'localhost'
msg81428 - (view) Author: Daniel Diniz (ajaksu2) * (Python triager) Date: 2009-02-09 00:24
Antoine, Thanks for reviewing, here's an updated version.
msg81433 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-02-09 02:23
The test looks good to me. I can't comment on the bugfix patch, but if it's ok to you, you can go ahead :)
msg81610 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-02-10 22:31
I took a look at the patch and it looks ok, apart from the _checkClosed() hack (but I don't think there's any immediate solution). It should be noted that HTTPResponse.readline() will be awfully slow since, as HTTPResponse doesn't have peek(), readline() will call read() one byte at a time... (slow I/O is nothing new in py3k, however :-))
msg81611 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-02-10 23:03
Here is a patch without the _checkClosed() hack. The solution is simply to remove redundant _checkClosed() calls in IOBase (for example, readline() doesn't need to do an explicit `closed` check as it calls read()).
msg81617 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-02-11 01:02
Committed in r69513, r69514. Thanks everyone!
History
Date User Action Args
2022-04-11 14:56:42 admin set github: 48881
2009-02-11 01:02:33 pitrou set status: pending -> closedresolution: accepted -> fixedmessages: +
2009-02-11 00:39:28 pitrou set status: open -> pendingresolution: accepted
2009-02-10 23:03:48 pitrou set files: + urllib-chunked2.diffmessages: +
2009-02-10 22:31:04 pitrou set messages: +
2009-02-09 02:23:13 pitrou set messages: +
2009-02-09 00:24:38 ajaksu2 set files: - test_urllib_chunked.diff
2009-02-09 00:24:23 ajaksu2 set files: + test_urllib_chunked2.diffmessages: +
2009-02-08 15:58:59 pitrou set messages: +
2009-02-08 01:17:58 ajaksu2 set files: + test_urllib_chunked.diffmessages: +
2009-01-28 19:19:09 exarkun set nosy: - exarkun
2009-01-28 18:57:18 trodgers set nosy: + trodgers
2009-01-10 04:12:55 craigh set nosy: + craigh
2008-12-21 13:24:51 pitrou set messages: +
2008-12-21 09:43:26 loewis set priority: critical -> release blocker
2008-12-15 23:00:12 ajaksu2 set messages: +
2008-12-15 19:15:21 jhylton set files: + urllib-chunked.diffmessages: +
2008-12-15 04:01:47 jhylton set messages: +
2008-12-15 03:59:44 jhylton link issue3761 superseder
2008-12-14 22:00:56 jhylton set assignee: jhylton
2008-12-14 12:17:17 ajaksu2 set files: + urllib_bytes.diffkeywords: + patchmessages: +
2008-12-14 12:12:23 pitrou set nosy: + pitroumessages: +
2008-12-14 12:05:21 pitrou set priority: criticaltype: behavior
2008-12-14 11:31:53 ajaksu2 set messages: +
2008-12-14 10:20:45 ResulCetin set nosy: + ResulCetinmessages: +
2008-12-14 10:11:42 dato set messages: +
2008-12-14 10:08:12 ajaksu2 set messages: +
2008-12-14 10:02:43 dato set messages: +
2008-12-14 02:54:55 jhylton set nosy: + jhyltonmessages: +
2008-12-11 13:53:58 exarkun set nosy: + exarkunmessages: +
2008-12-11 13:13:23 ajaksu2 set nosy: + ajaksu2messages: +
2008-12-11 12:12:45 amaury.forgeotdarc set nosy: + amaury.forgeotdarcmessages: +
2008-12-11 11:23:01 dato create