Issue 1521: string.decode() fails on long strings

Created on 2007-11-29 15:33 by eisele, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name: getargs.patch
Uploaded: amaury.forgeotdarc, 2007-11-29 22:56
Messages (16)
msg57932 - Author: Andreas Eisele (eisele) Date: 2007-11-29 15:33
s.decode("utf-8") sometimes silently truncates the result if s has more than 2E9 bytes, and sometimes raises a fairly incomprehensible exception:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/lib64/python2.5/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
TypeError: utf_8_decode() argument 1 must be (unspecified), not str
msg57934 - Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2007-11-29 16:11
Can you attach a (small) example that demonstrates the bug?
msg57935 - Author: Andreas Eisele (eisele) Date: 2007-11-29 16:15
For instance:

Python 2.5.1 (r251:54863, Aug 30 2007, 16:15:51)
[GCC 4.1.0 (SUSE Linux)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
__[1] >>> s=" "*int(5E9)
6.050000 sec
__[1] >>> u=s.decode("utf-8")
4.710000 sec
__[1] >>> len(u)
705032704
__[2] >>> len(s)
5000000000
__[3] >>>

I would have expected both lengths to be 5E9.
msg57936 - Author: Andreas Eisele (eisele) Date: 2007-11-29 16:20
An instance of the other problem:

Python 2.5.1 (r251:54863, Aug 30 2007, 16:15:51)
[GCC 4.1.0 (SUSE Linux)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
__[1] >>> s=" "*int(25E8)
2.990000 sec
__[1] >>> u=s.decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/cl-home/eisele/lns-root-07/lib/python2.5/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
TypeError: utf_8_decode() argument 1 must be (unspecified), not str
__[1] >>>
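Both reports are consistent with the string length being squeezed into a 32-bit signed C int somewhere along the way. The following is only a minimal sketch of that arithmetic, assuming a 32-bit two's-complement int; `as_c_int` is a hypothetical helper written for illustration and is not part of CPython.

```python
def as_c_int(n, bits=32):
    """Simulate storing the integer n in a signed two's-complement C int.

    Assumption: the C int is `bits` wide (32 on the platforms discussed here).
    """
    low = n & ((1 << bits) - 1)      # keep only the low `bits` bits
    if low >= 1 << (bits - 1):       # top bit set: the value reads as negative
        low -= 1 << bits
    return low


print(as_c_int(5000000000))   # 705032704, the truncated len(u) from the first report
print(as_c_int(2500000000))   # -1794967296, a negative "length"
```

A length of 5E9 keeps only its low 32 bits and stays positive, which would explain the silent truncation. A length of 25E8 lands in the negative range, which plausibly accounts for the argument being rejected with a TypeError.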
msg57938 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2007-11-29 17:14
I don't have a 64-bit machine to test with, but it seems to me that there is a problem in convertsimple() in getargs.c: the t# and w# formats use the buffer interface, but the code stores the buffer length in an int! Look for the variables declared as "int count;". I suggest replacing them with Py_ssize_t in both places. Shouldn't the compiler emit a warning in this case?
msg57962 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2007-11-29 22:56
Here is a patch, with a unit test (I was surprised that test_bigmem.py already contained a test_decode function, which was left empty). But I still don't have access to a 64-bit machine. Can someone check whether the new tests in test_bigmem.py fail without the patch, and whether the patch to getargs.c corrects the problem?
msg57969 - Author: Andreas Eisele (eisele) Date: 2007-11-30 09:36
Thanks a lot for the patch, which indeed seems to solve the issue. Alas, the extended test code still does not catch the problem, at least in my installation. Someone with a better understanding of how these tests work and with access to a 64bit machine should still have a look.
msg57970 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2007-11-30 09:58
> Alas, the extended test code still does not catch the problem

Can you please try again after changing, in the tests, minsize=_2G into minsize=_2G * 2 + 2? The length has to be greater than 4G for an int to lose digits.
msg57972 - Author: Andreas Eisele (eisele) Date: 2007-11-30 10:21
Tried @bigmemtest(minsize=_2G*2+2, memuse=3) but no change; the test is done only once with a small size (5147). Apparently something does not work as expected here. I'm trying this with 2.6 (Revision 59231).
msg57973 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2007-11-30 11:00
> the test is done only once with a small size (5147)

How do you run the test? Do you specify a maximum available size? If you run test_bigmem.py directly, try running it with an additional argument like this:

    ./test_bigmem.py 7G

If you run regrtest.py, you should add an option like "-M 7G" (assuming you have enough RAM...).
msg57993 - Author: Andreas Eisele (eisele) Date: 2007-11-30 17:49
> How do you run the test? Do you specify a maximum available size?

I naively assumed that running "make test" from the top level would be clever about finding plausible parameters. However, it runs the bigmem tests in a minimalistic way, skipping essentially all the interesting bits. Thanks for the hints on giving the maximum available size explicitly, which work in principle but make testing rather slow. Also, if the encode/decode tests are decorated with @bigmemtest(minsize=_2G*2+2, memuse=3), one needs to specify at least -M 15g; otherwise the tests are still skipped. No wonder that people do not normally run them...
msg57994 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2007-11-30 17:56
> @bigmemtest(minsize=_2G*2+2, memuse=3)

minsize=_2G + 2 should trigger your second problem (where the size wraps to a negative number). Then 7G is "enough" for the test to run.
msg57995 - Author: Andreas Eisele (eisele) Date: 2007-11-30 18:05
> Then 7G is "enough" for the test to run.

Yes, indeed, thanks for pointing this out. It runs and detects an ERROR, and after applying your patch it succeeds. What else needs to be done to make sure your patch finds its way to the Python core?
msg57996 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2007-11-30 18:15
> What else needs to be done to make sure your patch finds its way
> to the Python core?

Nothing, I suppose. It looks like an inconsistency in the source code, and the patch happens to correct a real problem. I will commit it in a few hours.
msg58008 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2007-11-30 20:55
Committed revision 59241. Will backport after the buildbots run the test.
msg58015 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2007-11-30 21:55
Committed revision 59244 in release25-maint.
History
Date User Action Args
2022-04-11 14:56:28 admin set github: 45862
2007-11-30 21:55:07 amaury.forgeotdarc set status: open -> closed; resolution: fixed; messages: +
2007-11-30 20:55:30 amaury.forgeotdarc set assignee: amaury.forgeotdarc; messages: +
2007-11-30 18:15:50 amaury.forgeotdarc set messages: +
2007-11-30 18:05:49 eisele set messages: +
2007-11-30 17:56:07 amaury.forgeotdarc set messages: +
2007-11-30 17:49:40 eisele set messages: +
2007-11-30 11:00:03 amaury.forgeotdarc set messages: +
2007-11-30 10:21:01 eisele set messages: +
2007-11-30 09:58:56 amaury.forgeotdarc set messages: +
2007-11-30 09:36:05 eisele set messages: +
2007-11-29 22:56:16 amaury.forgeotdarc set files: + getargs.patch; messages: +
2007-11-29 17:14:52 amaury.forgeotdarc set nosy: + amaury.forgeotdarc; messages: +
2007-11-29 16:20:57 eisele set messages: +
2007-11-29 16:15:22 eisele set messages: +
2007-11-29 16:11:20 doerwalter set nosy: + doerwalter; messages: +
2007-11-29 15:33:06 eisele create