Issue 3531: file read preallocs 'size' bytes which can cause memory problems (original) (raw)

Issue3531

Created on 2008-08-09 02:29 by dalke, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (12)
msg70924 - (view) Author: Andrew Dalke (dalke) * (Python committer) Date: 2008-08-09 02:29
I wrote a buggy PNG parser which ended up doing several file.read(large value). It causes a MemoryError, which was strange because the file was only a few KB long. I tracked it down to the implementation of read(). When given a size hint it preallocates the return string with that size. If the hint is for 10MB then the string returned will be preallocated fro 10MB, even if the actual read is empty. Here's a reproducible BLOCKSIZE = 10*1024*1024 f=open("empty.txt", "w") f.close() f=open("empty.txt") data = [] for i in range(10000): s = f.read(BLOCKSIZE) assert len(s) == 0 data.append(s) I wasn't sure if this is properly a bug, but since the MemoryError exception I got was quite unexpected and required digging into the source code to figure out, I'll say that it is.
msg70925 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-08-09 09:39
I can't reproduce, your code snippet works fine. What Python version is it?
msg70926 - (view) Author: Andrew Dalke (dalke) * (Python committer) Date: 2008-08-09 10:00
I tested it with Python 2.5 on a Mac, Python 2.5 on FreeBSD, and Python 2.6b2+ (from SVN as of this morning) on a Mac. Perhaps the memory allocator on your machine is making a promise it can't keep?
msg70927 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-08-09 10:15
Perhaps. I'm under Linux. However, at the end of the file_read() implementation in fileobject.c, you can find the following lines: if (bytesread != buffersize) _PyString_Resize(&v, bytesread); Which means that the string *is* resized at the end.
msg70930 - (view) Author: Andrew Dalke (dalke) * (Python committer) Date: 2008-08-09 11:26
You're right. I mistook the string implementation for the list one which does keep a preallocated section in case of growth. Strings of course don't grow so there's no need for that. I tracked the memory allocation all the way down to obmalloc.c:PyObject_Realloc . The call goes to realloc(p, nbytes) which is a C lib call. It appears that the memory space is not reallocated. That was enough to be able to find the python-dev thread "Darwin's realloc(...) implementation never shrinks allocations" from Jan. 2005, Bob Ippolito's post "realloc.. doesn’t?" (http://bob.pythonmac.org/archives/2005/01/01/realloc-doesnt/ ) and Issue1092502 . Mind you, I also get the problem on FreeBSD 2.6 so it isn't Darwin specific.
msg70931 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-08-09 11:31
Le samedi 09 août 2008 à 11:26 +0000, Andrew Dalke a écrit : > Mind you, I also get the problem on FreeBSD 2.6 so it isn't Darwin > specific. Darwin and the BSD's supposedly share a lot of common stuff. But FreeBSD 2.6 is a bit old, isn't it?
msg70934 - (view) Author: Andrew Dalke (dalke) * (Python committer) Date: 2008-08-09 16:42
FreeBSD is why my hosting provider uses. Freebsd.org calls 2.6 "legacy" but the latest update was earlier this year. There is shared history with Macs. I don't know the details though. I just point out that the problem isn't only on Darwin.
msg73010 - (view) Author: Anthon van der Neut (anthon) * Date: 2008-09-11 09:37
FWIW: I have performance problems on Windows XP (SP2) with Python 2.5.1 that could be caused by this behaviour. My code regularily calculates the sha1 sum of 10.000 files and because in another reuse of the code had to deal with files too big to fit in memory I set a limit of 256Mb. It looks like that is now allocated and deallocated for every one of the 10.000 files, making things *very* slow.
msg73022 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-09-11 13:23
> My code regularily calculates the sha1 sum of 10.000 files and because > in another reuse of the code had to deal with files too big to fit in > memory I set a limit of 256Mb. Why don't you use a sensible buffer size, e.g. 1MB? Reading data in 256MB chunks sounds foolish in any case.
msg73023 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-09-11 13:28
Andrew, as for memory reallocation issues, you may take a look at #3526 where someone has similar problems on SunOS. If nobody objects, I will close the present bug as invalid.
msg73030 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-09-11 14:07
Le jeudi 11 septembre 2008 à 16:01 +0200, Anthon van der Neut a écrit : > The thing however was resolved by reading multiple smaller chunks indeed > 1Mb if the filesize exceeds 1Mb (in the latter case the original read() > is done. It's too complicated. Just use chunks in all cases (even small files) and you are done. There should be no visible performance downside to doing so. Using fixed-size chunks to read binary data from a file of an unknown size isn't a Python-specific idiom, really. It's the same in C, C++, PHP, etc.
msg73066 - (view) Author: Andrew Dalke (dalke) * (Python committer) Date: 2008-09-11 22:58
I'm still undecided on if this is a bug or not. The problem occurs even when I'm not reading "data from a file of an unknown size." My example causes a MemoryError on my machine even though the file I'm reading contains 0 bytes. The problem is Python's implementation is "alloc the requested bytes and truncate if needed" vs what I expected "read chunks at a time up to the requested number of bytes." There's nothing in the documentation which states the implementation, although "Note that this method may call the underlying C function fread more than once in an effort to acquire as close to size bytes as possible." leans slightly towards my interpretation. I looked a little for real-world cases that could cause a denial-of- service attack but didn't find one. If there is a problem, it will occur very rarely. Go ahead an mark it as "will not fix" or something similar. I don't think the change in the code is justifiable.
History
Date User Action Args
2022-04-11 14:56:37 admin set github: 47781
2008-09-13 21:10:47 pitrou set status: open -> closedresolution: rejected
2008-09-11 22:58:40 dalke set messages: +
2008-09-11 14:07:12 pitrou set messages: + title: file read preallocs 'size' bytes which can cause memory problems -> file read preallocs 'size' bytes which can cause memory problems
2008-09-11 13:28:23 pitrou set messages: +
2008-09-11 13:23:48 pitrou set messages: +
2008-09-11 09:37:49 anthon set nosy: + anthonmessages: +
2008-08-09 16:42:06 dalke set messages: +
2008-08-09 11:31:41 pitrou set messages: +
2008-08-09 11:26:49 dalke set messages: +
2008-08-09 10:15:08 pitrou set messages: +
2008-08-09 10:00:09 dalke set messages: +
2008-08-09 09:39:07 pitrou set nosy: + pitroumessages: +
2008-08-09 02:29:44 dalke create