Issue 4565: Rewrite the IO stack in C
Created on 2008-12-06 15:10 by ialbert, last changed 2022-04-11 14:56 by admin. This issue is now closed.
Messages (43)
Author: Istvan Albert (ialbert)
Date: 2008-12-06 15:10
The write performance into text files is substantially slower (5x-8x) than that of python 2.5. This makes python 3.0 unsuited to any application that needs to write larger amounts of data.
------------test code follows -----------------------
import time

lo, hi, step = 10**5, 10**6, 10**5

# writes increasingly more lines to a file
for N in range(lo, hi, step):
    fp = open('foodata.txt', 'wt')
    start = time.time()
    for i in range(N):
        fp.write('%s\n' % i)
    fp.close()
    stop = time.time()
    print("%s\t%s" % (N, stop - start))
Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *
Date: 2008-12-06 15:40
This is expected: the I/O stack has been completely rewritten... in almost pure-python code.
There is a project to rewrite it in C. It started at http://svn.python.org/view/sandbox/trunk/io-c/
Author: Istvan Albert (ialbert)
Date: 2008-12-06 18:26
Well I would strongly dispute that anyone other than the developers expected this. The release documentation states:
"The net result of the 3.0 generalizations is that Python 3.0 runs the pystone benchmark around 10% slower than Python 2.5."
There is no indication of an order-of-magnitude slowdown in read/write. I believe that this issue is extremely serious! IO is an essential part of a program, and today we live in the world of gigabytes of data. I am reading reports of even more severe IO slowdowns than what I saw:
http://bugs.python.org/issue4561
Java has had a hard time getting rid of the "it is very slow" stigma even after getting a JIT compiler, so there is a danger there for a lasting negative impression.
Author: Antoine Pitrou (pitrou) *
Date: 2008-12-06 21:52
Hi Amaury,
> There is a project to rewrite it in C
Thanks for publicizing this. I'm a bit surprised by the adopted approach. It seems you are merely translating the Python code into C. I think the proper approach for the buffered IO classes would be to use a fixed-size buffer which never gets reallocated.
If you look at bufferedwriter2.patch in #3476, I had rewritten BufferedWriter using a fixed-size buffer (although not for performance reasons). I think it would be a good starting point for a C implementation.
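To make the fixed-size buffer idea concrete, here is a minimal Python sketch. It is not the code from bufferedwriter2.patch or from the io-c branch; the class name FixedBufferWriter is hypothetical, and it assumes a blocking raw object whose write() always consumes everything it is given.

```python
import io


class FixedBufferWriter:
    """Toy buffered writer (hypothetical): one preallocated buffer, never resized."""

    def __init__(self, raw, buffer_size=8192):
        self._raw = raw                      # assumed: blocking, write() takes bytes-like data
        self._buf = bytearray(buffer_size)   # allocated once, up front
        self._view = memoryview(self._buf)   # lets us slice without copying
        self._pos = 0                        # number of pending bytes in the buffer

    def write(self, data):
        data = memoryview(data)
        total = len(data)
        while len(data) > 0:
            room = len(self._buf) - self._pos
            if room == 0:
                # Buffer is full: drain it instead of growing it.
                self.flush()
                continue
            chunk = data[:room]
            n = len(chunk)
            self._view[self._pos:self._pos + n] = chunk
            self._pos += n
            data = data[n:]
        return total

    def flush(self):
        if self._pos:
            self._raw.write(self._view[:self._pos])
            self._pos = 0


raw = io.BytesIO()
writer = FixedBufferWriter(raw, buffer_size=4)
writer.write(b"hello world")
writer.flush()
print(raw.getvalue())
```

The buffer is filled and drained in place, so memory use stays constant regardless of how much is written; the trade-off is that a large write is split into several raw writes.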
Author: Christian Heimes (christian.heimes) *
Date: 2008-12-06 22:01
For more bug reports see #4533 and #4561.
I suggest we close this bug report as duplicate and keep the discussion in #4561.
Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *
Date: 2008-12-06 22:48
> I'm a bit surprised by the adopted approach. It seems you are merely translating the Python code into C. I think the proper approach for the buffered IO classes would be to use a fixed-size buffer which never gets reallocated.
You are certainly right, but the code in io.py is already difficult to understand and maintain; the corresponding C code adds one level of complexity; had I changed the buffering strategy at the same time, it would have been impossible to ensure a correct implementation. Now that my C implementation of the Buffered classes seems correct (all tests pass, except a few about destructors), we could try alternative approaches.
Author: Antoine Pitrou (pitrou) *
Date: 2008-12-20 12:46
We can't solve this for 3.0.1, downgrading to critical.
Author: Antoine Pitrou (pitrou) *
Date: 2009-01-18 12:11
The work to rewrite the IO stack in C will solve this problem as it will probably solve most performance-related IO problems in py3k.
Amaury and I have been progressing a lot; the rewrite is now a real branch in SVN at branches/io-c/. On this very issue, it is only 30% slower than 2.x, which is quite good given the layered nature of the IO stack and the fact that text IO does a lot more than in 2.x (it translates newlines and encodes the text).
(actually, if I add an explicit .encode('utf8') call to the 2.x version of the script, it becomes slower than our io-c rewrite)
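The layering mentioned above can be observed from Python. The sketch below assumes a current CPython 3 and uses a throwaway temporary file; it opens a text file, walks down through the text, buffered and raw layers, and then reads the bytes back to show the newline translation and encoding that the text layer performs.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "layers.txt")

    # newline="\r\n" forces translation of "\n" on write, whatever the platform.
    with open(path, "w", encoding="utf-8", newline="\r\n") as text:
        buffered = text.buffer   # the buffering layer under the text layer
        raw = buffered.raw       # the unbuffered file object at the bottom
        layers = [type(text).__name__, type(buffered).__name__, type(raw).__name__]
        text.write("caf\u00e9\n")

    with open(path, "rb") as f:
        on_disk = f.read()

print(layers)
print(on_disk)
```

Each write() on the text object therefore passes through three objects, which is part of why a pure-Python implementation of the stack was so much slower than the 2.x file object.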
Author: Antoine Pitrou (pitrou) *
Date: 2009-02-14 21:09
Issue depends on #4967 which blocks use of memoryview objects with the _ssl module.
Author: Benjamin Peterson (benjamin.peterson) *
Date: 2009-02-19 03:36
This is basically going to be the killer feature in 3.1 ;). Therefore, these are steps I think we need before we can merge the branch:
- Fix the dependencies. (#4967)
- Resolve all outstanding issues with the IO lib on the io-c branch.
- Rewrite the rest of StringIO in C?
- Anything else I forgot?
Author: Antoine Pitrou (pitrou) *
Date: 2009-02-19 13:55
After "rewrite the rest of StringIO in C", there's "sanitize the destructor behaviour of IOBase (if at all possible)".
Author: Antoine Pitrou (pitrou) *
Date: 2009-02-19 13:57
Oh, and "what to do of the now unused pure Python implementations in io.py"? Easiest would be to dump them, as they will probably get hopelessly out of sync, but perhaps there's some genuine portability/educational advantage to keep them?
Author: Benjamin Peterson (benjamin.peterson) *
Date: 2009-02-19 19:47
I think we should just drop the Python implementations. There's no point in trying to keep two implementations around.
Besides, if we don't backport IO in C, we can maintain them in the trunk. :)
Author: Jean-Paul Calderone (exarkun) *
Date: 2009-02-19 19:57
> Oh, and "what to do of the now unused pure Python implementations in io.py"? Easiest would be to dump them, as they will probably get hopelessly out of sync, but perhaps there's some genuine portability/educational advantage to keep them?
The test suite should be run against both implementations. That way tested behavior will always be the same for both. And all of its behavior is tested, right? ;)
The value in the Python implementation is manifold. For example:
- It eases testing of new features/techniques. Rather than going straight to the C version when someone has an idea for a feature, it can be implemented and tried out in Python. If it's cool, then the extra effort of porting to C can be undertaken.
- It helps other Python implementations immensely. PyPy, IronPython, and Jython are all going to have to provide this library eventually (one supposes). Forcing them each to re-implement it will mean it will be that much longer before they support it.
Author: Benjamin Peterson (benjamin.peterson) *
Date: 2009-02-19 20:01
On Thu, Feb 19, 2009 at 1:57 PM, Jean-Paul Calderone <report@bugs.python.org> wrote:
> Jean-Paul Calderone <exarkun@divmod.com> added the comment:
>> Oh, and "what to do of the now unused pure Python implementations in io.py"? Easiest would be to dump them, as they will probably get hopelessly out of sync, but perhaps there's some genuine portability/educational advantage to keep them?
> The test suite should be run against both implementations. That way tested behavior will always be the same for both. And all of its behavior is tested, right? ;)
> The value in the Python implementation is manifold. For example:
> - It eases testing of new features/techniques. Rather than going straight to the C version when someone has an idea for a feature, it can be implemented and tried out in Python. If it's cool, then the extra effort of porting to C can be undertaken.
> - It helps other Python implementations immensely. PyPy, IronPython, and Jython are all going to have to provide this library eventually (one supposes). Forcing them each to re-implement it will mean it will be that much longer before they support it.
We don't maintain any other features in two languages for those purposes. IMO, it will just be more of a burden to fix bugs in two different places as compared to the advantages you mention.
Author: Jean-Paul Calderone (exarkun) *
Date: 2009-02-19 20:05
> We don't maintain any other features in two languages for those purposes. IMO, it will just be more of a burden to fix bugs in two different places as compared to the advantages you mention.
Surely the majority of the burden is imposed by the C implementation. I expect that 90% of the time spent fixing bugs will be spent fixing them in C. So for only a slightly increased maintenance cost, a massive advantage is gained for other Python implementations. If the general well-being and popularity of Python isn't a concern of CPython developers, then perhaps the benefits can still be preserved at minimal cost to the CPython developers by letting some Jython, IronPython, or PyPy developers maintain the Python implementation of the io library in the CPython source tree (rather than making them copy it elsewhere where it will more frequently get out of sync, and where Jython/IronPython/PyPy might waste effort in duplicating maintenance).
Or maybe none of them will care or object to the removal of the Python version from CPython. It might at least be worth asking first, though.
Author: Antoine Pitrou (pitrou) *
Date: 2009-02-19 22:10
Hello JP,
> Surely the majority of the burden is imposed by the C implementation. I expect that 90% of the time spent fixing bugs will be spent fixing them in C.
Hmm, it depends. It's probably true in general, but I suspect a fair amount of work also went into getting the Python implementation correct, since there are things in there that are tricky regardless of the implementation language (I'm especially thinking of the TextIOWrapper seek() and tell() methods). (and there are still bugs in the Python implementation btw.)
> If the general well-being and popularity of Python isn't a concern of CPython developers, then perhaps the benefits can still be preserved at minimal cost to the CPython developers by letting some Jython, IronPython, or PyPy developers maintain the Python implementation of the io library in the CPython source tree
Well, if it is part of the CPython source tree, we (CPython developers) can't realistically ignore it by saying it's someone else's job.
> Or maybe none of them will care or object to the removal of the Python version from CPython. It might at least be worth asking first, though.
In any case, it must first be asked on python-dev. We're not gonna dump the code without telling anybody anything :)
cheers
Antoine.
Author: Raymond Hettinger (rhettinger) *
Date: 2009-02-19 22:32
[Benjamin Peterson]
> I think we should just drop the Python implementations. There's no point in trying to keep two implementations around.
I disagree. I've found great value in keeping a pure Python version around for things I've converted to C. The former serves as documentation, as a tool for other implementations (like PyPy, IronPython, and Jython), and as a precise spec. The last role is especially valuable (otherwise, the spec becomes whatever CPython happens to do).
Also, I've found that once the two are in sync, keeping them that way isn't hard. And the effort of keeping them in sync is a good way to find bugs.
In the heapq module, we do a little magic in the test suite to make sure the tests are run against both. It's not hard.
Raymond
Author: Jean-Paul Calderone (exarkun) *
Date: 2009-02-19 22:34
Hi Antoine,
>> Surely the majority of the burden is imposed by the C implementation. I expect that 90% of the time spent fixing bugs will be spent fixing them in C.
> Hmm, it depends. It's probably true in general, but I suspect a fair amount of work also went into getting the Python implementation correct, since there are things in there that are tricky regardless of the implementation language (I'm especially thinking of the TextIOWrapper seek() and tell() methods). (and there are still bugs in the Python implementation btw.)
Indeed, I'm sure a lot of work went into the Python implementation - and hopefully that work saved a huge amount of work when doing the C implementation. That's why people prototype things in Python, right? :) So it seems to me that keeping the Python implementation is useful for CPython, since if it made working on the C implementation easier in the past, it will probably do so again in the future.
Basically, my point is that maintaining C and Python versions is cheaper than just maintaining the C version alone. The stuff I said about other VMs is true too, but it doesn't seem like anyone here is going to be convinced by it ;) (and I haven't spoken to any developers for other VMs about whether they really want it, anyway).
Author: Raymond Hettinger (rhettinger) *
Date: 2009-02-19 23:30
> Basically, my point is that maintaining C and Python versions is cheaper than just maintaining the C version alone.
Well said.
Author: Gregory P. Smith (gregory.p.smith) *
Date: 2009-02-19 23:39
+1 to setting it up so that unit tests are always run against both and keeping both.
Author: Antoine Pitrou (pitrou) *
Date: 2009-02-19 23:49
> +1 to setting it up so that unit tests are always run against both and keeping both.
If this is the way forward I recommend putting the pure Python versions into a separate module, eg pyio.py (although the name is not very elegant). It will make the separation clean and obvious.
(and perhaps it will have the side-effect of improving startup time, although I'm not really worried about this)
Author: Benjamin Peterson (benjamin.peterson) *
Date: 2009-02-20 19:27
It seems the decision of Python-dev is to keep both implementations. We'll stuff the python one in _pyio and rewrite the tests to test both. I'll see if I can get to this this weekend.
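One common way to run the same tests against both implementations is a mixin that is parametrized by module. The sketch below is only an illustration of that pattern, not the actual test_io code; the class names and the two tests are hypothetical, and it assumes a CPython where the pure-Python implementation is importable as _pyio.

```python
import io
import unittest

import _pyio  # pure-Python implementation (assumed available, as in CPython)


class BytesIOTestMixin:
    """Tests written once; io_module is supplied by the concrete subclasses."""

    io_module = None

    def test_roundtrip(self):
        buf = self.io_module.BytesIO()
        buf.write(b"spam")
        self.assertEqual(buf.getvalue(), b"spam")

    def test_seek_and_read(self):
        buf = self.io_module.BytesIO(b"hello")
        buf.seek(1)
        self.assertEqual(buf.read(3), b"ell")


class CBytesIOTest(BytesIOTestMixin, unittest.TestCase):
    io_module = io      # C implementation


class PyBytesIOTest(BytesIOTestMixin, unittest.TestCase):
    io_module = _pyio   # Python implementation


suite = unittest.TestSuite()
for case in (CBytesIOTest, PyBytesIOTest):
    suite.addTests(unittest.defaultTestLoader.loadTestsFromTestCase(case))
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Because the mixin itself does not derive from unittest.TestCase, it is never collected on its own; only the two concrete classes run, so any behavioural difference between the two modules shows up as a failure in exactly one of them.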
Author: Antoine Pitrou (pitrou) *
Date: 2009-02-21 19:12
The StringIO rewrite is finished now.
Author: Benjamin Peterson (benjamin.peterson) *
Date: 2009-02-21 20:09
Ok. I've split the Python io implementation into the _pyio module and rewritten the tests. All the C ones are passing, but some Python implementation ones are failing.
Author: Benjamin Peterson (benjamin.peterson) *
Date: 2009-02-22 01:13
Ok. I've fixed all the tests except PyBufferedRandomTest.testFlushAndPeek and the garbage collection ones.
Author: Antoine Pitrou (pitrou) *
Date: 2009-02-22 19:50
What should we do about test_fileio, test_file and test_bufio?
Author: Benjamin Peterson (benjamin.peterson) *
Date: 2009-02-22 21:31
On Sun, Feb 22, 2009 at 1:50 PM, Antoine Pitrou <report@bugs.python.org> wrote:
> Antoine Pitrou <pitrou@free.fr> added the comment:
> What should we do about test_fileio, test_file and test_bufio?
I changed test_file and test_bufio to test the open() implementations of each library. test_fileio should be fine because the implementation is the same for _pyio and io.
Author: Antoine Pitrou (pitrou) *
Date: 2009-02-22 23:00
There's also test_univnewlines, I think.
Author: Antoine Pitrou (pitrou) *
Date: 2009-02-23 20:12
Oh, and test_largefile and test_debussy as well :)
On Sunday 22 February 2009 at 23:00 +0000, Antoine Pitrou wrote:
> Antoine Pitrou <pitrou@free.fr> added the comment:
> There's also test_univnewlines, I think.
Author: Antoine Pitrou (pitrou) *
Date: 2009-02-23 20:26
test_largefile is done. One more question: what shall we do with _pyio.OpenWrapper? Should it become the default exported "open" object?
Author: Benjamin Peterson (benjamin.peterson) *
Date: 2009-02-24 03:11
On Mon, Feb 23, 2009 at 2:26 PM, Antoine Pitrou <report@bugs.python.org> wrote:
> test_largefile is done.
Thanks.
> One more question: what shall we do with _pyio.OpenWrapper? Should it become the default exported "open" object?
No, I think it was just meant to be used when _pyio is the builtin open implementation.
Author: Benjamin Peterson (benjamin.peterson) *
Date: 2009-02-24 19:17
We also have to figure out how to make the C IOBase an ABC, so people can implement it.
Author: Antoine Pitrou (pitrou) *
Date: 2009-02-24 19:51
> We also have to figure out how to make the C IOBase an ABC, so people can implement it.
Mmmh, I know absolutely nothing about the ABC implementation.
Author: Antoine Pitrou (pitrou) *
Date: 2009-02-25 15:14
I just took a quick look at Lib/abc.py and there's no way I'll reimplement it in C :)
The only workable approach would be:
1. rename the current would-be ABCs (IOBase, RawIOBase, etc.) with a leading underscore (_IOBase, _RawIOBase, etc.)
2. call abc.ABCMeta() with the right arguments to create heap-types derived from those base types
3. call XXXIOBase.register() with each of the concrete classes (BufferedReader, etc.) to register them with the ABCs created in step 2
That is, do something like the following:
    IOBase = abc.ABCMeta("IOBase", (_io.IOBase,), {})
    RawIOBase = type("RawIOBase", (_io.RawIOBase, IOBase), {})
    RawIOBase.register(_io.FileIO)
    TextIOBase = type("TextIOBase", (_io.TextIOBase, IOBase), {})
    TextIOBase.register(_io.TextIOWrapper)
Which gives:
    >>> f = open('foobar', 'wb', buffering=0)
    >>> isinstance(f, RawIOBase)
    True
    >>> isinstance(f, IOBase)
    True
    >>> f = open('foobar', 'w')
    >>> isinstance(f, IOBase)
    True
    >>> isinstance(f, TextIOBase)
    True
    >>> isinstance(f, RawIOBase)
    False
As you see, RawIOBase inherits both from IOBase (the ABC, for ABC-ness) and _RawIOBase (the concrete non-ABC implementation). Implementation classes like FileIO don't need to explicitly inherit the ABCs, only to register with them.
Also, writing a Python implementation still inherits the close-on-destroy behaviour:
    >>> class S(RawIOBase):
    ...     def close(self):
    ...         print("closing")
    ...
    >>> s = S()
    >>> del s
    closing
Perhaps we could even do all this in Python in io.py?
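The resulting relationships can be checked from the public io module. The sketch below assumes a current CPython, where io.py builds the ABCs on top of the C base types and registers the concrete C classes with them; other implementations may be laid out differently, and the temporary file is only there to obtain real file objects.

```python
import abc
import io
import os
import tempfile

# The public base class is an ABC...
iobase_is_abc = type(io.IOBase) is abc.ABCMeta

# ...and FileIO counts as a RawIOBase through registration,
# not through inheritance from the ABC itself.
fileio_is_virtual_subclass = issubclass(io.FileIO, io.RawIOBase)
abc_in_fileio_mro = io.RawIOBase in io.FileIO.__mro__

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "foobar")
    with open(path, "wb", buffering=0) as f:
        raw_checks = (isinstance(f, io.RawIOBase), isinstance(f, io.IOBase))
    with open(path, "w") as f:
        text_checks = (
            isinstance(f, io.IOBase),
            isinstance(f, io.TextIOBase),
            isinstance(f, io.RawIOBase),
        )

print(iobase_is_abc, fileio_is_virtual_subclass, abc_in_fileio_mro)
print(raw_checks, text_checks)
```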
Author: Benjamin Peterson (benjamin.peterson) *
Date: 2009-02-25 19:34
On Wed, Feb 25, 2009 at 10:15 AM, Antoine Pitrou <report@bugs.python.org> wrote:
> Antoine Pitrou <pitrou@free.fr> added the comment:
> I just took a quick look at Lib/abc.py and there's no way I'll reimplement it in C :)
I don't blame you for that. :)
> The only workable approach would be:
> 1. rename the current would-be ABCs (IOBase, RawIOBase, etc.) with a leading underscore (_IOBase, _RawIOBase, etc.)
> 2. call abc.ABCMeta() with the right arguments to create heap-types derived from those base types
> 3. call XXXIOBase.register() with each of the concrete classes (BufferedReader, etc.) to register them with the ABCs created in step 2
I think this is the best solution. We could also just move the Python ABCs from _pyio to io.py and register() all the C IO classes, but that would prevent the C implementation of IOBase from being used.
Author: Antoine Pitrou (pitrou) *
Date: 2009-02-27 22:37
Ok, so the ABC stuff is done now. Remaining:
- fix the test failures with the Python implementation
- the _ssl bug
Author: Benjamin Peterson (benjamin.peterson) *
Date: 2009-02-28 17:12
I just fixed the last failing test_io.
(I'm listing as dependencies issues we can close after the branch is merged.)
Author: Benjamin Peterson (benjamin.peterson) *
Date: 2009-02-28 17:29
These StringIO bugs should be dealt with:
Author: Antoine Pitrou (pitrou) *
Date: 2009-03-03 00:24
Reviewers: ,
Description: The diff between the py3k and io-c branches, for whoever wants to review it.
Please review this at http://codereview.appspot.com/22061
Affected files:
  Doc/library/io.rst
  Lib/_pyio.py
  Lib/importlib/__init__.py
  Lib/importlib/_bootstrap.py
  Lib/io.py
  Lib/test/test_bufio.py
  Lib/test/test_descr.py
  Lib/test/test_file.py
  Lib/test/test_fileio.py
  Lib/test/test_io.py
  Lib/test/test_largefile.py
  Lib/test/test_memoryio.py
  Lib/test/test_univnewlines.py
  Lib/test/test_uu.py
  Makefile.pre.in
  Modules/Setup.dist
  Modules/_bufferedio.c
  Modules/_bytesio.c
  Modules/_fileio.c
  Modules/_iobase.c
  Modules/_iomodule.h
  Modules/_stringio.c
  Modules/_textio.c
  Modules/io.c
  PC/VC6/pythoncore.dsp
  PC/config.c
  PCbuild/pythoncore.vcproj
  Python/pythonrun.c
  setup.py
Author: Daniel Diniz (ajaksu2) *
Date: 2009-03-03 21:09
A couple of typos in the Python implementation.
http://codereview.appspot.com/22061/diff/1/11
File Lib/_pyio.py (right):

http://codereview.appspot.com/22061/diff/1/11#newcode266
Line 266: fp is closed after the suite of the with statment is complete:
statment -> statement

http://codereview.appspot.com/22061/diff/1/11#newcode844
Line 844: self._reset_read_buf()
Setting "_read_buf" and "_read_pos" directly on init may help introspection tools.

http://codereview.appspot.com/22061/diff/1/11#newcode963
Line 963: DEAFULT_BUFFER_SIZE. If max_buffer_size is omitted, it defaults to
DEAFULT_BUFFER_SIZE -> DEFAULT_BUFFER_SIZE

http://codereview.appspot.com/22061/diff/1/11#newcode1728
Line 1728: decoder = self._decoder or self._get_decoder()
'decoder' isn't used in this method; is this here for a useful side effect?

http://codereview.appspot.com/22061/diff/1/11#newcode1784
Line 1784: more_line = ''
This seems unused.

http://codereview.appspot.com/22061
Author: Benjamin Peterson (benjamin.peterson) *
Date: 2009-03-03 21:47
2009/3/3 Daniel Diniz <report@bugs.python.org>:
> A couple of typos in the Python implementation.
Thanks for taking a look! Fixed these things in r70135.
> http://codereview.appspot.com/22061/diff/1/11#newcode844
> Line 844: self._reset_read_buf()
> Setting "_read_buf" and "_read_pos" directly on init may help introspection tools.
Perhaps, but I think it duplicates too much of _reset_read_buf(). And it wouldn't damage introspection, just static analysis.
> http://codereview.appspot.com/22061/diff/1/11#newcode1728
> Line 1728: decoder = self._decoder or self._get_decoder()
> 'decoder' isn't used in this method; is this here for a useful side effect?
Yes, it's for the side effect, but it needn't be in a variable.
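On the _reset_read_buf() point, a small sketch may help show why runtime introspection is unaffected. The Reader class below is a hypothetical stand-in, not the _pyio code: it creates its attributes in a helper called from __init__, and they are still visible on the instance even though they never appear textually in __init__.

```python
class Reader:
    """Hypothetical stand-in for a class that initialises state via a helper."""

    def __init__(self):
        self._reset_read_buf()   # attributes are created indirectly

    def _reset_read_buf(self):
        self._read_buf = b""
        self._read_pos = 0


r = Reader()
attrs = sorted(vars(r))
print(attrs)
```

Only a tool that reads the body of __init__ without following the call would miss the two attributes.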
Author: Benjamin Peterson (benjamin.peterson) *
Date: 2009-03-04 21:29
And the io-c branch has been merged in r70152.
History
Date | User | Action | Args
2022-04-11 14:56:42 | admin | set | github: 48815
2009-03-04 21:30:46 | benjamin.peterson | set | dependencies: - possible deadlock in python IO implementation
2009-03-04 21:29:08 | benjamin.peterson | set | status: open -> closed; resolution: fixed; messages: +
2009-03-03 21:47:43 | benjamin.peterson | set | messages: +
2009-03-03 21:09:34 | ajaksu2 | set | nosy: + ajaksu2; messages: +
2009-03-03 00:24:24 | pitrou | set | messages: +
2009-02-28 17:29:01 | benjamin.peterson | set | messages: +
2009-02-28 17:25:56 | benjamin.peterson | set | dependencies: + utf-16 BOM is not skipped after seek(0), Duplicate UTF-16 BOM if a file is open in append mode
2009-02-28 17:14:36 | benjamin.peterson | set | dependencies: + possible deadlock in python IO implementation
2009-02-28 17:12:32 | benjamin.peterson | set | dependencies: + BufferedWriter non-blocking overage, io.TextIOWrapper calls buffer.read1(); messages: +
2009-02-27 22:37:55 | pitrou | set | messages: +
2009-02-25 19:34:49 | benjamin.peterson | set | messages: +
2009-02-25 15:15:01 | pitrou | set | messages: +
2009-02-24 19:51:25 | pitrou | set | messages: +
2009-02-24 19:17:00 | benjamin.peterson | set | messages: +
2009-02-24 03:11:26 | benjamin.peterson | set | messages: +
2009-02-23 20:26:11 | pitrou | set | messages: +
2009-02-23 20:12:50 | pitrou | set | messages: +
2009-02-22 23:00:36 | pitrou | set | messages: +
2009-02-22 21:31:59 | benjamin.peterson | set | messages: +
2009-02-22 19:50:52 | pitrou | set | messages: +
2009-02-22 01:13:11 | benjamin.peterson | set | messages: +
2009-02-21 20:09:06 | benjamin.peterson | set | messages: +
2009-02-21 19:12:11 | pitrou | set | messages: +
2009-02-20 19:27:46 | benjamin.peterson | set | messages: +
2009-02-19 23:49:07 | pitrou | set | messages: +
2009-02-19 23:39:48 | gregory.p.smith | set | nosy: + gregory.p.smith; messages: +
2009-02-19 23:30:21 | rhettinger | set | messages: +
2009-02-19 22:34:31 | exarkun | set | messages: +
2009-02-19 22:32:19 | rhettinger | set | nosy: + rhettinger; messages: +
2009-02-19 22:10:00 | pitrou | set | messages: +
2009-02-19 20:05:52 | exarkun | set | messages: +
2009-02-19 20:01:12 | benjamin.peterson | set | messages: +
2009-02-19 19:57:34 | exarkun | set | nosy: + exarkun; messages: +
2009-02-19 19:47:46 | benjamin.peterson | set | messages: +
2009-02-19 13:57:30 | pitrou | set | messages: +
2009-02-19 13:55:41 | pitrou | set | messages: +
2009-02-19 03:36:02 | benjamin.peterson | set | messages: +
2009-02-17 02:05:35 | benjamin.peterson | set | nosy: + benjamin.peterson
2009-02-14 21:09:33 | pitrou | set | assignee: amaury.forgeotdarc -> ; dependencies: + Bugs in _ssl object read() when a buffer is specified; messages: +
2009-01-18 14:25:43 | pitrou | unlink |
2009-01-18 12:33:27 | pitrou | link |
2009-01-18 12:13:00 | pitrou | link |
2009-01-18 12:11:42 | pitrou | set | title: io write() performance very slow -> Rewrite the IO stack in C; stage: needs patch; messages: +; components: + Extension Modules, Library (Lib), - Interpreter Core; versions: + Python 3.1, - Python 3.0
2008-12-20 12:46:42 | pitrou | set | priority: release blocker -> critical; messages: +
2008-12-20 02:40:48 | loewis | set | priority: deferred blocker -> release blocker
2008-12-10 08:24:26 | loewis | set | priority: release blocker -> deferred blocker
2008-12-07 08:03:07 | wplappert | set | nosy: + wplappert
2008-12-06 23:39:39 | barry | set | priority: high -> release blocker
2008-12-06 22:48:37 | amaury.forgeotdarc | set | messages: +
2008-12-06 22:01:04 | christian.heimes | set | nosy: + christian.heimes; messages: +
2008-12-06 21:52:37 | pitrou | set | nosy: + pitrou; messages: +
2008-12-06 21:36:25 | giampaolo.rodola | set | nosy: + giampaolo.rodola
2008-12-06 18:26:58 | ialbert | set | messages: +
2008-12-06 15:40:25 | amaury.forgeotdarc | set | priority: high; assignee: amaury.forgeotdarc; messages: +; nosy: + amaury.forgeotdarc
2008-12-06 15:10:59 | ialbert | create |