[Python-Dev] IO module improvements
Pascal Chambon chambon.pascal at gmail.com
Sat Feb 6 12:43:08 CET 2010
Antoine Pitrou wrote:
What is the difference between "file handle" and a regular C file descriptor? Is it some Windows-specific thing? If so, then perhaps it deserves some Windows-specific attribute ("handle"?).
At the moment it's Windows-specific, but it's not impossible that some other OSes also rely on specific file handles (only emulating C file descriptors for compatibility). I've indeed mirrored the fileno concept, with a "handle" argument for constructors and a handle() getter.
On Fri, Feb 5, 2010 at 5:28 AM, Antoine Pitrou <solipsis at pitrou.net> wrote:
Pascal Chambon <pythoniks gmail.com> writes:
By the way, I'm having trouble with the "name" attribute of raw files, which can be string or integer (confusing), ambiguous if containing a relative path,
Why is it ambiguous? It sounds like you're using str() of the name and then can't tell whether the file is named e.g. '1' or whether it refers to file descriptor 1 (i.e. sys.stdout).
As Jean-Paul mentioned, I find it confusing that it can be a relative path, and sometimes not a path at all. I'm pretty sure many programmers haven't even considered in their library code that it could be a non-string, and use concatenation etc. on it... However, I guess there is so much history behind it that I'll have to conform to these semantics, putting all paths/fileno/handle in the same "name" property, and adding an "origin" property telling how to interpret the "name"...
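To make the int-versus-str point concrete, here is a minimal sketch, assuming a current CPython 3. The helper describe_name is hypothetical (it is not part of io, and it is not the "origin" property proposed above); it only shows that the same attribute holds a descriptor number in one case and a path in the other.

```python
import io
import os
import tempfile

def describe_name(stream):
    """Hypothetical helper: say how a stream's name should be interpreted."""
    name = stream.name
    if isinstance(name, int):
        return "fileno"
    return "path"

fd, path = tempfile.mkstemp()
try:
    by_fd = io.FileIO(fd, "r")      # built from a descriptor; name is the int fd
    by_path = io.FileIO(path, "r")  # built from a path; name is the str path
    kinds = (describe_name(by_fd), describe_name(by_path))
    by_fd.close()                   # also closes fd (closefd defaults to True)
    by_path.close()
finally:
    os.remove(path)
```

Library code that does something like `"log for " + stream.name` would fail with a TypeError on the first stream, which is the kind of surprise described above.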
Methods too would deserve some auto-forwarding. If you want to buffer a raw stream which also offers size(), times(), lockfile() and other methods, how can these be accessed from a top-level buffering/text stream?
I think it's a bad idea. If you forget to implement one of the standard IO methods (e.g. seek()), it will get forwarded to the raw stream, but with the wrong semantics (because it won't take buffering into account). It's better to require the implementor to do the forwarding explicitly if desired, IMO.
The problem is, doing that forwarding is quite complicated. IO is a collection of "core tools for working with streams", but it's currently not flexible enough to let people customize them too... For example, if I want to add a new series of methods to all standard streams, which simply forward calls to new raw stream features, what do I do? Monkey-patching base classes (RawFileIO, BufferedIOBase...)? Not a good pattern. Subclassing FileIO+BufferedWriter+BufferedReader+BufferedRandom+TextIOWrapper? That's really redundant...
And there are flaws especially around BufferedRandom. This stream inherits from BufferedWriter and BufferedReader, and overrides some methods. How do I extend it? I'd want to reuse its methods, but then have it forward calls to MY buffered classes, not the original BufferedWriter or BufferedReader classes. Should I modify its bases to edit the inheritance tree? Handy but not a good pattern... I'm currently getting what I want with a triple inheritance (praying for the MRO to be as I expect), but it's really not straightforward. Having BufferedRandom as an additional layer would slow down the system, but allow its reuse with custom buffered writers and readers...
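As an illustration of the explicit forwarding discussed above, here is a minimal sketch. The classes SizedFileIO and SizedBufferedReader and the size() method are hypothetical names chosen for the example; only io.FileIO, io.BufferedReader and the raw attribute are real parts of the io module.

```python
import io
import os
import tempfile

class SizedFileIO(io.FileIO):
    """Raw stream with one extra, hypothetical feature: size()."""
    def size(self):
        return os.fstat(self.fileno()).st_size

class SizedBufferedReader(io.BufferedReader):
    """Buffered layer that forwards size() explicitly to the raw stream."""
    def size(self):
        # Safe for a reader: read buffering does not change the file size.
        return self.raw.size()

fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"hello")
    os.close(fd)
    with SizedBufferedReader(SizedFileIO(path, "r")) as stream:
        reported = stream.size()
finally:
    os.remove(path)
```

The same boilerplate would have to be repeated for a writer (which would probably need to flush before asking the raw stream), for BufferedRandom and for TextIOWrapper, which is the redundancy complained about above.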
- I feel thread-safety locking and stream status checking are currently overly complicated. All methods are filled with locking calls and CheckClosed() calls, which is both a performance loss (most io streams will have 3 such levels of locking, when 1 would suffice)
FileIO objects don't have a lock, so there are 2 levels of locking at worst, not 3 (and, actually, TextIOWrapper doesn't have a lock either, although perhaps it should). As for the checkClosed() calls, they are probably cheap, especially if they bypass regular attribute lookup.
CheckClosed calls are cheap, but they can easily be forgotten in one of the dozens of methods involved... My own FileIO class alas needs locking, because for example, on Windows truncating a file means seeking + setting end of file + restoring pointer. And TextIOWrapper seems to deserve locks. Maybe excerpts like this one really are thread-safe, but a long study would be required to ensure it.
    if whence == 2: # seek relative to end of file
        if cookie != 0:
            raise IOError("can't do nonzero end-relative seeks")
        self.flush()
        position = self.buffer.seek(0, 2)
        self._set_decoded_chars('')
        self._snapshot = None
        if self._decoder:
            self._decoder.reset()
        return position
Since we're in a mood of nesting streams anyway, why not simply add a "safety stream" on top of each stream chain returned by open()? That layer could gracefully handle mutex locking, CheckClosed() calls, and even, maybe, the attribute/method forwarding I mentioned above.
It's an interesting idea, but it could also end up slower than the current situation. First because you are adding a level of indirection (i.e. additional method lookups and method calls). Second because currently the locks aren't always taken. For example, in BufferedReader, we needn't take the lock when the requested data is available in our buffer (the GIL already protects us). Having a separate "synchronizing" wrapper would forbid such micro-optimizations. If you want to experiment with this, you can use iobench (in the Tools directory) to measure file IO performance.
There is a chance that my approach is slower, but the gains are so high in terms of maintainability and ease of use that I would definitely advocate it. Typically, the micro-optimizations you speak about can please heavy programs, but they turn the code into a minefield (maybe that's why they haven't been put into _pyio :p). When the order of every instruction matters, when everything is carefully crafted so that the GIL is sufficient, I personally don't dare touch anything anymore...
There is for sure an important trade-off between speed and robustness here, but I fear speed has won too much so far (and now that the main implementation is in C, it's getting really hard to grasp).
Maybe I should take the latest _pyio version and make a fork offering high-level flexibility and safety, for those who don't care about such high performance?
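The "safety stream" idea could be sketched roughly as follows. This is only an illustration under stated assumptions: SafetyStream is a hypothetical class, not something in io, and it deliberately ignores the performance concerns raised above. It wraps any stream, takes one lock around every method call, performs the closed check in a single place, and forwards unknown attributes to the wrapped stream.

```python
import io
import threading

class SafetyStream:
    """Hypothetical top-level wrapper: one lock, one closed check, and
    attribute forwarding for everything the wrapped stream offers."""

    def __init__(self, stream):
        self._stream = stream
        self._lock = threading.RLock()

    def __getattr__(self, name):
        # Only called for attributes not found on the wrapper itself.
        attr = getattr(self._stream, name)
        if not callable(attr):
            return attr  # plain attributes such as .closed or .name

        def guarded(*args, **kwargs):
            with self._lock:
                if self._stream.closed and name != "close":
                    raise ValueError("I/O operation on closed file")
                return attr(*args, **kwargs)
        return guarded

safe = SafetyStream(io.BytesIO())
safe.write(b"abc")
safe.seek(0)
data = safe.read()
safe.close()
```

One known limitation of this approach: special methods such as __iter__ or __enter__ are looked up on the type, so __getattr__ does not forward them and they would have to be defined on the wrapper explicitly.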
- some semantic decisions of the current system are somewhat dangerous. For example, flushing errors occurring on close are swallowed. It seems to me that it's of the utmost importance that the user be warned if the bytes he wrote disappeared before reaching the kernel; shouldn't we firmly enforce a "don't hide errors" policy everywhere in the io module?
It may be a bug. Can you report it, along with a script or test showcasing it? Regards Antoine.
It seems a rather deliberate semantic (with comments like "#If flush() fails, just give up"), but yep, I'll file a bug to be sure.
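A script showcasing the behavior could look something like the sketch below. FailingRaw and close_reports_flush_error are hypothetical names made up for this example. The script makes no claim about which answer a given interpreter gives: it only reports whether close() propagates the error raised by the implicit flush, which depends on the Python version being tested.

```python
import io

class FailingRaw(io.RawIOBase):
    """Hypothetical raw stream whose writes always fail."""
    def writable(self):
        return True

    def write(self, data):
        raise OSError("simulated write failure")

def close_reports_flush_error():
    """Return True if close() propagates the error from the implicit flush."""
    stream = io.BufferedWriter(FailingRaw())
    stream.write(b"lost bytes")  # small enough to stay in the buffer
    try:
        stream.close()
    except OSError:
        return True
    return False

result = close_reports_flush_error()
```

A result of False would mean the buffered bytes were dropped silently, which is the situation described above.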
I don't think this can be helped though -- I really don't want open() to be slowed down or complicated by an attempt to do path manipulation. If this matters to the app author they should use os.path.abspath() or os.path.realpath() or whatever before calling open().
On second thought, having more precise "name" or "path" attributes might give users the impression that they can rely on them, whereas in fact the filesystem might have been modified a lot during the use of the stream (even on Windows, where files can actually be renamed/deleted while they're open)...
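For reference, the approach quoted above (normalizing the path in the application before calling open()) could look like this minimal sketch. The helper open_with_absolute_name is hypothetical; os.path.abspath() and open() are the only real APIs involved.

```python
import os
import tempfile

def open_with_absolute_name(path, mode="r"):
    """Hypothetical helper: resolve the path first, then call open()."""
    return open(os.path.abspath(path), mode)

old_cwd = os.getcwd()
with tempfile.TemporaryDirectory() as tmp:
    os.chdir(tmp)
    try:
        # A relative path is given, but the stream records the absolute one.
        with open_with_absolute_name("example.txt", "w") as f:
            name_is_absolute = os.path.isabs(f.name)
    finally:
        os.chdir(old_cwd)
```

As noted in the reply, the recorded name is only a snapshot taken at open time; it says nothing about later renames or deletions.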
AFAIK, they aren't simple indexes in Windows, and that's partly why even file descriptors cannot be safely passed between C runtimes on Windows (whereas they can in most Unices).
David
Yep, Windows file descriptors are actually emulated (with bugs...) on top of native file handles; that's why we can't rely on them for advanced stream operations.
Regards, Pascal