[Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

Gregory P. Smith greg at krypto.org
Sun Jun 29 08:26:24 CEST 2014


On Jun 28, 2014 12:49 PM, "Ben Hoyt" <benhoyt at gmail.com> wrote:

> >> But the underlying system calls -- FindFirstFile /
> >> FindNextFile on Windows and readdir on Linux and OS X --
> >
> > What about FreeBSD, OpenBSD, NetBSD, Solaris, etc. They don't provide readdir?
>
> I guess it'd be better to say "Windows" and "Unix-based OSs"
> throughout the PEP? Because all of these (including Mac OS X) are
> Unix-based.

No, just say POSIX.

> > It looks like the WIN32_FIND_DATA has a dwFileAttributes field. So we
> > should mimic stat_result recent addition: the new
> > stat_result.file_attributes field. Add DirEntry.file_attributes which
> > would only be available on Windows.
> >
> > The Windows structure also contains
> >
> >   FILETIME ftCreationTime;
> >   FILETIME ftLastAccessTime;
> >   FILETIME ftLastWriteTime;
> >   DWORD    nFileSizeHigh;
> >   DWORD    nFileSizeLow;
> >
> > It would be nice to expose them as well. I'm no more surprised that
> > the exact API is different depending on the OS for functions of the os
> > module.
>
> I think you've misunderstood how DirEntry.lstat() works on Windows --
> it's basically a no-op, as Windows returns the full stat information
> with the original FindFirst/FindNext OS calls. This is fairly explicit
> in the PEP, but I'm sure I could make it clearer:
>
>     DirEntry.lstat(): "like os.lstat(), but requires no system calls on Windows"
>
> So you can already get the dwFileAttributes for free by saying
> entry.lstat().st_file_attributes. You can also get all the other
> fields you mentioned for free via .lstat() with no additional OS calls
> on Windows, for example: entry.lstat().st_size.
>
> Feel free to suggest changes to the PEP or scandir docs if this isn't
> clear. Note that is_dir()/is_file()/is_symlink() are free on all
> systems, but .lstat() is only free on Windows.
>
> > Does your implementation use a free list to avoid the cost of memory
> > allocation? A short free list of 10 or maybe just 1 may help. The free
> > list may be stored directly in the generator object.
>
> No, it doesn't. I might add this to the PEP under "possible
> improvements". However, I think the speed increase by removing the
> extra OS call and/or disk seek is going to be way more than memory
> allocation improvements, so I'm not sure this would be worth it.
>
> > Does it support also bytes filenames on UNIX?
> > Python now supports undecodable filenames thanks to the PEP 383
> > (surrogateescape). I prefer to use the same type for filenames on
> > Linux and Windows, so Unicode is better. But some users might prefer
> > bytes for other reasons.
>
> I forget exactly now what my scandir module does, but for os.scandir()
> I think this should behave exactly like os.listdir() does for
> Unicode/bytes filenames.
>
> > Crazy idea: would it be possible to "convert" a DirEntry object to a
> > pathlib.Path object without losing the cache? I guess that
> > pathlib.Path expects a full stat_result object.
>
> The main problem is that pathlib.Path objects explicitly don't cache
> stat info (and Guido doesn't want them to, for good reason I think).
> There's a thread on python-dev about this earlier. I'll add it to a
> "Rejected ideas" section.
>
> > I don't understand how you can build a full lstat() result without
> > really calling stat. I see that WIN32_FIND_DATA contains the size, but
> > here you call lstat().
>
> See above.
>
> > Do you plan to continue to maintain your module for Python < 3.5, but
> > upgrade your module for the final PEP?
>
> Yes, I intend to maintain the standalone scandir module for 2.6 <=
> Python < 3.5, at least for a good while. For integration into the
> Python 3.5 stdlib, the implementation will be integrated into
> posixmodule.c, of course.
>
> >> Should there be a way to access the full path?
> >> ----------------------------------------------
> >>
> >> Should DirEntry's have a way to get the full path without using
> >> os.path.join(path, entry.name)? This is a pretty common pattern,
> >> and it may be useful to add pathlib-like str(entry) functionality.
> >> This functionality has also been requested in issue 13 on GitHub.
> >>
> >> .. issue 13: https://github.com/benhoyt/scandir/issues/13
> >
> > I think that it would be very convenient to store the directory name
> > in the DirEntry. It should be light, it's just a reference.
> >
> > And provide a fullname() name which would just return
> > os.path.join(path, entry.name) without trying to resolve path to get
> > an absolute path.
>
> Yeah, fair suggestion. I'm still slightly on the fence about this, but
> I think an explicit fullname() is a good suggestion. Ideally I think
> it'd be better to mimic pathlib.Path.__str__() which is kind of the
> equivalent of fullname(). But how does pathlib deal with unicode/bytes
> issues if it's the __str__ function which has to return a str object?
> Or at least, it'd be very weird if __str__() returned bytes. But I
> think it'd need to if you passed bytes into scandir(). Do others have
> thoughts?
>
> > Would it be hard to implement the wildcard feature on UNIX to compare
> > performances of scandir('*.jpg') with and without the wildcard built
> > in os.scandir?
>
> It's a good idea, but the problem is that the Windows wildcard
> implementation has a bunch of crazy edge cases where *.ext will catch
> more things than just a simple regex/glob. This was discussed on
> python-dev or python-ideas previously, so I'll dig it up and add to a
> Rejected Ideas section. In any case, this could be added later if
> there's a way to iron out the Windows quirks.
>
> -Ben
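
For readers following the lstat() and fullname() points above, here is a
minimal sketch of the pattern being discussed. It is written against the
os.scandir() that later shipped in Python 3.5, not against the API as
proposed in this thread: in the shipped version the proposed lstat() is
spelled stat(follow_symlinks=False). The helper name list_entries is
hypothetical and only serves the illustration; the full path is built with
os.path.join(path, entry.name), exactly as described in the quoted PEP text.

```python
import os


def list_entries(path):
    """Return sorted (full_path, is_dir, size) tuples for entries directly under path.

    Hypothetical helper for illustration only; it is not part of the PEP
    or of the scandir module.
    """
    results = []
    for entry in os.scandir(path):
        # Build the full path the way the thread describes, without
        # resolving it to an absolute path.
        full_path = os.path.join(path, entry.name)
        # is_dir() can usually be answered from the directory listing itself.
        is_dir = entry.is_dir(follow_symlinks=False)
        # Shipped equivalent of the proposed lstat(): needs no extra system
        # call on Windows, but does call lstat() on POSIX systems.
        size = entry.stat(follow_symlinks=False).st_size
        results.append((full_path, is_dir, size))
    return sorted(results)


if __name__ == "__main__":
    for full_path, is_dir, size in list_entries("."):
        kind = "dir " if is_dir else "file"
        print(kind, size, full_path)
```

Passing a bytes path to os.scandir() yields bytes names, mirroring
os.listdir(), so os.path.join() in this sketch works with either type as
long as the caller does not mix them.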

