[Python-Dev] Issue 11406: adding os.scandir(), a directory iterator returning stat-like info
Ben Hoyt benhoyt at gmail.com
Fri May 10 12:55:56 CEST 2013
A few of us were having a discussion at http://bugs.python.org/issue11406 about adding os.scandir(): a generator version of os.listdir() to make iterating over very large directories more memory efficient. This also reflects how the OS gives things to you -- it doesn't give you a big list, but you call a function to iterate and fetch the next entry.
While I think that's a good idea, I'm not sure just that much is enough of an improvement to make adding the generator version worth it.
But what would make this a killer feature is making os.scandir() generate tuples of (name, stat_like_info). The Windows directory iteration functions (FindFirstFile/FindNextFile) give you the full stat information for free, and the Linux and OS X functions (opendir/readdir) give you partial file information (d_type in the dirent struct, which is basically the file-type part of st_mode: whether it's a file, directory, link, etc).
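As a rough illustration of how d_type relates to st_mode: on Linux and OS X the d_type values are the file-type bits of st_mode shifted right by 12 bits (the C library's DTTOIF() macro performs the reverse shift). The sketch below assumes the constant values from <dirent.h> on those systems; Python does not expose them, and the helper name d_type_to_st_mode is made up for this example.

```python
import stat

# d_type values copied from <dirent.h> on Linux and OS X (assumed here;
# the os module does not provide them).
DT_UNKNOWN = 0
DT_DIR = 4
DT_REG = 8
DT_LNK = 10


def d_type_to_st_mode(d_type):
    """Return the file-type bits of st_mode for a d_type value.

    Returns None for DT_UNKNOWN, meaning the caller still needs a real
    stat() call to find out what kind of entry this is.
    """
    if d_type == DT_UNKNOWN:
        return None
    # Same shift as the C macro DTTOIF(): only the type bits are known,
    # the permission bits are not part of d_type.
    return d_type << 12


if __name__ == "__main__":
    for label, d_type in [("dir", DT_DIR), ("file", DT_REG), ("link", DT_LNK)]:
        mode = d_type_to_st_mode(d_type)
        print(label, oct(mode), stat.S_ISDIR(mode))
```

Note that the result carries no permission bits, so it is enough for stat.S_ISDIR() and friends but not a replacement for a full st_mode.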
Having this available at the Python level would mean we can vastly speed up functions like os.walk() that otherwise need to make an os.stat() call for every file returned. In my benchmarks of such a generator on Windows, it speeds up os.walk() by 9-10x. On Linux/OS X, it's more like 1.5-3x. In my opinion, that kind of gain is huge, especially on Windows, but also on Linux/OS X.
So the idea is to add this relatively low-level function that exposes the extra information the OS gives us for free, but which os.listdir() currently throws away. Then higher-level, platform-independent functions like os.walk() could use os.scandir() to get much better performance. People over at Issue 11406 think this is a good idea.
HOWEVER, there's debate over what kind of object the second element in the tuple, "stat_like_info", should be. My strong vote is for it to be a stat_result-like object, but where the fields are None if they're unknown. There would be basically three scenarios:
- stat_result with all fields set: this would happen on Windows, where you get as much info from FindFirst/FindNext as from an os.stat()
- stat_result with just st_mode set, and all other fields None: this would be the usual case on Linux/OS X
- stat_result with all fields None: this would happen on systems whose readdir()/dirent doesn't have d_type, or on Linux/OS X when d_type was DT_UNKNOWN
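The three scenarios above could look roughly like this. It is only a sketch: PartialStat and needs_stat are hypothetical names, the field list is a small subset of what os.stat_result carries, and a namedtuple stands in for whatever object the real function would return.

```python
import stat
from collections import namedtuple

# Hypothetical subset of os.stat_result's fields, for illustration only.
_FIELDS = ["st_mode", "st_size", "st_atime", "st_mtime", "st_ctime"]

# Every field defaults to None, meaning "the OS did not give us this".
PartialStat = namedtuple(
    "PartialStat", _FIELDS, defaults=(None,) * len(_FIELDS)
)

# Scenario 1: all fields set (what FindFirst/FindNext would provide).
full = PartialStat(
    st_mode=stat.S_IFREG | 0o666,
    st_size=1024,
    st_atime=0.0,
    st_mtime=0.0,
    st_ctime=0.0,
)

# Scenario 2: only st_mode set (the usual readdir() case with d_type).
type_only = PartialStat(st_mode=stat.S_IFDIR)

# Scenario 3: nothing known (no d_type, or d_type was DT_UNKNOWN).
unknown = PartialStat()


def needs_stat(st, *fields):
    """Return True if any of the requested fields is unknown (None)."""
    return any(getattr(st, field) is None for field in fields)
```

A caller that only cares about the entry type would ask needs_stat(st, "st_mode"), while one that also wants sizes would pass "st_size" as well and fall back to os.stat() more often on Linux/OS X.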
Higher-level functions like os.walk() would then check that the fields they need are not None, and only call os.stat() if needed, for example:
    # Build lists of files and directories in path
    files = []
    dirs = []
    for name, st in os.scandir(path):
        if st.st_mode is None:
            st = os.stat(os.path.join(path, name))
        if stat.S_ISDIR(st.st_mode):
            dirs.append(name)
        else:
            files.append(name)
Not bad for a 2-10x performance boost, right? What do folks think?
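To try the dirs/files pattern end to end without a native implementation, here is a self-contained sketch. fake_scandir, StatLike and list_dirs_and_files are hypothetical names; the emulation is built on os.listdir() and os.lstat(), so it demonstrates the shape of the proposed interface and the fallback logic only, with none of the speed benefit. The type_known flag simulates a filesystem that reports DT_UNKNOWN. Using os.lstat() in the fallback is a choice made for this sketch so that both paths treat symlinks the same way.

```python
import os
import stat
from collections import namedtuple

# Hypothetical stand-in for the proposed stat_result-like object.
StatLike = namedtuple("StatLike", ["st_mode"])


def fake_scandir(path, type_known=True):
    """Yield (name, stat_like) pairs, emulating the proposed interface.

    With type_known=False every entry has st_mode=None, as if the
    filesystem had returned DT_UNKNOWN for all entries.
    """
    for name in sorted(os.listdir(path)):
        if type_known:
            st = os.lstat(os.path.join(path, name))
            # Keep only the file-type bits, like d_type would provide.
            yield name, StatLike(st_mode=stat.S_IFMT(st.st_mode))
        else:
            yield name, StatLike(st_mode=None)


def list_dirs_and_files(path, type_known=True):
    """Split the entries of path into (dirs, files)."""
    dirs, files = [], []
    for name, st in fake_scandir(path, type_known):
        if st.st_mode is None:
            # Type unknown: fall back to a real stat call for this entry.
            st = os.lstat(os.path.join(path, name))
        if stat.S_ISDIR(st.st_mode):
            dirs.append(name)
        else:
            files.append(name)
    return dirs, files


if __name__ == "__main__":
    print(list_dirs_and_files("."))
```

Both settings of type_known should give the same result for a given directory; the only difference is how many entries need the extra stat call.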
Cheers, Ben.
P.S. A few non-essential further notes:
- As a Windows guy, a nice-to-have addition to os.scandir() would be a keyword arg like win_wildcard which defaults to '*.*', but which power users can pass in to use the wildcard feature of FindFirst/FindNext on Windows. We have plenty of other low-level functions in the os module that expose OS-specific features, so this would be no different. But then again, it's not nearly as important as exposing the stat info.
- I've been dabbling with this concept for a while in my BetterWalk library: https://github.com/benhoyt/betterwalk
  Note that the benchmarks there are old, and I've made further improvements in my local copy. The ctypes version gives speed gains for os.walk() of 2-3x on Windows, but I've also got a C version, which is giving 9-10x speed gains. I haven't yet got a Linux/OS X version written in C.
- See also the previous python-ideas thread on BetterWalk: http://mail.python.org/pipermail/python-ideas/2012-November/017944.html