[Python-Dev] Issue 11406: adding os.scandir(), a directory iterator returning stat-like info
MRAB python at mrabarnett.plus.com
Fri May 10 16:30:54 CEST 2013
- Previous message: [Python-Dev] Issue 11406: adding os.scandir(), a directory iterator returning stat-like info
- Next message: [Python-Dev] Issue 11406: adding os.scandir(), a directory iterator returning stat-like info
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On 10/05/2013 11:55, Ben Hoyt wrote:
> A few of us were having a discussion at http://bugs.python.org/issue11406 about adding os.scandir(): a generator version of os.listdir() to make iterating over very large directories more memory efficient. This also reflects how the OS gives things to you -- it doesn't give you a big list, but you call a function to iterate and fetch the next entry.
>
> While I think that's a good idea, I'm not sure just that much is enough of an improvement to make adding the generator version worth it.
>
> But what would make this a killer feature is making os.scandir() generate tuples of (name, statlikeinfo). The Windows directory iteration functions (FindFirstFile/FindNextFile) give you the full stat information for free, and the Linux and OS X functions (opendir/readdir) give you partial file information (d_type in the dirent struct, which is basically the st_mode part of a stat, whether it's a file, directory, link, etc). Having this available at the Python level would mean we can vastly speed up functions like os.walk() that otherwise need to make an os.stat() call for every file returned.
>
> In my benchmarks of such a generator on Windows, it speeds up os.walk() by 9-10x. On Linux/OS X, it's more like 1.5-3x. In my opinion, that kind of gain is huge, especially on Windows, but also on Linux/OS X.
>
> So the idea is to add this relatively low-level function that exposes the extra information the OS gives us for free, but which os.listdir() currently throws away. Then higher-level, platform-independent functions like os.walk() could use os.scandir() to get much better performance.
>
> People over at Issue 11406 think this is a good idea. HOWEVER, there's debate over what kind of object the second element in the tuple, "statlikeinfo", should be. My strong vote is for it to be a stat_result-like object, but where the fields are None if they're unknown.
>
> There would be basically three scenarios:
>
> 1) stat_result with all fields set: this would happen on Windows, where you get as much info from FindFirst/FindNext as from an os.stat()
> 2) stat_result with just st_mode set, and all other fields None: this would be the usual case on Linux/OS X
> 3) stat_result with all fields None: this would happen on systems whose readdir()/dirent doesn't have d_type, or on Linux/OS X when d_type was DT_UNKNOWN
>
> Higher-level functions like os.walk() would then check the fields they needed are not None, and only call os.stat() if needed, for example:
>
>     # Build lists of files and directories in path
>     files = []
>     dirs = []
>     for name, st in os.scandir(path):
>         if st.st_mode is None:
>             st = os.stat(os.path.join(path, name))
>         if stat.S_ISDIR(st.st_mode):
>             dirs.append(name)
>         else:
>             files.append(name)
>
> Not bad for a 2-10x performance boost, right? What do folks think?
>
> Cheers,
> Ben.

[snip]

In the python-ideas list there's a thread "PEP: Extended stat_result" about adding methods to stat_result.
Using that, you wouldn't necessarily have to look at st.st_mode. The method could perform an additional os.stat() if the field was None. For example:
    # Build lists of files and directories in path
    files = []
    dirs = []
    for name, st in os.scandir(path):
        if st.is_dir():
            dirs.append(name)
        else:
            files.append(name)
That looks much nicer.
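
To make the "method performs the extra os.stat()" idea concrete, here is a minimal sketch of what such an object might look like. Everything named here is hypothetical: LazyStat, fake_scandir() and split_entries() are illustrations only, not part of the os or stat modules, and fake_scandir() merely stands in for the proposed os.scandir() by always reporting an unknown mode (scenario 3 above).

```python
import os
import stat


class LazyStat:
    """Hypothetical stat-like object; not a real os or stat API."""

    def __init__(self, path, st_mode=None):
        self.path = path        # full path, needed for the fallback os.stat()
        self.st_mode = st_mode  # None means the directory scan did not supply it

    def is_dir(self):
        # Call os.stat() only when the scan gave us no mode, then cache it.
        if self.st_mode is None:
            self.st_mode = os.stat(self.path).st_mode
        return stat.S_ISDIR(self.st_mode)


def fake_scandir(path):
    """Stand-in for the proposed os.scandir(); always yields unknown modes."""
    for name in os.listdir(path):
        yield name, LazyStat(os.path.join(path, name))


def split_entries(path):
    """Build lists of files and directories in path."""
    files, dirs = [], []
    for name, st in fake_scandir(path):
        if st.is_dir():
            dirs.append(name)
        else:
            files.append(name)
    return files, dirs
```

With this shape, callers never test st_mode for None themselves. When the platform does supply the mode, is_dir() answers without touching the file system again.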