GitHub - benhoyt/scandir: Better directory iterator and faster os.walk(). Archived, as this has been in the stdlib since Python 3.5. (original) (raw)
scandir, a better directory iterator and faster os.walk()
**Note: This repository is archived, because the "scandir" feature has been included in the Python standard library since Python 3.5.**You can still download scandir from PyPI if you want, but I won't be making any more updates, and this repo has been archived. You're better off using os.scandir()and os.walk() directly from the Python standard library.
You can read the story of how I contributed scandir to Python, and I've kept the old README below.
scandir() is a directory iteration function like os.listdir(), except that instead of returning a list of bare filenames, it yieldsDirEntry objects that include file type and stat information along with the name. Using scandir() increases the speed of os.walk()by 2-20 times (depending on the platform and file system) by avoiding unnecessary calls to os.stat() in most cases.
Now included in a Python near you!
scandir has been included in the Python 3.5 standard library asos.scandir(), and the related performance improvements toos.walk() have also been included. So if you're lucky enough to be using Python 3.5 (release date September 13, 2015) you get the benefit immediately, otherwise justdownload this module from PyPI, install it with pip install scandir, and then do something like this in your code:
Use the built-in version of scandir/walk if possible, otherwise
use the scandir module version
try: from os import scandir, walk except ImportError: from scandir import scandir, walk
PEP 471, which is the PEP that proposes including scandir in the Python standard library, was acceptedin July 2014 by Victor Stinner, the BDFL-delegate for the PEP.
This scandir module is intended to work on Python 2.7+ and Python 3.4+ (and it has been tested on those versions).
Background
Python's built-in os.walk() is significantly slower than it needs to be, because -- in addition to calling listdir() on each directory -- it callsstat() on each file to determine whether the filename is a directory or not. But both FindFirstFile / FindNextFile on Windows and readdir on Linux/OS X already tell you whether the files returned are directories or not, so no further stat system calls are needed. In short, you can reduce the number of system calls from about 2N to N, where N is the total number of files and directories in the tree.
In practice, removing all those extra system calls makes os.walk() about7-50 times as fast on Windows, and about 3-10 times as fast on Linux and Mac OS X. So we're not talking about micro-optimizations. See more benchmarks in the "Benchmarks" section below.
Somewhat relatedly, many people have also asked for a version ofos.listdir() that yields filenames as it iterates instead of returning them as one big list. This improves memory efficiency for iterating very large directories.
So as well as a faster walk(), scandir adds a new scandir() function. They're pretty easy to use, but see "The API" below for the full docs.
Benchmarks
Below are results showing how many times as fast scandir.walk() is thanos.walk() on various systems, found by running benchmark.py with no arguments:
| System version | Python version | Times as fast |
|---|---|---|
| Windows 7 64-bit | 2.7.7 64-bit | 10.4 |
| Windows 7 64-bit SSD | 2.7.7 64-bit | 10.3 |
| Windows 7 64-bit NFS | 2.7.6 64-bit | 36.8 |
| Windows 7 64-bit SSD | 3.4.1 64-bit | 9.9 |
| Windows 7 64-bit SSD | 3.5.0 64-bit | 9.5 |
| Ubuntu 14.04 64-bit | 2.7.6 64-bit | 5.8 |
| Mac OS X 10.9.3 | 2.7.5 64-bit | 3.8 |
All of the above tests were done using the fast C version of scandir (source code in _scandir.c).
Note that the gains are less than the above on smaller directories and greater on larger directories. This is why benchmark.py creates a test directory tree with a standardized size.
The API
walk()
The API for scandir.walk() is exactly the same as os.walk(), so justread the Python docs.
scandir()
The full docs for scandir() and the DirEntry objects it yields are available in the Python documentation here. But below is a brief summary as well.
scandir(path='.') -> iterator of DirEntry objects for given path
Like listdir, scandir calls the operating system's directory iteration system calls to get the names of the files in the givenpath, but it's different from listdir in two ways:
- Instead of returning bare filename strings, it returns lightweight
DirEntryobjects that hold the filename string and provide simple methods that allow access to the additional data the operating system may have returned. - It returns a generator instead of a list, so that
scandiracts as a true iterator instead of returning the full list immediately.
scandir() yields a DirEntry object for each file and sub-directory in path. Just like listdir, the '.'and '..' pseudo-directories are skipped, and the entries are yielded in system-dependent order. Each DirEntry object has the following attributes and methods:
name: the entry's filename, relative to the scandirpathargument (corresponds to the return values ofos.listdir)path: the entry's full path name (not necessarily an absolute path) -- the equivalent ofos.path.join(scandir_path, entry.name)is_dir(*, follow_symlinks=True): similar topathlib.Path.is_dir(), but the return value is cached on theDirEntryobject; doesn't require a system call in most cases; don't follow symbolic links iffollow_symlinksis Falseis_file(*, follow_symlinks=True): similar topathlib.Path.is_file(), but the return value is cached on theDirEntryobject; doesn't require a system call in most cases; don't follow symbolic links iffollow_symlinksis Falseis_symlink(): similar topathlib.Path.is_symlink(), but the return value is cached on theDirEntryobject; doesn't require a system call in most casesstat(*, follow_symlinks=True): likeos.stat(), but the return value is cached on theDirEntryobject; does not require a system call on Windows (except for symlinks); don't follow symbolic links (likeos.lstat()) iffollow_symlinksis Falseinode(): return the inode number of the entry; the return value is cached on theDirEntryobject
Here's a very simple example of scandir() showing use of theDirEntry.name attribute and the DirEntry.is_dir() method:
def subdirs(path): """Yield directory names not starting with '.' under given path.""" for entry in os.scandir(path): if not entry.name.startswith('.') and entry.is_dir(): yield entry.name
This subdirs() function will be significantly faster with scandir than os.listdir() and os.path.isdir() on both Windows and POSIX systems, especially on medium-sized or large directories.
Further reading
- The Python docs for scandir
- PEP 471, the (now-accepted) Python Enhancement Proposal that proposed adding
scandirto the standard library -- a lot of details here, including rejected ideas and previous discussion
Flames, comments, bug reports
Please send flames, comments, and questions about scandir to Ben Hoyt:
File bug reports for the version in the Python 3.5 standard libraryhere, or file bug reports or feature requests for this module at the GitHub project page: