[Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

Nick Coghlan ncoghlan at gmail.com
Sun Jun 29 14:28:14 CEST 2014


On 29 June 2014 21:45, Paul Moore <p.f.moore at gmail.com> wrote:

On 29 June 2014 12:08, Nick Coghlan <ncoghlan at gmail.com> wrote:

This is what makes me wary of including lstat, even though Windows offers it without the extra stat call. Caching behaviour is really hard to make intuitive, especially when it sometimes returns data that looks fresh (as it does on first call on POSIX systems).

If it matters that much we could simply call it cachedlstat(). It's ugly, but I really don't like the idea of throwing the information away - after all, the fact that we currently throw data away is why there's even a need for scandir. Let's not make the same mistake again...

Future-proofing is the reason DirEntry is a full fledged class in the first place, though.

Effectively communicating the behavioural difference between DirEntry and pathlib.Path is the main thing that makes me nervous about adhering too closely to the Path API.

To restate the problem and the alternative proposal, these are the DirEntry methods under discussion:

is_dir(): like os.path.isdir(), but requires no system calls on at least POSIX and Windows
is_file(): like os.path.isfile(), but requires no system calls on at least POSIX and Windows
is_symlink(): like os.path.islink(), but requires no system calls on at least POSIX and Windows
lstat(): like os.lstat(), but requires no system calls on Windows

For the almost-certain-to-be-cached items, the suggestion is to make them properties (or just ordinary attributes):

is_dir
is_file
is_symlink

What to do with lstat() is currently less clear, since POSIX directory scanning doesn't provide that level of detail by default.

The PEP also doesn't currently state whether the is_dir(), is_file() and is_symlink() results would be updated if a call to lstat() produced different answers than the original directory scanning process, which further suggests to me that allowing the stat call to be delayed on POSIX systems is a potentially problematic and inherently confusing design. We would have two options:

Those both sound ugly to me.

So, here's my alternative proposal: add an "ensure_lstat" flag to scandir() itself, and don't have any methods on DirEntry, only attributes.

That would make the DirEntry attributes:

is_dir: boolean, always populated
is_file: boolean, always populated
is_symlink: boolean, always populated
lstat_result: stat result, may be None on POSIX systems if ensure_lstat is False

(I'm not particularly sold on "lstat_result" as the name, but "lstat" reads as a verb to me, so doesn't sound right as an attribute name)
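To make the attribute-only idea concrete, here is a minimal sketch. It is an illustration under stated assumptions, not a proposed implementation: scandir_sketch and DirEntrySketch are hypothetical names, the name and path fields are added only for convenience, and since pure Python cannot see the entry type that directory scanning reports on many POSIX systems, this emulation calls os.lstat() for every entry and simply hides the result when ensure_lstat is False.

```python
import os
import stat
from collections import namedtuple

# Hypothetical stand-in for the proposed DirEntry: attributes only, no methods.
DirEntrySketch = namedtuple(
    "DirEntrySketch", "name path is_dir is_file is_symlink lstat_result"
)


def scandir_sketch(path=".", ensure_lstat=False):
    """Yield one DirEntrySketch per entry in *path*.

    Assumption: a real implementation would fill the is_* flags from the
    directory scan itself. This emulation has no access to that data, so it
    always calls os.lstat() and only exposes the result on request.
    """
    for name in os.listdir(path):
        full = os.path.join(path, name)
        st = os.lstat(full)  # does not follow symlinks
        mode = st.st_mode
        yield DirEntrySketch(
            name=name,
            path=full,
            is_dir=stat.S_ISDIR(mode),
            is_file=stat.S_ISREG(mode),
            is_symlink=stat.S_ISLNK(mode),
            lstat_result=st if ensure_lstat else None,
        )
```

A caller that only needs the entry type would iterate with the default, while one that needs sizes or timestamps would pass ensure_lstat=True and read entry.lstat_result.st_size directly.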

What this would allow:

Most importantly, regardless of platform, the cached stat result (if not None) would reflect the state of the entry at the time the directory was scanned, rather than at some arbitrary later point in time when lstat() was first called on the DirEntry object.

There'd still be a slight window of discrepancy (since the filesystem state may change between reading the directory entry and making the lstat() call), but this could be effectively eliminated from the perspective of the Python code by making the result of the lstat() call authoritative for the whole DirEntry object.
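As a rough sketch of what "authoritative" could mean here: if an lstat result is available, the three type flags are derived from it and whatever the directory scan reported is discarded. The helper name resolve_flags and the tuple layout are hypothetical, chosen only for this illustration.

```python
import stat


def resolve_flags(scan_flags, lstat_result=None):
    """Return (is_dir, is_file, is_symlink) for one entry.

    scan_flags is the tuple reported by the directory scan. When an lstat
    result is supplied, it overrides the scan data entirely, so the flags
    and the stat result can never disagree with each other.
    """
    if lstat_result is None:
        return tuple(scan_flags)
    mode = lstat_result.st_mode
    return (stat.S_ISDIR(mode), stat.S_ISREG(mode), stat.S_ISLNK(mode))
```

With this rule, an entry that was replaced between the scan and the lstat call is described consistently by the later of the two observations.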

Regards, Nick.

P.S. We'd be generating quite a few of these, so we can use __slots__ to keep the memory overhead to a minimum (that's just a general comment - it's really irrelevant to the methods-or-attributes question).
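For illustration only, a class that declares its attributes up front avoids allocating a per-instance dictionary. The class name SlottedEntry and its field list are hypothetical and simply mirror the attributes discussed above.

```python
class SlottedEntry(object):
    """Hypothetical entry class with a fixed attribute set and no __dict__."""

    __slots__ = ("name", "is_dir", "is_file", "is_symlink", "lstat_result")

    def __init__(self, name, is_dir, is_file, is_symlink, lstat_result=None):
        self.name = name
        self.is_dir = is_dir
        self.is_file = is_file
        self.is_symlink = is_symlink
        self.lstat_result = lstat_result
```

The trade-off is that instances cannot grow extra attributes later, which seems acceptable for a simple record type like this.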

-- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia


