[Python-Dev] My summary of the scandir (PEP 471)

Ben Hoyt benhoyt at gmail.com
Tue Jul 1 15:00:32 CEST 2014


Thanks for spinning this off to (hopefully) finish the discussion. I agree it's nearly time to update the PEP.

@Ben: it's time to update your PEP to complete it with this discussion! IMO DirEntry must be as simple as possible and portable:

- os.scandir(str)
- DirEntry.lstat_result object only available on Windows, same result as os.lstat()
- DirEntry.fullname(): os.path.join(directory, DirEntry.name), where directory would be a hidden attribute of DirEntry

I'm quite strongly against this, and I think it's actually the worst of both worlds. It is not as good an API because:

(a) it doesn't call stat for you (on POSIX), so you have to check an attribute and call os.lstat() manually if you need it, turning what should be one line of code into four. Your proposal above was kind of how I had it originally, where you had to do extra tests and call os.lstat() manually if you needed it (see https://mail.python.org/pipermail/python-dev/2013-May/126119.html)

(b) the .lstat_result attribute is available on Windows but not on POSIX, meaning it's very easy for Windows developers to write code that will run and work fine on Windows, but then break horribly on POSIX; I think it'd be better if it broke hard on Windows too, to make writing cross-platform code easy
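To make point (a) concrete, here is a minimal sketch of the extra checking a caller would need under the proposal above. SketchEntry is a hypothetical stand-in for DirEntry, not a real class; it only models the proposed .name, .lstat_result and .fullname() members, with lstat_result left as None to mimic the POSIX case.

```python
import os
import tempfile


class SketchEntry:
    """Hypothetical stand-in for the proposed DirEntry (not a real API)."""

    def __init__(self, directory, name, lstat_result=None):
        self._directory = directory        # the "hidden" directory attribute
        self.name = name
        self.lstat_result = lstat_result   # None mimics the POSIX case

    def fullname(self):
        return os.path.join(self._directory, self.name)


def entry_size(entry):
    # The four lines the caller has to write under that proposal:
    st = entry.lstat_result
    if st is None:
        st = os.lstat(entry.fullname())
    return st.st_size


with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "a.txt")
    with open(target, "wb") as f:
        f.write(b"hello")
    posix_like = SketchEntry(d, "a.txt")                      # no stat info supplied
    windows_like = SketchEntry(d, "a.txt", os.lstat(target))  # stat info supplied
    sizes = (entry_size(posix_like), entry_size(windows_like))
```

With an .lstat() method on the entry, the body of entry_size would collapse to a single line, which is the one-line-versus-four difference described in (a).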

The two alternatives are:

  1. the original proposal in the current version of PEP 471, where DirEntry has an .lstat() method which calls stat() on POSIX but is free on Windows
  2. Nick Coghlan's proposal on the previous thread (https://mail.python.org/pipermail/python-dev/2014-June/135261.html) suggesting an ensure_lstat keyword param to scandir if you need the lstat_result value

I would make one small tweak to Nick Coghlan's proposal to make writing cross-platform code easier. Instead of .lstat_result being None sometimes (on POSIX), have it None always unless you specify ensure_lstat=True. (Actually, call it get_lstat=True to kind of make this more obvious.) Per (b) above, this means Windows developers wouldn't accidentally write code which failed on POSIX systems -- it'd fail fast on Windows too if you accessed .lstat_result without specifying get_lstat=True.
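A rough sketch of how that tweak might behave, built on os.listdir() and os.lstat() rather than on any real scandir implementation. The names scandir_sketch and SketchEntry are hypothetical, and get_lstat is the parameter name floated above, not an existing API.

```python
import os
import tempfile
from collections import namedtuple

# Hypothetical entry type: only the two fields discussed here.
SketchEntry = namedtuple("SketchEntry", ["name", "lstat_result"])


def scandir_sketch(path, get_lstat=False):
    """Yield entries; lstat_result is None unless get_lstat=True, on every platform."""
    for name in os.listdir(path):
        st = os.lstat(os.path.join(path, name)) if get_lstat else None
        yield SketchEntry(name, st)


with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "a.txt"), "wb") as f:
        f.write(b"hello")
    without = list(scandir_sketch(d))
    with_stat = list(scandir_sketch(d, get_lstat=True))
```

Because lstat_result is None on every platform when get_lstat is omitted, code such as without[0].lstat_result.st_size raises AttributeError on Windows and POSIX alike, which is the fail-fast behaviour being argued for.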

I'm still unsure which of these I like better. I think #1's API is slightly nicer without the ensure_lstat parameter, and error handling of the stat() is more explicit. But #2 always fetches the stat info at the same time as the dir entry info, so it eliminates the problem of having the file info change between scandir iteration and the .lstat() call.

I'm leaning towards preferring #2 (Nick's proposal) because it solves or gets around the caching issue. My one concern is error handling. Is it an issue if scandir's __next__ can raise an OSError either from the readdir() call or the call to stat()? My thinking is probably not. In practice, would it ever really happen that readdir() would succeed but an os.stat() immediately after would fail? I guess it could if the file is deleted, but then if it were deleted a microsecond earlier the readdir() would fail anyway, or not? Or does readdir give you a consistent, "snap-shotted" view on things?
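The window in question can be reproduced deliberately. This is only a sketch of the race, using os.listdir() and os.lstat() as stand-ins for readdir() and the per-entry stat call; it says nothing about whether readdir() itself gives a snapshot.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "gone.txt")
    open(path, "w").close()

    names = os.listdir(d)   # the directory read succeeds and reports the file
    os.remove(path)         # the file disappears before we get to stat it

    try:
        os.lstat(os.path.join(d, names[0]))
        stat_failed = False
    except OSError:         # FileNotFoundError is a subclass of OSError
        stat_failed = True
```

So a name can be returned by the directory read and still fail to stat a moment later, and an iterator that stats each entry eagerly would have to surface that OSError to the caller.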

The one other thing I'm not quite sure about with Nick's proposal is the name .lstat_result, as it's long. I can see why he suggested that, as .lstat sounds like a verb, but maybe that's okay? If we can have .is_dir and .is_file as attributes, my thinking is an .lstat attribute is fine too. I don't feel too strongly though.

- I don't think that we should support scandir(bytes). If you really want to support os.scandir(bytes), it must raise an error on Windows since bytes filenames are already deprecated. It wouldn't make sense to add a new function with a deprecated feature. Since we have PEP 383 (surrogateescape), it's better to advise using Unicode on all platforms. Almost all Python functions are able to encode back Unicode filenames automatically. Use os.fsencode() to encode manually if needed.

Really, are bytes filenames deprecated? I think maybe they should be, as they don't work on Windows :-), but the latest Python "os" docs (https://docs.python.org/3.5/library/os.html) still say that all functions that accept path names accept either str or bytes, and return a value of the same type where necessary. So I think scandir() should do the same thing.
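For reference, this is the "same type in, same type out" convention as it applies to os.listdir() today, which is the behaviour scandir() would be mirroring. It is a small sketch only; how the bytes form should behave on Windows is the open question here and is not shown.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "a.txt"), "w").close()

    str_names = os.listdir(d)                 # str in, str out
    bytes_names = os.listdir(os.fsencode(d))  # bytes in, bytes out

    # os.fsencode()/os.fsdecode() convert between the two forms.
    round_trip = [os.fsdecode(n) for n in bytes_names]
```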

- We may not define a DirEntry.fullname() method: the directory name is usually well known. Ok, but every time that I use os.listdir(), I write os.path.join(directory, name) because in some cases I want the full path.

Agreed. I use this a lot too. However, I'd prefer a .fullname attribute rather than a method, as it's free/cheap to compute and doesn't require OS calls.
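A minimal sketch of .fullname as an attribute rather than a method, assuming the entry keeps the directory it came from in a private field. SketchEntry is a hypothetical stand-in for DirEntry, used for illustration only.

```python
import os


class SketchEntry:
    """Hypothetical stand-in for DirEntry (illustration only)."""

    def __init__(self, directory, name):
        self._directory = directory   # kept private, as in the proposal above
        self.name = name

    @property
    def fullname(self):
        # Pure string manipulation: no system call is made here.
        return os.path.join(self._directory, self.name)


entry = SketchEntry("some_dir", "a.txt")
full = entry.fullname   # attribute access, no parentheses
```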

Out of interest, why do we have .is_dir and .stat_result but .fullname rather than .full_name? .fullname seems reasonable to me, but maybe consistency is a good thing here?

- It must not be possible to "refresh" a DirEntry object. Call os.stat(entry.fullname()) or pathlib.Path(entry.fullname()) to get fresh data. DirEntry is only computed once, that's all. It's well defined.

I agree refresh() is not needed -- just use os.stat() or pathlib.

- No Windows wildcard, you wrote that the feature has many corner cases, and it's only available on Windows. It's easy to combine scandir with fnmatch.

Agreed.
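Combining the two is indeed short. The sketch below assumes a Python in which os.scandir() is available and yields entries with a .name attribute as the PEP describes; scandir_glob is a hypothetical helper name.

```python
import fnmatch
import os
import tempfile


def scandir_glob(path, pattern):
    """Yield names in path matching an fnmatch-style pattern."""
    for entry in os.scandir(path):      # assumes os.scandir() is available
        if fnmatch.fnmatch(entry.name, pattern):
            yield entry.name


with tempfile.TemporaryDirectory() as d:
    for name in ("a.py", "b.py", "c.txt"):
        open(os.path.join(d, name), "w").close()
    matches = sorted(scandir_glob(d, "*.py"))
```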

-Ben


