[Python-Dev] Updates to PEP 471, the os.scandir() proposal

Ben Hoyt benhoyt at gmail.com
Wed Jul 9 03:08:03 CEST 2014


I did better than that -- I read the whole thing! ;)

Thanks. :-)

-1 on the PEP's implementation.

Just like an attribute does not imply a system call, having a method named 'isdir' /does/ imply a system call, and not having one can be just as misleading.

Why does a method imply a system call? os.path.join() and str.lower() don't make system calls. Isn't it just a matter of clear documentation? Anyway -- less philosophical discussion below.

If we have this:

    size = 0
    for entry in scandir('/some/path'):
        size += entry.stsize

  - on Windows, this should Just Work (if I have the names correct ;)
  - on Posix, etc., this should fail noisily with either an AttributeError ('entry' has no 'stsize') or a TypeError (cannot add None)

and the solution is equally simple:

    for entry in scandir('/some/path', stat=True):

  - if not Windows, perform a stat call at the same time

I'm not totally opposed to this, which is basically a combination of Nick Coghlan's and Paul Moore's recent proposals mentioned in the PEP. However, as discussed on python-dev, there are some edge cases it doesn't handle very well, and it's messier to handle errors (requires onerror as you mention below).

I presume you're suggesting that is_dir/is_file/is_symlink should be regular attributes, and accessing them should never do a system call. But what if the system doesn't support d_type (eg: Solaris) or the d_type value is DT_UNKNOWN (can happen on Linux, OS X, BSD)? The options are:

  1. scandir() would always call lstat() in the case of missing/unknown d_type. If so, scandir() is actually more expensive than listdir(), and as a result it's no longer safe to implement listdir in terms of scandir:

    def listdir(path='.'):
        return [e.name for e in scandir(path)]

  2. Or would it be better to have another flag like scandir(path, type=True) to ensure the is_X type info is fetched? This is explicit, but also getting kind of unwieldy.

  3. A third option is for the is_X attributes to be absent in this case (hasattr tests required, and the user would do the lstat manually). But as I noted on python-dev recently, you basically always want is_X, so this leads to unwieldy code that's twice as long as it needs to be. See here: https://mail.python.org/pipermail/python-dev/2014-July/135312.html

  4. I gather in your proposal above, scandir will call lstat() if stat=True? Except where does it put the values? Surely it should return an existing stat_result object, rather than stuffing everything onto the DirEntry, or throwing away some values on Linux? In this case, I'd prefer Nick Coghlan's approach of ensure_lstat and a .stat_result attribute. However, this still has the "what if d_type is missing or DT_UNKNOWN" issue.

It seems to me that making is_X() methods handles this exact scenario -- methods are so you don't have to do the dirty work.

So yes, the real world is messy due to missing is_X values, but I think it's worth getting this right, and is_X() methods can do this while keeping the API simple and cross-platform.
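
To make the "methods do the dirty work" point concrete, here is a rough pure-Python sketch, not the PEP's actual implementation. The DirEntry class, its d_type argument (None standing in for a missing or DT_UNKNOWN d_type, the strings "dir" and "file" for known types) and the listdir-based scandir() are all hypothetical stand-ins; only os.listdir(), os.lstat() and stat.S_ISDIR() are real library calls.

```python
import os
import stat


class DirEntry:
    """Hypothetical stand-in for the entry objects yielded by scandir()."""

    def __init__(self, dirpath, name, d_type=None):
        self.name = name
        self.full_name = os.path.join(dirpath, name)
        self._d_type = d_type  # None models a missing or DT_UNKNOWN d_type
        self._lstat = None     # cached os.lstat() result, filled on demand

    def lstat(self):
        # At most one lstat() system call per entry; later calls hit the cache.
        if self._lstat is None:
            self._lstat = os.lstat(self.full_name)
        return self._lstat

    def is_dir(self):
        if self._d_type is not None:
            # Type came for free with the directory listing: no system call.
            return self._d_type == "dir"
        # Type unknown: quietly fall back to lstat().
        return stat.S_ISDIR(self.lstat().st_mode)


def scandir(path="."):
    # os.listdir() carries no type info, so every entry here behaves
    # as if d_type were DT_UNKNOWN.
    for name in os.listdir(path):
        yield DirEntry(path, name)
```

The caller writes entry.is_dir() in both cases; whether a system call happens is an implementation detail of the method.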

Now, of course, we might get errors. I am not a big fan of wrapping everything in try/except, particularly when we already have a model to follow -- os.walk:

I don't mind the onerror too much if we went with this kind of approach. It's not quite as nice as a standard try/except around the method call, but it's definitely workable and has a precedent with os.walk().
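
For anyone who hasn't used the os.walk() precedent, here is a small sketch of how its onerror callback works today. The helper name walk_collecting_errors is made up for illustration; os.walk() and its onerror parameter are the real standard-library API, and the callback receives the OSError instance raised while listing a directory.

```python
import os


def walk_collecting_errors(top):
    """Return (file paths found, OSErrors reported by os.walk)."""
    errors = []
    paths = []
    # os.walk() swallows listing errors by default; onerror gets each one.
    for dirpath, dirnames, filenames in os.walk(top, onerror=errors.append):
        paths.extend(os.path.join(dirpath, name) for name in filenames)
    return paths, errors
```

The loop body stays free of try/except, at the cost of errors being handled away from the code that triggered them.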

It seems a bit like we're going around in circles here, and I think we have all the information and options available to us, so I'm going to SUMMARIZE.

We have a choice before us, a fork in the road. :-) We can choose one of these options for the scandir API:

  1. The current PEP 471 approach. This solves the issue with d_type being missing or DT_UNKNOWN, it doesn't require onerror, and it's a really tidy API that doesn't explode with AttributeErrors if you write code on Windows (without thinking too hard) and then move to Linux. I think all of these points are important -- the cross-platform one not the least, because we want to make it easy, even trivial, for people to write cross-platform code.

For reference, here's what get_tree_size() looks like with this approach, not including error handling with try/except:

    def get_tree_size(path):
        total = 0
        for entry in os.scandir(path):
            if entry.is_dir():
                total += get_tree_size(entry.full_name)
            else:
                total += entry.lstat().st_size
        return total
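
For comparison, a sketch of the same function with the try/except error handling added. One assumption to flag loudly: so that it runs, it uses the names from os.scandir() as found in Python 3.5 and later (entry.path and entry.stat(follow_symlinks=False)) rather than the draft full_name and lstat() spelled above. Skipping unreadable entries and counting them as zero bytes is just one possible policy.

```python
import os


def get_tree_size(path):
    """Total size in bytes of files under path, skipping unreadable entries."""
    total = 0
    try:
        entries = list(os.scandir(path))
    except OSError:
        return 0  # directory vanished or is not readable
    for entry in entries:
        try:
            # follow_symlinks=False mirrors the lstat() semantics of the draft.
            if entry.is_dir(follow_symlinks=False):
                total += get_tree_size(entry.path)
            else:
                total += entry.stat(follow_symlinks=False).st_size
        except OSError:
            continue  # entry disappeared or permission denied
    return total
```

Each error is caught right where the method call happens, which is the "standard try/except around the method call" style.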

  2. Nick Coghlan's model of only fetching the lstat value if ensure_lstat=True, and including an onerror callback for error handling when scandir calls lstat internally. However, as described, we'd also need an ensure_type=True option, so that scandir() isn't way slower than listdir() if you actually don't want the is_X values and d_type is missing/unknown.

For reference, here's what get_tree_size() looks like with this approach, not including error handling with onerror:

    def get_tree_size(path):
        total = 0
        for entry in os.scandir(path, ensure_type=True, ensure_lstat=True):
            if entry.is_dir:
                total += get_tree_size(entry.full_name)
            else:
                total += entry.lstat_result.st_size
        return total
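
To show where onerror would come in, here is a rough pure-Python emulation of this second model, built on os.listdir() and os.lstat(). Everything named here is hypothetical: the DirEntry class, the scandir() signature, and in particular the choice to skip an entry whose lstat() fails after reporting it to onerror (the proposal as discussed doesn't pin that behaviour down).

```python
import os
import stat


class DirEntry:
    """Hypothetical entry with plain attributes; None means "not fetched"."""

    def __init__(self, dirpath, name):
        self.name = name
        self.full_name = os.path.join(dirpath, name)
        self.is_dir = None
        self.lstat_result = None


def scandir(path=".", ensure_type=False, ensure_lstat=False, onerror=None):
    for name in os.listdir(path):
        entry = DirEntry(path, name)
        if ensure_type or ensure_lstat:
            try:
                st = os.lstat(entry.full_name)
            except OSError as exc:
                # Assumed policy: report to the callback, then skip the entry.
                if onerror is not None:
                    onerror(exc)
                continue
            entry.is_dir = stat.S_ISDIR(st.st_mode)
            if ensure_lstat:
                entry.lstat_result = st
        yield entry
```

Without either flag, the attributes stay None, which is the "TypeError (cannot add None)" failure mode mentioned earlier in the thread.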

I'm fairly strongly in favour of approach #1, but I wouldn't die if everyone else thinks the benefits of #2 outweigh the somewhat less nice API.

Comments and votes, please!

-Ben


