[Python-Dev] Remaining decisions on PEP 471 -- os.scandir()
Ben Hoyt benhoyt at gmail.com
Mon Jul 14 02:33:16 CEST 2014
Hi folks,
Thanks Victor, Nick, Ethan, and others for continued discussion on the scandir PEP 471 (most recent thread starts at https://mail.python.org/pipermail/python-dev/2014-July/135377.html).
Just an aside ... I was reminded again recently why scandir() matters: a scandir user emailed me the other day, saying "I used scandir to dump the contents of a network dir in under 15 seconds. 13 root dirs, 60,000 files in the structure. This will replace some old VBA code embedded in a spreadsheet that was taking 15-20 minutes to do the exact same thing." I asked if he could run scandir's benchmark.py on his directory tree, and here's what it printed out:
C:\Python34\scandir-master>benchmark.py "\my\network\directory"
Using fast C version of scandir
Priming the system's cache...
Benchmarking walks on \my\network\directory, repeat 1/3...
Benchmarking walks on \my\network\directory, repeat 2/3...
Benchmarking walks on \my\network\directory, repeat 3/3...
os.walk took 8739.851s, scandir.walk took 129.500s -- 67.5x as fast
That's right -- os.walk() with scandir was almost 70x as fast as the current version! Admittedly this is a network file system, but that's still a real and important use case. It really pays not to throw away information the OS gives you for free. :-)
On the recent python-dev thread, Victor especially made some well-thought-out suggestions. It seems to me there's general agreement that the basic API in PEP 471 is good (Ethan wasn't a fan at first, but it seems he's on board after further discussion :-).
That said, I think there's basically one thing remaining to decide: whether or not to have DirEntry.is_dir() and .is_file() follow symlinks by default. I think Victor made a pretty good case that:
(a) following links is usually what you want
(b) that's the precedent set by the similar functions os.path.isdir() and pathlib.Path.is_dir(), so to do otherwise would be confusing
(c) with the non-link-following version, if you wanted to follow links you'd have to say something like "if (entry.is_symlink() and os.path.isdir(entry.full_name)) or entry.is_dir()" instead of just "if entry.is_dir()"
(d) it's error prone to have to do (c), as I found out recently when my implementation of os.walk() with scandir had a bug from getting this exact test wrong
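To make point (c) concrete, here is a rough sketch of the test you'd have to write with a non-link-following is_dir(). It assumes the os.scandir() API as it later shipped in Python 3.5+ (the `with` form needs 3.6+), which differs from the draft discussed here: entry.path stands in for entry.full_name, and entry.is_dir(follow_symlinks=False) stands in for the non-link-following is_dir(). The helper names are made up for illustration.

```python
import os


def is_dir_following_links(entry):
    """Hypothetical helper: the workaround from point (c)."""
    # entry.is_dir(follow_symlinks=False) plays the role of the proposed
    # non-link-following is_dir(); entry.path plays the role of full_name.
    return (entry.is_symlink() and os.path.isdir(entry.path)) or \
        entry.is_dir(follow_symlinks=False)


def subdirs(path):
    """Return the sorted names of entries in path that are (links to) dirs."""
    with os.scandir(path) as entries:
        return sorted(e.name for e in entries if is_dir_following_links(e))
```

With link-following is_dir() as the default, the whole helper collapses to "entry.is_dir()", which is the simplification argued for in (c) and (d).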
If we go with Victor's link-following .is_dir() and .is_file(), then we probably need to add his suggestion of a follow_symlinks parameter (defaulting to True) that you can set to False to ask about the entry itself. Either that or you have to say "stat.S_ISDIR(entry.lstat().st_mode)" instead, which is a little bit less nice.
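For comparison, the two spellings might look roughly like this. The sketch assumes the os.scandir() API as it later shipped in Python 3.5+, where there is no DirEntry.lstat(); entry.stat(follow_symlinks=False) is used in its place. The function names are hypothetical.

```python
import os
import stat


def is_real_dir(entry):
    """True only if the entry itself is a directory (links not followed)."""
    return entry.is_dir(follow_symlinks=False)


def is_real_dir_via_stat(entry):
    """Same question, asked through the stat result instead."""
    # entry.stat(follow_symlinks=False) stands in for entry.lstat().
    return stat.S_ISDIR(entry.stat(follow_symlinks=False).st_mode)
```

Both should give the same answer for any entry; the first is just easier to read and harder to get wrong.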
As a KISS enthusiast, I admit I'm still somewhat partial to the DirEntry methods just returning (non-link-following) info about the directory entry itself. However, I can definitely see the error-proneness of that, and the advantages given the points above. So I guess I'm on the fence.
Given the above arguments for symlink-following is_dir()/is_file() methods (have I missed any, Victor?), what do others think?
I'd be very keen to come to a consensus on this, so that I can make some final updates to the PEP and see about getting it accepted and/or implemented. :-)
-Ben