Issue 28564: shutil.rmtree is inefficient due to listdir() instead of scandir() (original) (raw)

Created on 2016-10-30 20:02 by enkore, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
shutil-rmtree-scandir.patch serhiy.storchaka,2016-11-07 17:29 review
bench_rmtree.py serhiy.storchaka,2016-11-24 08:27
Pull Requests
URL Status Linked Edit
PR 4085 merged serhiy.storchaka,2017-10-23 13:25
Messages (9)
msg279745 - (view) Author: Marian Beermann (enkore) Date: 2016-10-30 20:02
The use of os.listdir severely limits the speed of this function on anything except solid-state drives. Using the new-in-Python 3.5 os.scandir should eliminate this bottleneck.
msg279801 - (view) Author: Josh Rosenberg (josh.r) * (Python triager) Date: 2016-10-31 16:37
You need to cache the names up front because the loop is unlinking entries, and readdir isn't consistent when the directory entries are mutated between calls. https://github.com/kripken/emscripten/issues/2528 FindFirstFile/FindNextFile likely has similar issues, even if they're not consistently seen (due to DeleteFile itself not guaranteeing deletion until the last handle to the file is closed). scandir might save some stat calls, but you'd need to convert it from generator to list before the loop begins, which would limit the savings a bit.
msg279813 - (view) Author: Marian Beermann (enkore) Date: 2016-10-31 17:40
The main issue on *nix is more likely that by using listdir you get directory order, while what you really need is inode ordering. scandir allows for that, since you get the inode from the DirEntry with no extra syscalls - especially without an open() or stat(). Other optimizations are also possible. For example opening the directory and using unlinkat() would likely shave off a bit of CPU. But the dominating factor here is likely the bad access pattern.
msg280213 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-11-07 17:29
Proposed patch implements shutil.rmtree using os.scandir. Needed file descriptors support in os.scandir (). I did not test how this affects the performance of shutil.rmtree.
msg281618 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-11-24 08:27
Benchmarks show about 20% speed up.
msg305000 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-10-25 16:25
Following Antoine's suggestion the patch now makes shutil.rmtree() using os.scandir() on all platforms. I doubt about one thing. This patch changes os.listdir passed to the onerror handler to os.scandir. This can break a user code that checks if the first argument in onerror is os.listdir. If keep this change, it should be documented in the "Porting to 3.7" section. Alternatively, we can continue passing os.listdir if os.scandir() failed despites the fact that os.listdir no longer used.
msg305001 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-10-25 16:25
I think we should change to os.scandir. No need to accumulate compatibility baggage like that.
msg305556 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-11-04 12:16
New changeset d4d79bc1ff91b04625c312f0219c89aabcd19ce4 by Serhiy Storchaka in branch 'master': bpo-28564: Use os.scandir() in shutil.rmtree(). (#4085) https://github.com/python/cpython/commit/d4d79bc1ff91b04625c312f0219c89aabcd19ce4
msg309219 - (view) Author: Niklas Hambüchen (nh2) Date: 2017-12-30 05:15
I've filed https://bugs.python.org/issue32453, which is about O(n^2) deletion behaviour for large directories.
History
Date User Action Args
2022-04-11 14:58:38 admin set github: 72750
2017-12-30 05:15:42 nh2 set nosy: + nh2messages: +
2017-11-04 12:46:52 serhiy.storchaka set status: open -> closedresolution: fixedstage: patch review -> resolved
2017-11-04 12:16:38 serhiy.storchaka set messages: +
2017-10-25 16:25:57 pitrou set messages: +
2017-10-25 16:25:02 serhiy.storchaka set nosy: + pitroumessages: +
2017-10-23 13:25:24 serhiy.storchaka set pull_requests: + <pull%5Frequest4055>
2016-11-24 08:27:44 serhiy.storchaka set files: + bench_rmtree.pymessages: +
2016-11-07 17:29:17 serhiy.storchaka set files: + shutil-rmtree-scandir.patchkeywords: + patchmessages: + stage: patch review
2016-10-31 23:53:40 serhiy.storchaka set dependencies: + Add support of file descriptor in os.scandir()
2016-10-31 17:40:23 enkore set messages: +
2016-10-31 16:37:10 josh.r set nosy: + josh.rmessages: +
2016-10-31 11:23:29 xiang.zhang set nosy: + xiang.zhang
2016-10-30 20:40:04 serhiy.storchaka set nosy: + serhiy.storchakatype: enhancementversions: + Python 3.7, - Python 3.5
2016-10-30 20:02:39 enkore create