Issue 25596: Use scandir() to speed up the glob module (original) (raw)

Created on 2015-11-10 11:18 by xdegaye, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
glob_isdir.patch	xdegaye,2015-11-10 11:18	review
glob_isdir_2.patch	xdegaye,2015-11-10 15:35	review
glob_scandir.patch	serhiy.storchaka,2016-01-02 15:22	review
glob_scandir_2.patch	serhiy.storchaka,2016-01-03 08:23	review
glob_scandir_doc.patch	xdegaye,2016-01-08 13:53	review
glob_scandir_2_diff.patch	xdegaye,2016-01-10 14:53
glob_scandir_3.patch	xdegaye,2016-01-10 14:53	review
glob_scandir_4.patch	serhiy.storchaka,2016-01-11 08:33	review
glob_scandir_5.patch	serhiy.storchaka,2016-02-11 11:57	review

Messages (30)
msg254440 - (view)	Author: Xavier de Gaye (xdegaye) *	Date: 2015-11-10 11:18
The glob module happily joins names of regular files together with os.path.join() or attempts to list the files contained into a regular file (sic). The same 'except os.error' statement is used to handle both these cases and the case of a non readable directory. The attached patch makes the code more correct and easier to understand.
msg254449 - (view)	Author: Xavier de Gaye (xdegaye) *	Date: 2015-11-10 15:35
This second patch rewrites the conditionals that decide whether to call _iglob() recursively, in a more natural way and without changing the behavior from the first patch. Note that when 'dirname == pathname', then basename is empty and glob2() or glob1() are not called. The patch also refactors the code with a new _listdir() function.
msg257347 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2016-01-02 15:22
In general the patch LGTM. It allows to speed up glob by 5-10% in warn test. But much more gain we can achieve by using os.scandir(). Here are results of microbenchmarks: $ ./python -m timeit -s "from glob import glob" -- "glob('*/', recursive=True)" Unpatched: 201 msec per loop Using isdir(): 181 msec per loop Using scandir(): 65.2 msec per loop $ ./python -m timeit -s "from glob import glob" -- "glob('/usr/lib//', recursive=True)" Unpatched: 2.06 sec per loop Using isdir(): 1.92 sec per loop Using scandir(): 820 msec per loop $ ./python -m timeit -s "from glob import glob" -- "glob('/usr/lib/*/', recursive=True)" Unpatched: 1.77 sec per loop Using isdir(): 1.61 sec per loop Using scandir(): 431 msec per loop Yet one benefit is that iglob() now yields path names without the delay for reading the full content of a directory (see ).
msg257354 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2016-01-02 19:49
I don't think it is a good idea to break backward compatibility here even if the globN are internal functions. Better would be to continue to have globN functions that support the existing API and emit deprecation warnings.
msg257403 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2016-01-03 08:23
I think so too. I just wanted someone to confirmed that it is not overcautiousness. For now glob1() is used only in one place in the stdlib (). But there was other problem with previous patch, the same as with current implementation of os.walk() (). It makes glob to use a lot of file descriptors. Updated patch lefts deprecated glob1() and glob2() and makes glob to consume all scandir iterator before starting to yield values (but the problem with fd leaks on non-refcounted implementations is still left, ). This doesn't affect performance, but lefts the issue with delaying ().
msg257421 - (view)	Author: Xavier de Gaye (xdegaye) *	Date: 2016-01-03 14:55
os.scandir() is called recursively in the last patch and the file descriptors are not closed until returning from the recursion. The glob functions should fail explicitly when scandir() raises OSERROR with posix errno set to EMFILE (The process has too many files open), or even better, silently ignore only the PermissionError exception instead of ignoring OSERROR.
msg257423 - (view)	Author: Xavier de Gaye (xdegaye) *	Date: 2016-01-03 16:30
I may be missing something, anyway here are few comments on the patch: Is the call to entry.is_symlink() in _iterdir() necessary since is_dir() follows symlinks ? If _iterdir() would yield the DirEntry instance instead of DirEntry.name, then _rlistdir() could use x.is_dir() to know whether it is correct to iterate over _rlistdir(x.path, dironly)
msg257424 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2016-01-03 17:20
> os.scandir() is called recursively in the last patch and the file descriptors are not closed until returning from the recursion. No, os.scandir() is called non-recursively in _iterdir(), and since _iterdir() results are always accumulated in a list, a recursion starts only after exhausting the os.scandir() iterator and closing the file descriptor. We need first to resolve to close the file descriptor explicitly. > The glob functions should fail explicitly when scandir() raises OSERROR with posix errno set to EMFILE (The process has too many files open), or even better, silently ignore only the PermissionError exception instead of ignoring OSERROR. Patched code passes existing tests. Do you have additional tests? > Is the call to entry.is_symlink() in _iterdir() necessary since is_dir() follows symlinks ? Ah, I thought is_dir() doesn't follow symlinks. Yes, now the call to entry.is_symlink() is not necessary. > If _iterdir() would yield the DirEntry instance instead of DirEntry.name, then _rlistdir() could use x.is_dir() to know whether it is correct to iterate over _rlistdir(x.path, dironly) Yes, but this can complicate the rest of the code. _rlistdir() is called with dironly=False only when the pattern ends with ''. I'm not sure this is enough important case for optimization. In most cases '' is used in the middle of the pattern, and all names yielded by _iterdir() are directory names (or broken symlinks).
msg257456 - (view)	Author: Xavier de Gaye (xdegaye) *	Date: 2016-01-04 11:20
> and since _iterdir() results are always accumulated in a list Right, I failed to note that point. And so, since the file descriptor opened by os.scandir() is closed within each call to recursive _rlistdir(), then my other comment about EMFILE does not stand as well.
msg257655 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-01-06 22:59
Related issue: #26032 "Use scandir() to speed up pathlib globbing".
msg257659 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2016-01-06 23:08
(IOW once this patch has been applied maybe you can do the same for globbing in pathlib as requested in issue #26032.)
msg257758 - (view)	Author: Xavier de Gaye (xdegaye) *	Date: 2016-01-08 13:53
Adding a doc patch.
msg257912 - (view)	Author: Xavier de Gaye (xdegaye) *	Date: 2016-01-10 14:53
FWIW I have followed the idea of having _iterdir() yielding the DirEntry entry instead of the name in Serhiy's patch. There is a slight performance gain. Now _glob0() and _glob1() do return a list of directories when dironly is true but there is now another place where OSError must be tracked, so it is not clear if this is worth it. glob_scandir_2_diff.patch is the differential patch between glob_scandir_2.patch and glob_scandir_3.patch. Here are the performance tests run with both patches. $ ./python -m timeit -s "from glob import glob" -- "glob('*/', recursive=True)" glob_scandir_2.patch: 33.1 msec per loop glob_scandir_3.patch: 33.8 msec per loop $ ./python -m timeit -s "from glob import glob" -- "glob('/usr/lib//', recursive=True)" glob_scandir_2.patch: 927 msec per loop glob_scandir_3.patch: 850 msec per loop $ ./python -m timeit -s "from glob import glob" -- "glob('/usr/lib/*/', recursive=True)" glob_scandir_2.patch: 423 msec per loop glob_scandir_3.patch: 337 msec per loop
msg257930 - (view)	Author: Xavier de Gaye (xdegaye) *	Date: 2016-01-10 20:04
Oops, the list comprehension in _glob1() of glob_scandir_3.patch should be instead: names = [x.name for x in _iterdir(dirname, dironly)] This does not change the tests results except it improves glob('/usr/lib//', True) that gives now 820 msec instead of 927 msec with glob_scandir_2.patch.
msg257938 - (view)	Author: Xavier de Gaye (xdegaye) *	Date: 2016-01-10 21:09
Actually the microbenchmarks comparison between glob_scandir_2.patch and glob_scandir_3.patch must be made by removing the call to entry.is_symlink() also in glob_scandir_2.patch. In that case the microbenchmarks give about the same results for each patch. Sorry for the noise.
msg257947 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2016-01-11 03:14
@serhiy, I assume this is in your capable hands? Personally I'd like to have seen the refactoring in a separate diff from the switch to scandir(), just to ensure that the refactoring does not change anything and to make it easier to read the changes for the scandir() switch.
msg257953 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2016-01-11 08:33
Thank you Xavier for correcting the documentation. Here is minimal patch that switches the glob module to scandir().
msg257954 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2016-01-11 12:18
Results of microbenchmarks for glob_scandir_4.patch: $ ./python -m timeit -s "from glob import glob" -- "glob('*/', recursive=True)" Unpatched: 275 msec per loop Patched: 79.4 msec per loop $ ./python -m timeit -s "from glob import glob" -- "glob('/usr/lib//', recursive=True)" Unpatched: 2.11 sec per loop Patched: 624 msec per loop $ ./python -m timeit -s "from glob import glob" -- "glob('/usr/lib/*/', recursive=True)" Unpatched: 1.79 sec per loop Patched: 281 msec per loop
msg257958 - (view)	Author: Ben Hoyt (benhoyt) *	Date: 2016-01-11 14:25
Nice! Good to see scandir() getting used to speed up other functions, functions that folks are using a ton already, like glob().
msg258009 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2016-01-11 20:41
Looks like switching to scandir will also resolve http://bugs.python.org/issue22167 without further action.
msg258015 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2016-01-11 20:56
Not this version of the path. It accumulates all scandir results in a list before yielding values. Otherwise globbing will consume a lot of open file descriptors and could leak them (see , ).
msg258016 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2016-01-11 20:58
Oh well. Too bad. Never mind then.
msg260098 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2016-02-11 11:57
The patch is completed now. The with statement is used to avoid FD leaks.
msg273912 - (view)	Author: Dima Tisnek (Dima.Tisnek) *	Date: 2016-08-30 12:18
Who can review Serhiy's patch? P.S. core of the patch seems good to me, but I'm not an expert on stdlib.
msg274611 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2016-09-06 19:33
I think the change in general is approved by GvR. If there are some implementation bugs, we can fix them later.
msg274612 - (view)	Author: Roundup Robot (python-dev)	Date: 2016-09-06 19:35
New changeset cb7ee9d9cddd by Serhiy Storchaka in branch 'default': Issue #25596: Optimized glob() and iglob() functions in the https://hg.python.org/cpython/rev/cb7ee9d9cddd
msg274696 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2016-09-07 01:26
Thanks Serhiy! --Guido (mobile) On Sep 6, 2016 12:35 PM, "Roundup Robot" <report@bugs.python.org> wrote: > > Roundup Robot added the comment: > > New changeset cb7ee9d9cddd by Serhiy Storchaka in branch 'default': > Issue #25596: Optimized glob() and iglob() functions in the > https://hg.python.org/cpython/rev/cb7ee9d9cddd > > ---------- > nosy: +python-dev > > _______________________________________ > Python tracker <report@bugs.python.org> > <http://bugs.python.org/issue25596> > _______________________________________ >
msg274770 - (view)	Author: Roundup Robot (python-dev)	Date: 2016-09-07 06:50
New changeset 48e573e0a610 by Serhiy Storchaka in branch 'default': Issue #25596: Falls back to listdir in glob for bytes paths on Windows. https://hg.python.org/cpython/rev/48e573e0a610
msg274783 - (view)	Author: Dima Tisnek (Dima.Tisnek) *	Date: 2016-09-07 09:09
@Serhiy please comment the implications / limitations of the fallback on Windows. Is it that scandir cannot handle bytes argument only? If argument is unicode, but response set contains bytes paths, will that work?
msg274794 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2016-09-07 11:10
os.scandir() cannot handle bytes argument on Windows. If an argument is string, os.scandir() yields entries with string names, if an argument is bytes object, os.scandir() yields entries with bytes names. Opened for adding support of bytes paths in os.scandir(). After this will be implemented, a workaround cb7ee9d9cddd can be reverted.

History
Date	User	Action	Args
2022-04-11 14:58:23	admin	set	github: 69782
2017-02-07 19:05:40	serhiy.storchaka	link	issue19240 superseder
2016-09-07 11:10:52	serhiy.storchaka	set	status: open -> closedresolution: fixedmessages: + stage: patch review -> resolved
2016-09-07 09:09:36	Dima.Tisnek	set	messages: +
2016-09-07 06:50:02	python-dev	set	messages: +
2016-09-07 01:26:13	gvanrossum	set	messages: +
2016-09-06 19:35:28	python-dev	set	nosy: + python-devmessages: +
2016-09-06 19:33:31	serhiy.storchaka	set	messages: +
2016-08-30 12🔞16	Dima.Tisnek	set	nosy: + Dima.Tisnekmessages: +
2016-02-11 11:57:18	serhiy.storchaka	set	files: + glob_scandir_5.patchmessages: +
2016-01-11 20:58:48	gvanrossum	set	messages: +
2016-01-11 20:56:44	serhiy.storchaka	set	messages: +
2016-01-11 20:41:10	gvanrossum	set	messages: +
2016-01-11 14:25:58	benhoyt	set	messages: +
2016-01-11 12:22:00	serhiy.storchaka	set	title: Use scandir() to speed the glob module -> Use scandir() to speed up the glob module
2016-01-11 12🔞20	serhiy.storchaka	set	messages: +
2016-01-11 08:33:02	serhiy.storchaka	set	files: + glob_scandir_4.patchmessages: + title: regular files handled as directories in the glob module -> Use scandir() to speed the glob module
2016-01-11 03:14:18	gvanrossum	set	messages: +
2016-01-10 21:09:31	xdegaye	set	messages: +
2016-01-10 20:04:58	xdegaye	set	messages: +
2016-01-10 14:53:47	xdegaye	set	files: + glob_scandir_3.patch
2016-01-10 14:53:35	xdegaye	set	files: + glob_scandir_2_diff.patchmessages: +
2016-01-08 13:53:03	xdegaye	set	files: + glob_scandir_doc.patchmessages: +
2016-01-07 12:46:07	serhiy.storchaka	link	issue26032 dependencies
2016-01-07 04:11:13	r.david.murray	set	nosy: - r.david.murray
2016-01-06 23:08:37	gvanrossum	set	nosy: + gvanrossummessages: +
2016-01-06 22:59:13	vstinner	set	messages: +
2016-01-04 11:20:36	xdegaye	set	messages: +
2016-01-03 17:20:34	serhiy.storchaka	set	messages: +
2016-01-03 16:30:28	xdegaye	set	messages: +
2016-01-03 14:55:45	xdegaye	set	messages: +
2016-01-03 08:23:29	serhiy.storchaka	set	files: + glob_scandir_2.patchdependencies: + File descriptor leaks in os.scandir()messages: +
2016-01-02 19:49:16	r.david.murray	set	nosy: + r.david.murraymessages: +
2016-01-02 15:52:32	serhiy.storchaka	set	dependencies: + Avoid using private function glob.glob1() in msi module and tools
2016-01-02 15:22:53	serhiy.storchaka	set	files: + glob_scandir.patchpriority: low -> normalnosy: + vstinner, benhoytmessages: +
2015-11-10 15:58:18	serhiy.storchaka	set	priority: normal -> lowassignee: serhiy.storchakanosy: + serhiy.storchakastage: patch review
2015-11-10 15:35:35	xdegaye	set	files: + glob_isdir_2.patchmessages: +
2015-11-10 11🔞07	xdegaye	create