Issue 33021: Some fstat() calls do not release the GIL, possibly hanging all threads (original) (raw)

Issue33021

process

Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: ZackerySpytz, brett.cannon, eric.snow, josh.r, ncoghlan, nirs, pitrou, pmpp, rbcollins, serhiy.storchaka, twouters, vstinner, yselivanov
Priority: normal Keywords: patch

Created on 2018-03-07 23:03 by nirs, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
mmap_nfs_test.py nirs,2018-03-16 23:57
mmap_size_nfs_test.py nirs,2018-03-16 23:58
fdopen_nfs_test.py nirs,2018-03-16 23:58
Pull Requests
URL Status Linked Edit
PR 6019 merged nirs,2018-03-07 23:07
PR 6020 closed nirs,2018-03-07 23:14
PR 6117 merged ZackerySpytz,2018-03-14 19:07
PR 6159 merged miss-islington,2018-03-20 18:51
PR 6160 merged miss-islington,2018-03-20 18:51
Messages (17)
msg313407 - (view) Author: Nir Soffer (nirs) * Date: 2018-03-07 23:03
If the file descriptor is on a non-responsive NFS server, calling fstat() can block for long time, hanging all threads. Most of the fstat() calls release the GIL around the call, but some calls seems to be forgotten. In python 3, the calls are handled now by _py_fstat(), releasing the GIL internally, but some calls use _py_fstat_noraise() which does not release the GIL. Most of the calls to _py_fstat_noraise() release the GIL around the call, except these 2 calls, affecting users of: - mmap.mmap() - os.urandom() - random.seed() In python there are more fstat() calls to fix, affecting users of: - imp.load_dynamic() - imp.load_source() - mmap.mmap() - mmapobject.size() - os.fdopen() - os.urandom() - random.seed()
msg313409 - (view) Author: pmp-p (pmpp) * Date: 2018-03-07 23:21
is fstat thread safe ?
msg313417 - (view) Author: Josh Rosenberg (josh.r) * (Python triager) Date: 2018-03-08 02:24
fstat is async signal safe, and I suspect it's thread safe in general, though usage does rely on the file descriptor remaining valid. If the fd in question is closed in another thread after the GIL is released, fstat would fail; if a new file is opened and assigned the same fd, it would give erroneous results. But I don't think any of that would lead to fundamentally incorrect behavior (after all, you could do the same thing manually by caching an fd then closing the file).
msg313419 - (view) Author: pmp-p (pmpp) * Date: 2018-03-08 03:01
josh.r. i think you are right, i was worried if a nfs sillyrename is in progress, for eg a lock file ,then server hangs but thread lift the GIL and allow another thread to try to start to fstat the same path.
msg313434 - (view) Author: Nir Soffer (nirs) * Date: 2018-03-08 11:28
Python cannot protect raw file descriptor from bad multi-threaded application. For example the application may close a file descriptor twice which may lead to closing unrelated file descriptor created by another thread just after it was closed, before the second close. This issue affects all function using raw file descriptors, and we cannot protect them with the GIL. Even if fstat was not thread safe, we cannot protect it using the GIl since this blocks the entire application until fstat returns.
msg313627 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2018-03-11 23:39
New changeset 4484f9dca9149da135bbae035f10a50d20d1cbbb by Antoine Pitrou (Nir Soffer) in branch 'master': bpo-33021: Release the GIL during fstat() calls (GH-6019) https://github.com/python/cpython/commit/4484f9dca9149da135bbae035f10a50d20d1cbbb
msg313836 - (view) Author: Zackery Spytz (ZackerySpytz) * (Python triager) Date: 2018-03-14 19:09
Commit 4484f9dca9149da135bbae035f10a50d20d1cbbb causes GCC 7.2.0 to emit a warning. cpython/Modules/mmapmodule.c: In function β€˜new_mmap_object’: cpython/Modules/mmapmodule.c:1126:18: warning: β€˜fstat_result’ may be used uninitialized in this function [-Wmaybe-uninitialized] if (fd != -1 && fstat_result == 0 && S_ISREG(status.st_mode)) { A PR is attached.
msg313837 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2018-03-14 20:08
New changeset d6e140466142018ddbb7541185348be2b833a2ce by Antoine Pitrou (Zackery Spytz) in branch 'master': bpo-33021: Fix GCC 7 warning (-Wmaybe-uninitialized) in mmapmodule.c (#6117) https://github.com/python/cpython/commit/d6e140466142018ddbb7541185348be2b833a2ce
msg313839 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2018-03-14 20:10
With Benjamin we've decided that this wouldn't happen in 2.7. Performance improvements don't warrant a backport. Thank you Nir for reporting and posting the patches!
msg313980 - (view) Author: Nir Soffer (nirs) * Date: 2018-03-16 23:57
Antoine, thanks for fixing this on master! but I don't think this issue can be closed yet. First, the issue is not a performance but reliability. I probably made bad choice when I marked this as performance. When you call mmap.mmap() in one thread, the entire process hangs for an hour because the file descriptor is on a non-responsive NFS server. With the fix, only the thread accessing the file descriptor is affected. The rest of the system can function normally. Second, the issue affects python 2.7, which is the production version on many servers, and will be for many years e.g. on RHEL/CentOS 7. I think it is important to fix this issue for these users. Here is examples of the issue using reproducer scripts I uploaded to the bug. When mmap.mmap block, the entire process hangs. I unblocked the process from another shell by removing the iptables rule. # python bpo-33021/mmap_nfs_test.py mnt dumbo.tlv.redhat.com 2018-03-17 01:17:57,846 - (MainThread) - Starting canary thread 2018-03-17 01:17:57,846 - (Canary) - Blocking access to storage 2018-03-17 01:17:57,857 - (Canary) - If this test is hang, please run: iptables -D OUTPUT -p tcp -d dumbo.tlv.redhat.com --dport 2049 -j DROP 2018-03-17 01:17:57,857 - (Canary) - check 0 2018-03-17 01:17:58,858 - (Canary) - check 1 2018-03-17 01:17:59,858 - (Canary) - check 2 2018-03-17 01πŸ”ž00,859 - (Canary) - check 3 2018-03-17 01πŸ”ž01,859 - (Canary) - check 4 2018-03-17 01πŸ”ž02,859 - (Canary) - check 5 2018-03-17 01πŸ”ž03,860 - (Canary) - check 6 2018-03-17 01πŸ”ž04,860 - (Canary) - check 7 2018-03-17 01πŸ”ž05,861 - (Canary) - check 8 2018-03-17 01πŸ”ž06,861 - (Canary) - check 9 2018-03-17 01πŸ”ž07,862 - (Canary) - check 10 2018-03-17 01πŸ”ž07,868 - (MainThread) - Calling mmap.mmap (I remove the iptables rule here) 2018-03-17 01πŸ”ž57,683 - (MainThread) - OK 2018-03-17 01πŸ”ž57,683 - (MainThread) - Done 2018-03-17 01πŸ”ž57,683 - (Canary) - check 11 When mmapobject.size() was called, the entire process was hang. I unblocked the process from another shell by removing the iptables rule. # python bpo-33021/mmap_size_nfs_test.py mnt dumbo.tlv.redhat.com 2018-03-17 01:22:17,991 - (MainThread) - Starting canary thread 2018-03-17 01:22:17,992 - (Canary) - Blocking access to storage 2018-03-17 01:22:18,001 - (Canary) - If this test is hang, please run: iptables -D OUTPUT -p tcp -d dumbo.tlv.redhat.com --dport 2049 -j DROP 2018-03-17 01:22:18,001 - (Canary) - check 0 2018-03-17 01:22:19,002 - (Canary) - check 1 2018-03-17 01:22:20,002 - (Canary) - check 2 2018-03-17 01:22:21,002 - (Canary) - check 3 2018-03-17 01:22:22,003 - (Canary) - check 4 2018-03-17 01:22:23,003 - (Canary) - check 5 2018-03-17 01:22:24,004 - (Canary) - check 6 2018-03-17 01:22:25,004 - (Canary) - check 7 2018-03-17 01:22:26,004 - (Canary) - check 8 2018-03-17 01:22:27,005 - (Canary) - check 9 2018-03-17 01:22:28,005 - (MainThread) - Calling mmapobject.size (I removed the ipatables rule here) 2018-03-17 01:23:38,701 - (MainThread) - OK 2018-03-17 01:23:38,701 - (MainThread) - Done 2018-03-17 01:23:38,701 - (Canary) - check 10 I found that os.fdopen issue does not affect RHEL/CentOS 7, because they use python 2.7.5, and the issue was introduced in python 2.7.7, in: commit 5c863bf93809cefeb4469512eadac291b7046051 Author: Benjamin Peterson <benjamin@python.org> Date: Mon Apr 14 19:45:46 2014 -0400 when an exception is raised in fdopen, never close the fd (changing on my mind on #21191) This issue affects Fedora (python 2.7.14) and probably other distros using latest python 2.7. Here is example run show how this affects Fedora 27: # python fdopen_nfs_test.py mnt dumbo.tlv.redhat.com 2018-03-17 01:43:52,718 - (MainThread) - Starting canary thread 2018-03-17 01:43:52,718 - (Canary) - Blocking access to storage 2018-03-17 01:43:52,823 - (Canary) - If this test is hang, please run: iptables -D OUTPUT -p tcp -d dumbo.tlv.redhat.com --dport 2049 -j DROP 2018-03-17 01:43:52,824 - (Canary) - check 0 2018-03-17 01:43:53,824 - (Canary) - check 1 2018-03-17 01:43:54,824 - (Canary) - check 2 2018-03-17 01:43:55,825 - (Canary) - check 3 2018-03-17 01:43:56,825 - (Canary) - check 4 2018-03-17 01:43:57,825 - (Canary) - check 5 2018-03-17 01:43:58,826 - (Canary) - check 6 2018-03-17 01:43:59,826 - (Canary) - check 7 2018-03-17 01:44:00,826 - (Canary) - check 8 2018-03-17 01:44:01,827 - (Canary) - check 9 2018-03-17 01:44:02,827 - (Canary) - check 10 2018-03-17 01:44:02,834 - (MainThread) - Calling os.fdopen (remove iptbales rule, and force-unmount here) 2018-03-17 01:50:25,853 - (MainThread) - OK 2018-03-17 01:50:25,854 - (Canary) - check 11 2018-03-17 01:50:25,895 - (MainThread) - Done Traceback (most recent call last): File "fdopen_nfs_test.py", line 75, in os.unlink(filename) OSError: [Errno 2] No such file or directory: 'mnt/test' So, I think we should: - backport to 3.7, 3.6 - reconsider backport to 2.7, at least for mmap and os.fdopen. I can prepare the backports and split the 2.7 patch if this helps.
msg313981 - (view) Author: Nir Soffer (nirs) * Date: 2018-03-16 23:58
Attaching reproducer for mmapobject.size()
msg313982 - (view) Author: Nir Soffer (nirs) * Date: 2018-03-16 23:58
Attaching reproducer for os.fdopen()
msg314024 - (view) Author: Alyssa Coghlan (ncoghlan) * (Python committer) Date: 2018-03-18 06:04
Regarding 2.7: if folks want this fixed on RHEL/CentOS, then they need to talk to Red Hat about it, not the upstream CPython devs. I used to work there, and was told multiple times by Red Hat executives that none of their customers actually used Python, so the Python ecosystem wasn't of any strategic interest to them, and hence zero funding was available for full-time upstream CPython maintenance econtributions. As such, I no longer consider it acceptable for anyone to ask community volunteers to compensate for Red Hat's abject negligence and executive incompetence.
msg314083 - (view) Author: Robert Collins (rbcollins) * (Python committer) Date: 2018-03-19 08:25
Re: backporting this. I think backporting to 3.6/3.7 makes a lot of sense - bugfix and prerelease respectively. For 2.7, this bug has been around for ages, the patch is small, and I have no objection - but the RM has already said no, so I guess not happening :). That said, if I was analyzing this from scratch I'd look at the pypi download stats to assess what footprint 2.7 still has, and whether taking this fix there would make objective sense. While it is a genuine bug - and a nice catch - I have to say that running any Python with mmapped data off of NFS is a supremely bad idea :).
msg314158 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2018-03-20 18:48
Reopening for 3.x backports, as per Robert's advice.
msg314159 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2018-03-20 19:16
New changeset 56cce1ca84344f519f7ed01024434c083031d354 by Antoine Pitrou (Miss Islington (bot)) in branch '3.7': [3.7] bpo-33021: Release the GIL during fstat() calls (GH-6019) (GH-6159) https://github.com/python/cpython/commit/56cce1ca84344f519f7ed01024434c083031d354
msg314161 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2018-03-20 19:40
New changeset f3e6eadbcf4f3e0fe53f4784485b1c8464c7d282 by Antoine Pitrou (Miss Islington (bot)) in branch '3.6': [3.6] bpo-33021: Release the GIL during fstat() calls (GH-6019) (GH-6160) https://github.com/python/cpython/commit/f3e6eadbcf4f3e0fe53f4784485b1c8464c7d282
History
Date User Action Args
2022-04-11 14:58:58 admin set github: 77202
2018-03-20 19:40:42 pitrou set status: open -> closedstage: patch review -> resolved
2018-03-20 19:40:20 pitrou set messages: +
2018-03-20 19:16:33 pitrou set messages: +
2018-03-20 18:51:43 miss-islington set pull_requests: + <pull%5Frequest5917>
2018-03-20 18:51:07 miss-islington set stage: resolved -> patch reviewpull_requests: + <pull%5Frequest5916>
2018-03-20 18:48:09 pitrou set status: closed -> openmessages: + versions: + Python 3.6, Python 3.7
2018-03-19 08:25:55 rbcollins set nosy: + rbcollinsmessages: +
2018-03-18 06:04:33 ncoghlan set messages: +
2018-03-16 23:58:46 nirs set files: + fdopen_nfs_test.pymessages: +
2018-03-16 23:58:26 nirs set files: + mmap_size_nfs_test.pymessages: +
2018-03-16 23:57:41 nirs set files: + mmap_nfs_test.pymessages: +
2018-03-14 20:10:30 pitrou set status: open -> closedversions: - Python 2.7, Python 3.6, Python 3.7messages: + resolution: fixedstage: patch review -> resolved
2018-03-14 20:08:10 pitrou set messages: +
2018-03-14 19:09:47 ZackerySpytz set nosy: + ZackerySpytzmessages: +
2018-03-14 19:07:24 ZackerySpytz set pull_requests: + <pull%5Frequest5880>
2018-03-11 23:39:24 pitrou set nosy: + pitroumessages: +
2018-03-08 11:28:39 nirs set messages: +
2018-03-08 03:01:58 pmpp set messages: +
2018-03-08 02:24:30 josh.r set nosy: + josh.rmessages: +
2018-03-07 23:21:46 pmpp set nosy: + pmppmessages: +
2018-03-07 23:14:13 nirs set pull_requests: + <pull%5Frequest5787>
2018-03-07 23:07:33 nirs set keywords: + patchstage: patch reviewpull_requests: + <pull%5Frequest5786>
2018-03-07 23:03:24 nirs create