Issue 10826: pass_fds sometimes fails (original) (raw)

Issue10826

Created on 2011-01-04 14:44 by pitrou, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
spdoors.patch pitrou,2011-01-29 10:48
Messages (10)
msg125326 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-01-04 14:44
This happens selectively (and intermittently) on the Solaris buildbot: test_pass_fds (test.test_subprocess.POSIXProcessTestCase) ... -- maxfd = 256 [36787 refs] -- fds that should have been closed: {5, 6, 7, 8, 9, 10, 11, 12, 13} -- fds that remained open: {0, 1, 2, 3, 5} FAIL test_pass_fds (test.test_subprocess.ProcessTestCasePOSIXPurePython) ... -- maxfd = 256 [36787 refs] -- fds that should have been closed: {5, 6, 7, 8, 9, 10, 11, 12, 13} -- fds that remained open: {0, 1, 2, 3, 5} FAIL As you see (thanks to debug output), all fds have been closed except number 5. (http://www.python.org/dev/buildbot/all/builders/sparc%20solaris10%20gcc%203.x/builds/2466/steps/test/logs/stdio)
msg125329 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-01-04 14:55
Another (perhaps more likely) possibility is that fd 5 was properly closed, but another one created at startup by the child Python interpreter. How could we diagnose that?
msg125358 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-01-04 19:10
Happened also on the OpenIndiana buildbot: http://www.python.org/dev/buildbot/all/builders/x86%20OpenIndiana%203.x/builds/492/steps/test/logs/stdio
msg127060 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-01-25 21:55
So, I added some debug info to test_pass_fds: -- fds that should have been closed: {5, 6, 7, 8, 9, 10, 11, 12, 13} -- fds that remained open: {0, 1, 2, 3, 5} -- debug info: 0 8194 posix.stat_result(st_mode=8592, st_ino=12582920, st_dev=82051072, st_nlink=1, st_uid=1140, st_gid=7, st_size=0, st_atime=1295990436, st_mtime=1295991135, st_ctime=1295990436) 1 2 posix.stat_result(st_mode=4096, st_ino=79119887, st_dev=84148224, st_nlink=0, st_uid=1140, st_gid=1, st_size=0, st_atime=1295991135, st_mtime=1295991135, st_ctime=1295991135) 2 8194 posix.stat_result(st_mode=8592, st_ino=12582920, st_dev=82051072, st_nlink=1, st_uid=1140, st_gid=7, st_size=0, st_atime=1295990436, st_mtime=1295991135, st_ctime=1295990436) 3 2 posix.stat_result(st_mode=4096, st_ino=79119882, st_dev=84148224, st_nlink=0, st_uid=1140, st_gid=1, st_size=0, st_atime=1295991135, st_mtime=1295991135, st_ctime=1295991135) 5 8192 posix.stat_result(st_mode=53540, st_ino=56, st_dev=84410368, st_nlink=1, st_uid=0, st_gid=0, st_size=0, st_atime=1269710532, st_mtime=1269710532, st_ctime=1269710532) FAIL Each line is in the following form: <value returned by F_GETFD> <os.fstat(fd)> Here is my interpretation for fd 5 (which is the offending fd): - a F_GETFD value of 8192 is different from both pipes (which return 2) and standard stdin/stderr (which return 8194); if the values haven't changed between Solaris versions, 8192 is O_LARGEFILE + O_RDONLY - a st_mode of 53540 is stat.S_IFSOCK + stat.S_IFIFO + stat.S_IRUSR + stat.S_IRGRP + stat.S_IROTH; so this seems to be some kind of read-only (filesystem-based?) socket I think it is unlikely that this socket is inherited from the parent process, or present before calling exec(). Instead, perhaps it's the startup of the child Python interpreter (after exec()) which creates such a file descriptor, and doesn't close it. Unfortunately, it seems difficult to diagnose this in more detail.
msg127063 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-01-25 22:51
I managed to get further debug info from the OpenIndiana buildbot: -- maxfd = 65536 -- fds that should have been closed: {5, 6, 7, 8, 9, 10, 11, 12, 13} -- fds that remained open: {0, 1, 2, 3, 5} -- debug info: 0 2 posix.stat_result(st_mode=4096, st_ino=38002789, st_dev=142868480, st_nlink=0, st_uid=101, st_gid=101, st_size=0, st_atime=1295994652, st_mtime=1295994652, st_ctime=1295994652) 1 2 posix.stat_result(st_mode=4096, st_ino=38019207, st_dev=142868480, st_nlink=0, st_uid=101, st_gid=101, st_size=0, st_atime=1295994988, st_mtime=1295994988, st_ctime=1295994988) 2 2 posix.stat_result(st_mode=4096, st_ino=38002791, st_dev=142868480, st_nlink=0, st_uid=101, st_gid=101, st_size=0, st_atime=1295994987, st_mtime=1295994987, st_ctime=1295994987) 3 2 posix.stat_result(st_mode=4096, st_ino=38019202, st_dev=142868480, st_nlink=0, st_uid=101, st_gid=101, st_size=0, st_atime=1295994988, st_mtime=1295994988, st_ctime=1295994988) 5 8192 posix.stat_result(st_mode=53540, st_ino=210, st_dev=146014208, st_nlink=1, st_uid=0, st_gid=0, st_size=0, st_atime=1290335283, st_mtime=1290335283, st_ctime=1290335283) total 2 dr-x------ 2 buildbot buildbot 528 Jan 25 22:36 . dr-x--x--x 5 buildbot buildbot 864 Jan 25 22:36 .. p--------- 0 buildbot buildbot 0 Jan 25 22:30 0 p--------- 0 buildbot buildbot 0 Jan 25 22:36 1 p--------- 0 buildbot buildbot 0 Jan 25 22:36 2 p--------- 0 buildbot buildbot 0 Jan 25 22:36 3 D--------- 1 root root 0 Nov 21 10:28 5 File: `/proc/22816/fd/5' Size: 0 Blocks: 0 IO Block: 0 weird file Device: 8b40000h/146014208d Inode: 210 Links: 1 Access: (0000/D---------) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2010-11-21 10:28:03.659679536 +0000 Modify: 2010-11-21 10:28:03.659679536 +0000 Change: 2010-11-21 10:28:03.659679536 +0000 So, the offending fd (5) points to a "weird file" (!!) created in November 2010... It definitely can't be the same file as pointed to by fd 5 in the parent process, since that is a standard anonymous pipe. This reinforces the idea that starting up the child Python interpreter sometimes creates that fd and doesn't close it. Also, while the sparc solaris buildbot doesn't have the "stat" command, it still displays the following debug output: total 4 dr-x------ 2 buildbot other 528 Jan 25 23:58 . dr-x--x--x 5 buildbot other 832 Jan 25 23:58 .. c--------- 1 buildbot tty 24, 2 Jan 25 23:58 0 p--------- 0 buildbot other 0 Jan 25 23:58 1 c--------- 1 buildbot tty 24, 2 Jan 25 23:58 2 p--------- 0 buildbot other 0 Jan 25 23:58 3 D--------- 1 root root 0 Mar 27 2010 5 ... where fd number 5 seems to be the same kind of "weird file" created by root a long time ago. Martin, does that date (Mar 27 2010) ring a bell? Is it when the system was installed? When the buildslave was started?
msg127098 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2011-01-26 10:41
'D' denotes a door file. There are several of them in the system; the following ones (from /var/run) match the time stamp: Dr--r--r-- 1 daemon daemon 0 Mar 27 2010 kcfd_door Dr--r--r-- 1 root root 0 Mar 27 2010 name_service_door Dr--r--r-- 1 root root 0 Mar 27 2010 picld_door Drw-r--r-- 1 root root 0 Mar 27 2010 syslog_door It's common for library routines to open a door and keep it open for the entire process lifetime. If this was somehow reproducible, I could try to find out what specific door gets opened.
msg127099 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-01-26 10:48
> 'D' denotes a door file. There are several of them in the system; the > following ones (from /var/run) match the time stamp: Thanks for the explanation! > It's common for library routines to open a door and keep it open for > the entire process lifetime. > > If this was somehow reproducible, I could try to find out what > specific door gets opened. test_subprocess fails quite often on your buildbot, so with a bit of patience you can probably reproduce it for sure. If you use "-r --randseed 2463396" it will schedule test_subprocess quite early, so you can stop the test run as soon as that test is finished. Also, debug output has been added in branches/test_subprocess_10826 : http://www.python.org/dev/buildbot/all/builders/sparc%20solaris10%20gcc%203.x/builds/2574/steps/test/logs/stdio
msg127399 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-01-29 10:48
Here is a patch to ignore door files when running the fd_status script. It seems to eradicate the intermittent failures on the solaris buildbots.
msg128035 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2011-02-06 01:05
Your patch makes sense to me. I'll commit it after the 3.2 release for 3.2.1. The elease manager can feel free to commit it earlier if it is bothering an important buildbot.
msg129001 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-02-21 21:59
Committed in r88484 (3.x), r88485 (3.2).
History
Date User Action Args
2022-04-11 14:57:10 admin set github: 55035
2011-02-21 21:59:10 pitrou set status: open -> closednosy:loewis, gregory.p.smith, pitroumessages: + resolution: fixedstage: resolved
2011-02-06 01:05:31 gregory.p.smith set assignee: gregory.p.smithmessages: + nosy:loewis, gregory.p.smith, pitrou
2011-01-29 10:48:53 pitrou set files: + spdoors.patchmessages: + keywords: + patchnosy:loewis, gregory.p.smith, pitrou
2011-01-26 10:48:09 pitrou set nosy:loewis, gregory.p.smith, pitroumessages: +
2011-01-26 10:41:17 loewis set nosy:loewis, gregory.p.smith, pitroumessages: +
2011-01-25 22:51:24 pitrou set nosy: + loewismessages: +
2011-01-25 21:55:43 pitrou set messages: +
2011-01-04 19:10:23 pitrou set messages: +
2011-01-04 14:55:07 pitrou set messages: +
2011-01-04 14:44:49 pitrou create