The simple code attached causes a deadlock on Linux. The problem is that I have to tweak it slightly, depending on the distro and Python version, to get it to deadlock. On the cluster I use the most (Python 3.6.3, CentOS Linux release 7.4.1708, PyTorch 0.4.0 with no CUDA), the attached code deadlocks as-is.
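For reference, the attached testcase has roughly this shape (this is only a simplified sketch, not the exact attachment; the tensor sizes, pool size, and function names here are placeholders):

import multiprocessing as mp
import torch

def worker(size):
    # Each worker does a bit of tensor arithmetic; in the real testcase the
    # workers sometimes hang inside a multiprocessing wait instead of returning.
    x = torch.randn(size, size)
    return float(x.mm(x).sum())

if __name__ == '__main__':
    # The real testcase also has a for loop outside the multiprocessing
    # portion; oddly, removing that loop makes the deadlock go away.
    for _ in range(10):
        with mp.Pool(processes=4) as pool:
            print(pool.map(worker, [200] * 8))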
IMHO it's an issue with your usage of the torch module, which is not part of the Python stdlib, so I suggest closing this issue as "third party" or "not a bug".
Hi Victor and Yang,

Thanks for your fast replies. I did initially think it could be a torch issue. Indeed, I have an equivalent numpy testcase that does not deadlock. However, the fact that it gets stuck inside a multiprocessing wait makes me think it's still a multiprocessing issue.

I've spent two weeks full time on this issue. Over at the torch forums I've had no replies ( https://discuss.pytorch.org/t/multiprocessing-code-works-using-numpy-but-deadlocked-using-pytorch/20473 ). On Stack Overflow I only got a workaround suggestion that works sporadically ( https://stackoverflow.com/questions/51093970/multiprocessing-code-works-using-numpy-but-deadlocked-using-pytorch ). Basically, I can (sometimes) get rid of the deadlock if I impose only one thread per process (see the sketch after this message), but that is not a real solution.

I have tried stepping through the code, but because it is multi-processed you cannot step through it in the conventional way, since the main thread is not doing the heavy lifting. I've tried adding print statements inside the multiprocessing library and experimenting with it, but debugging multi-processed code this way is a nightmare because you can't even trust the order in which print statements appear on screen. And, probably more relevant, I'm out of my league here.

I'm really at a complete dead end; my work cannot progress without fixing this issue. I'd be very grateful if you could try to reproduce it and rule out the multiprocessing library. If you need help reproducing, I can send a different testcase that deadlocked on my friend's Mac (for him, the original testcase did not deadlock).

The testcase I attached in my original post sometimes deadlocks and sometimes doesn't, depending on the machine I run it on, so I'm not surprised you got no deadlock when you tried to reproduce. I can always get it deadlocking on Linux/Mac, though, by tweaking the code. To give you a sense of how unreliably it deadlocks: just removing the for loop in the code (which is outside the multiprocessing portion of the code!) somehow gets rid of the deadlock. Also, it never deadlocks on Windows.

If you could provide any help on this issue I'd be very grateful.

Regards,
Guillaume
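To be concrete, this is the kind of thing I mean by imposing only one thread per process (again a sketch, not my actual code; the key point is that the environment variables have to be set before numpy/torch create their thread pools):

import os
# Restrict OpenMP/MKL to a single thread per process; this must happen
# before importing numpy/torch so their thread pools start with one thread.
os.environ['OMP_NUM_THREADS'] = '1'
os.environ['MKL_NUM_THREADS'] = '1'

import multiprocessing as mp
import torch

torch.set_num_threads(1)  # also restrict torch's own intra-op thread pool

def worker(size):
    x = torch.randn(size, size)
    return float(x.mm(x).sum())

if __name__ == '__main__':
    with mp.Pool(processes=4) as pool:
        print(pool.map(worker, [200] * 8))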
A friend of mine has suggested a fix that seems to work for now: upgrading numpy from 1.14.3 to 1.14.5. This makes no sense at all, but it does seem to work. I have a strong suspicion that it is just masking the problem and that the deadlock will reappear. However, since it works, I would not want you to waste any more time on this; I will reopen if the deadlock reappears. I do apologize if you already spent a lot of time on this. Regards, Guillaume
History

Date                 User     Action  Args
2022-04-11 14:59:02  admin    set     github: 78240
2018-07-06 18:51:56  gobbedy  set     status: open -> closed; resolution: third party; messages: + ; stage: resolved