msg93570 - (view) |
Author: R. David Murray (r.david.murray) *  |
Date: 2009-10-05 01:40 |
test_multiprocessing is producing tracebacks and hanging on py3k trunk. I think this started happening fairly recently, as it seemed to work in one not-too-old checkout until I did an svn up and then it started failing. 3.1 seems fine. I've attached the tracebacks, including the tracebacks I get when I ctrl-C to break the hang. My system is Gentoo Linux. |
|
|
msg93571 - (view) |
Author: R. David Murray (r.david.murray) *  |
Date: 2009-10-05 02:14 |
r75241 seems to be the cause. Tests passes on r75240, but gives the dict error and hang on r75241. |
|
|
msg93601 - (view) |
Author: Jesse Noller (jnoller) *  |
Date: 2009-10-05 13:18 |
Adding gps - I haven't had a chance to grok the changes, and me getting to a fix is going to take a few. |
|
|
msg93605 - (view) |
Author: Jesse Noller (jnoller) *  |
Date: 2009-10-05 15:16 |
It's passing for me on trunk/py3k at least on OS/X - I'm going to need to dredge up a linuxvm |
|
|
msg93815 - (view) |
Author: R. David Murray (r.david.murray) *  |
Date: 2009-10-10 00:55 |
Benjamin tried reverting the weakref patch, but that didn't fix it. Turns out the problem is the other patch in that merge, that adds saving of the exception to AssertRaises. Adding Kristjan as that was his patch. (The tests pass with the weakref patch applied and the unittest patch reverted, by the way.) |
|
|
msg93825 - (view) |
Author: Kristján Valur Jónsson (kristjan.jonsson) *  |
Date: 2009-10-10 11:04 |
Interesting. I don't have a linux machine to debug this. My patch appears harmless enough. We are only storing an exception _object_, not any tracebacks or such. If this were happening on my windows machine I would put in breakpoints where the ignored assertion is being raised and find out what's going on. |
|
|
msg93854 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2009-10-11 12:03 |
It was a bad idea to revert the change, because we may now forget about the problem rather than fix it. It is likely that the "dictionary changed size during iteration" occurs because a GC collection gets triggered during the iteration on the weakdict. A GC collection can destroy some objects in that case because keeping an exception around (which the unittest change does) usually creates reference cycles through the frame's locals. It seems that WeakValueDictionary.items() isn't very robust in py3k ;) We probably need to add list-returning variants, e.g. a listitems() method (and, probably, listkeys() and listvalues() too)? |
|
|
msg93855 - (view) |
Author: Kristján Valur Jónsson (kristjan.jonsson) *  |
Date: 2009-10-11 12:06 |
The unittest only keeps an exception _object_ around, not the associated traceback. There should be no frames and thus reference cycles associated with this. I agree that the problem should be tackled, rather than swept under the carpet like a revert does. |
|
|
msg93856 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2009-10-11 12:14 |
> The unittest only keeps an exception _object_ around, not the associated > traceback. In py3k, the traceback is attached to the exception (on the __traceback__ attribute) ;) |
|
|
msg93857 - (view) |
Author: Benjamin Peterson (benjamin.peterson) *  |
Date: 2009-10-11 12:32 |
I disagree about reverting it. We have this bug report, and that change is sitting on the merge queue again... |
|
|
msg93859 - (view) |
Author: Kristján Valur Jónsson (kristjan.jonsson) *  |
Date: 2009-10-11 14:08 |
I didn't realize that the traceback was attached to the exception object in py3k. This makes the use of such objects more dangerous because of the circular references. The recommended practice of exc_type, exc_obj = sys.exc_info()[:2] is precisely to avoid any "accidents" with the tracebacks. See also http://mail.python.org/pipermail/python-dev/2005- August/055251.html So, can one just clear the __traceback__ member? That would make sense for the unittests, but it would make this manifestation of a GC bug go into hiding again. |
|
|
msg94413 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2009-10-24 12:59 |
Kristján: > So, can one just clear the __traceback__ member? Yes, or use `exc_value.with_traceback(None)`. |
|
|
msg95173 - (view) |
Author: Benjamin Peterson (benjamin.peterson) *  |
Date: 2009-11-13 02:37 |
If this is not ported by the time we get to a 2.7 or 3.2 release, I'd like to back it out. |
|
|
msg95175 - (view) |
Author: Kristján Valur Jónsson (kristjan.jonsson) *  |
Date: 2009-11-13 09:16 |
I can easily port it again, and remove the traceback from the exception. But won't we be sweeping that GC problem under the carpet then? |
|
|
msg95178 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2009-11-13 15:04 |
Please note there's a patch in to allow safer iteration of weak dicts. |
|
|
msg95182 - (view) |
Author: Kristján Valur Jónsson (kristjan.jonsson) *  |
Date: 2009-11-13 16:10 |
Ok, I ported the change again in revision 76238, removing the traceback from the stored exception object. |
|
|
msg95183 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2009-11-13 16:12 |
> Kristján Valur Jónsson <kristjan@ccpgames.com> added the comment: > > Ok, I ported the change again in revision 76238, removing the traceback > from the stored exception object. "versionchanged" should be 3.2, not 3.1. Thanks! |
|
|
msg95184 - (view) |
Author: Kristján Valur Jónsson (kristjan.jonsson) *  |
Date: 2009-11-13 16:16 |
Thanks. Fixed in revision 76239. |
|
|
msg95188 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2009-11-13 17:04 |
I don't know if it's your last change, but test_multiprocessing is now hanging on all buildbots. |
|
|
msg95194 - (view) |
Author: R. David Murray (r.david.murray) *  |
Date: 2009-11-13 17:35 |
It is. TestZZZNumberOfObjects hangs at r76238, and does not hang at r76237 on my Gentoo Linux box. |
|
|
msg95195 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2009-11-13 18:04 |
It seems the error is different, though. This is what I get after Ctrl-C (I omit the tracebacks from the other processes): Traceback (most recent call last): File "/home/antoine/py3k/__svn__/Lib/test/test_multiprocessing.py", line 1071, in test_number_of_objects refs = self.manager._number_of_objects() File "Z:\py3k\__svn__\lib\multiprocessing\managers.py", line 569, in _number_of_objects conn = self._Client(self._address, authkey=self._authkey) File "Z:\py3k\__svn__\lib\multiprocessing\connection.py", line 134, in Client c = SocketClient(address) File "Z:\py3k\__svn__\lib\multiprocessing\connection.py", line 253, in SocketClient s.connect(address) socket.error: [Errno 2] No such file or directory |
|
|
msg95196 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2009-11-13 19:04 |
Ok, the problem is that a finalizer calls _decref which tries to open an XMLRPC connection (!) to an address on which nobody listens anymore. Then stupid SocketClient (in multiprocessing.connection) loops endlessly because it is designed to do so when it gets ECONNREFUSED. (when you Ctrl-C, you don't see all this because it happens in a finalizer, and only a terse message is displayed) So there are two problems: - why does test_multiprocessing think it should open an XMLRPC connection to a closed endpoint - why doesn't SocketClient have some kind of timeout rather than looping stupidly |
|
|
msg95203 - (view) |
Author: Kristján Valur Jónsson (kristjan.jonsson) *  |
Date: 2009-11-13 21:39 |
Yes, this is quite different. After seeing this, I created issue 7314. I didn't realize that it had anything to do with my patch but rather blamed it on my brand new windows 7. |
|
|
msg95204 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2009-11-13 21:41 |
> Yes, this is quite different. > After seeing this, I created issue 7314. > I didn't realize that it had anything to do with my patch but rather > blamed it on my brand new windows 7. I don't think it has anything to do with your patch. While diagnosing the problem, I noticed that adding some random print() lines would sometimes make the problem disappear. So it is really timing-dependent and your patch happens to trigger the right (or wrong ;-)) timing. |
|
|
msg95205 - (view) |
Author: Kristján Valur Jónsson (kristjan.jonsson) *  |
Date: 2009-11-13 21:42 |
See my analysis from issue 7314: The other processes are all waiting to ReadFile(). Maybe they have yet to run to the point where they open the socket to which the master process is trying to connect to. The mystery is also: Why does the harmless act of setting an instance attribute on a context manager (and without any traceback) trigger this behaviour? |
|
|
msg95206 - (view) |
Author: Kristján Valur Jónsson (kristjan.jonsson) *  |
Date: 2009-11-13 21:44 |
Ah, okay then. Let's focus on solving 7414 then. |
|
|
msg95372 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2009-11-17 10:07 |
I don't think this is a blocker anymore. Kristján's patch has been ported and the test_multiprocessing freezes have been "fixed" by adding a timeout. As for the original purpose of this ticket, is still (part of?) the answer. |
|
|
msg203249 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2013-11-18 08:13 |
I never seen this issue, can we close it? #7105 has been fixed in Python 3, not in Python 2.7. |
|
|
msg205226 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2013-12-04 15:09 |
Okay, let's say it is fixed. Adding Richard to nosy so that he can review the issue if he's interested. |
|
|
msg320627 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2018-06-27 21:30 |
> RuntimeError: dictionary changed size during iteration I didn't see this bug the last 5 years, so I consider that the issue has been fixed in the meanwhile :-) |
|
|