msg156843 - (view) |
Author: Jim Jewett (Jim.Jewett) *  |
Date: 2012-03-26 18:16 |
Per the 3.3 WhatsNew: """issue 14205: A dict lookup now raises a RuntimeError if the dict is modified during the lookup. If you implement your own comparison function for objects used as dict keys and the dict is shared by multiple threads, access to the dict should be protected by a lock.""" This should be easier to do. My suggestion would be that the change to (general) lookdict (or possibly an additional transition, if you want a less-slow path for non-string non-container builtins) create the lock, and acquire/release that lock. At a minimum, there should be a dict subclass that does this. [Note that this is arguably a regression, since previous python versions would just retry, which would be enough to protect innocent but unlucky code.] |
|
|
msg156846 - (view) |
Author: R. David Murray (r.david.murray) *  |
Date: 2012-03-26 18:35 |
I must admit to being concerned by the possible impact of this change as well. |
|
|
msg157061 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2012-03-29 16:12 |
> I must admit to being concerned by the possible impact of this change as well. So am I. |
|
|
msg157080 - (view) |
Author: Guido van Rossum (gvanrossum) *  |
Date: 2012-03-29 18:39 |
On Thu, Mar 29, 2012 at 9:12 AM, Antoine Pitrou <report@bugs.python.org> wrote: > > Antoine Pitrou <pitrou@free.fr> added the comment: > >> I must admit to being concerned by the possible impact of this change as well. > > So am I. I think it's time to bring this up in python-dev . |
|
|
msg157207 - (view) |
Author: Alyssa Coghlan (ncoghlan) *  |
Date: 2012-03-31 17:00 |
Attached script is a first cut at a multi-threading stress test for the new dict behaviour. It should be significantly worse than anything a real world app is likely to be doing to a dictionary that is shared between threads without any form of synchronisation. It's late here though, so it's very possible it's only passing because I missed something in the implementation of the stress test. |
|
|
msg157212 - (view) |
Author: R. David Murray (r.david.murray) *  |
Date: 2012-03-31 17:20 |
Thanks, Nick, this is much better than speculation. |
|
|
msg157229 - (view) |
Author: Stefan Krah (skrah) *  |
Date: 2012-03-31 21:34 |
By adding a slow __eq__() method to Nick's script I can trigger the RuntimeError reliably. |
|
|
msg157254 - (view) |
Author: Guido van Rossum (gvanrossum) *  |
Date: 2012-04-01 04:00 |
But the sleep(0.1) forces a thread switch so I consider that still cheating -- nobody in their right mind would consider calling sleep() inside __hash__. You can probably get it to fail without this by just doing a bunch of random computation in the __hash__ (and using a low sys.setswitchinterval() value). Or consider creating a key that consists of e.g. a tuple of 100 or 1000 Key instances -- hashing that will call the __hash__ on each of those, wasting a lot of cycles. |
|
|
msg157270 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2012-04-01 10:10 |
> But the sleep(0.1) forces a thread switch so I consider that still > cheating -- nobody in their right mind would consider calling sleep() > inside __hash__. Well, cheating is fair game when trying to test borderline cases, isn't it? |
|
|
msg157283 - (view) |
Author: Stefan Krah (skrah) *  |
Date: 2012-04-01 12:59 |
OK, here's a version with a low switch interval. Of course it's also contrived, but it works. Generally I'd appreciate the RuntimeError, since it's a hint that something needs to be rewritten in an application. It might be a problem if this started to occur sporadically in a production app under a high system load. But I haven't managed to trigger the exception just by increasing the system load. I always had to use the same low switch interval. |
|
|
msg157285 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2012-04-01 13:25 |
> OK, here's a version with a low switch interval. Of course it's also > contrived, but it works. The drawback of using setswitchinterval() is that it makes the test less reusable by other implementations (or perhaps it will succeed without actually checking the desired behaviour). |
|
|
msg157313 - (view) |
Author: R. David Murray (r.david.murray) *  |
Date: 2012-04-01 16:27 |
Antoine: I don't think the point of this code is to come up with a unit (or other) test for the behavior, but to try to determine empirically whether or not this error is likely to be an issue in naive production code (whether it is existing 3.x code or stuff ported from Python2). Thus the mention of "cheating" (doing things production code would not be doing). The answer so far appears to be "no", which is good. Which in the context of this particular issue then raises the question of whether there is in fact any additional support beyond the normal threading lock facilities that we want to provide in the stdlib. And the answer to that is thus probably no as well, since code likely to run into the error is also likely to need locking around the dict in question *anyway*. Which is what Guido's intuition was to begin with. |
|
|
msg157315 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2012-04-01 16:39 |
> Antoine: I don't think the point of this code is to come up with a > unit (or other) test for the behavior, but to try to determine > empirically whether or not this error is likely to be an issue in > naive production code (whether it is existing 3.x code or stuff ported > from Python2). Thus the mention of "cheating" (doing things production > code would not be doing). > > The answer so far appears to be "no", which is good. I find this a bit lacking. Production code is used in all kinds of settings that we didn't simulate here. Besides, a very sporadic bug is no better than an easily reproduced one. The tracker already has its share of people pointing at weird sporadic errors in their log files. > And the answer to that is thus probably no as well, since code likely > to run into the error is also likely to need locking around the dict > in question *anyway*. Depends on the application really. |
|
|
msg157330 - (view) |
Author: Raymond Hettinger (rhettinger) *  |
Date: 2012-04-01 20:58 |
> Well, cheating is fair game when trying to test > borderline cases, isn't it? It is fair game (and necessary) when it comes to exposing bugs that are hard to reproduce. I'm wary of the original RuntimeError patch because * it modifies code that has been stable for over a decade * in order to "fix" a crasher that no one cares about * while possibly breaking code that currently works * and breaking it in a way that is hard to reproduce or diagnose. |
|
|
msg157333 - (view) |
Author: Alyssa Coghlan (ncoghlan) *  |
Date: 2012-04-01 23:01 |
A thought prompted by Raymond's comment: did we ever try just protecting the retry as a recursive call? If we can stop the stack blowing up, it seems to me we'd get the best of both worlds (that is, crashes become RuntimeError, but naive multi-threaded code is unaffected). |
|
|
msg157335 - (view) |
Author: Guido van Rossum (gvanrossum) *  |
Date: 2012-04-01 23:44 |
On Sun, Apr 1, 2012 at 1:58 PM, Raymond Hettinger <report@bugs.python.org> wrote: [...] > I'm wary of the original RuntimeError patch because [...] I had retorts to most of what you wrote, but decided not to post them. Instead, I really want to have a serious discussion about this issue, without falling back on "if it ain't broke don't fix it". There was definitely *something* wrong with the original code: there was a known crasher and it had XXX comments in it, and some folks cared enough to attempt to fix it. In the end I see this more as a matter of clarifying the semantics and limitations of dict operations: *should* the language guarantee that the built-in dict implementation supports correct lookup even if the in the middle of the lookup the hash table is resized by another thread? Is this a reasonable guarantee to impose on all Python implementations? Because I don't see how this can be guaranteed without a lock; but I haven't tried very hard, so this is not a rhetorical question. - - - On Sun, Apr 1, 2012 at 4:01 PM, Nick Coghlan <report@bugs.python.org> wrote: > A thought prompted by Raymond's comment: did we ever try just protecting > the retry as a recursive call? If we can stop the stack blowing up, it > seems to me we'd get the best of both worlds (that is, crashes become > RuntimeError, but naive multi-threaded code is unaffected). Hm... We considered a variety of fixes in http://bugs.python.org/issue14205 but I don't think we tried that. (The nastiness is that it's strictly a C-level recursion -- lookdict() calls lookdict().) Another thing we didn't try was manually eliminating the tail recursion in lookdict() using a goto (and perhaps some variable assignments). I don't see the potential for an infinite loop as a problem, since one could be created just as easily by putting "while True: pass" inside a user-defined __hash__(). PS. I'm not surprised your originall hammer_dict.py didn't provoke the problem -- it only overrides __hash__(), but lookdict() never calls that. It only calls the compare function; the hash is computed once and passed into it. Finally, the guard in lookdict() looks fishy to me: if (ep0 == mp->ma_table && ep->me_key == startkey) { Can't this evaluate to true when the dict has shrunk dramatically in size? I think an extra test for "mask == mp->ma_mask" ought to be added. |
|
|
msg157337 - (view) |
Author: Guido van Rossum (gvanrossum) *  |
Date: 2012-04-01 23:51 |
Here's a hack that uses goto instead of recursion to restore the original behavior, less the stack overflow. With this, hammer_dict_switchinterval.py loops forever (which is I think what it's supposed to do if RuntimeError is never raised). |
|
|
msg157340 - (view) |
Author: Guido van Rossum (gvanrossum) *  |
Date: 2012-04-02 01:02 |
Why delete that? On Sunday, April 1, 2012, Raymond Hettinger wrote: > > Changes by Raymond Hettinger <raymond.hettinger@gmail.com javascript:;>: > > > ---------- > Removed message: http://bugs.python.org/msg157338 > > _______________________________________ > Python tracker <report@bugs.python.org javascript:;> > <http://bugs.python.org/issue14417> > _______________________________________ > |
|
|
msg157562 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2012-04-05 11:25 |
A counter can be added to dictobject.c.patch to avoid an infinite loop. |
|
|
msg157604 - (view) |
Author: Guido van Rossum (gvanrossum) *  |
Date: 2012-04-05 16:50 |
> A counter can be added to dictobject.c.patch to avoid an infinite loop. But why bother? There was no counter originally -- it would continue until it ran out of stack. Since this can only be triggered if there is Python code in the __eq__, that means that it is still interruptable with ^C. Does this mean I should just check it in? But I asked, and never got an answer, whether the original stress test had been converted into a unittest. I'd like that to happen before I check this in. Also there are probably docs I've missed. Somebody please help! |
|
|
msg157605 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2012-04-05 16:59 |
> Does this mean I should just check it in? But I asked, and never got > an answer, whether the original stress test had been converted into a > unittest. I'd like that to happen before I check this in. Also there > are probably docs I've missed. Somebody please help! I think your approach is fine. As far as I can tell, the original crasher hasn't been converted into an unit test, since Victor's test in 934aaf2191d0 doesn't seem prone to triggering an infinite loop. As far as doc changes go, you should probably remove the what's new entry from a428f85de29c, and the doc addition from 0255bafbccf2. Besides, does the test suite pass with your patch? It looks like Victor's test, which checks that RuntimeError is raised, should now fail. |
|
|
msg157616 - (view) |
Author: Jim Jewett (Jim.Jewett) *  |
Date: 2012-04-05 20:17 |
>> A counter can be added to dictobject.c.patch to avoid >> an infinite loop. > But why bother? Without a termination condition, how is this different from just reverting the original change? Is it just the slightly improvement in efficiency? |
|
|
msg157617 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2012-04-05 20:20 |
> Without a termination condition, how is this different from just > reverting the original change? The difference is that it's a (presumably interruptible) infinite loop, not a stack overflow. There's no crash and therefore no security issue. |
|
|
msg157629 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2012-04-05 22:39 |
> The difference is that it's a (presumably interruptible) infinite loop, > not a stack overflow. There's no crash and therefore no security issue. Ok, it looks fair. The patch looks good, but as written by Antoine: tests may need to be fixed. Is anyone motivated to "convert" removed crashers to unit tests? Not me. Crashers were only written to crash Python. New tests should be written instead. |
|
|
msg159523 - (view) |
Author: Guido van Rossum (gvanrossum) *  |
Date: 2012-04-28 14:20 |
Still no progress on this bug. Should I just check in my simple patch? But there's much more to do -- docs, and unittests. Volunteers? It's not hard, just work. |
|
|
msg159524 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2012-04-28 14:32 |
> Still no progress on this bug. Should I just check in my simple patch? > But there's much more to do -- docs, and unittests. Volunteers? It's > not hard, just work. Well, in general the person writing the patch should also write the tests ;-) I have this bug in my bookmarks somewhere, so I might come to it and write a patch if nobody does it before, though. |
|
|
msg159609 - (view) |
Author: Guido van Rossum (gvanrossum) *  |
Date: 2012-04-29 14:30 |
I could check it in, but I probably would mess up something (which branches are affected?). Let me know if you want me to. The priorities after that would be: 1) update docs (the warning about RuntimeError needs to be moderated) 2) convert stress tests to unittests |
|
|
msg159612 - (view) |
Author: Guido van Rossum (gvanrossum) *  |
Date: 2012-04-29 14:33 |
Actually my patch doesn't even apply cleanly. I suspect the dict refactoring for shared keys interfered. Someone please help! |
|
|
msg160544 - (view) |
Author: Roundup Robot (python-dev)  |
Date: 2012-05-13 18:50 |
New changeset 93748e2d64e3 by Antoine Pitrou in branch 'default': Issue #14417: Mutating a dict during lookup now restarts the lookup instead of raising a RuntimeError (undoes issue #14205). http://hg.python.org/cpython/rev/93748e2d64e3 |
|
|
msg160546 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2012-05-13 18:54 |
I have committed an updated patch, with a fix to Victor's test. I don't think anything else remains to be done. |
|
|
msg160575 - (view) |
Author: Guido van Rossum (gvanrossum) *  |
Date: 2012-05-13 20:49 |
Thanks Antoine! |
|
|