Issue 14417: dict RuntimeError workaround (original) (raw)

Issue14417

process

Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Arfrever, Jim.Jewett, asvetlov, gregory.p.smith, gvanrossum, ncoghlan, pitrou, python-dev, r.david.murray, rhettinger, skrah, vstinner
Priority: normal Keywords: patch

Created on 2012-03-26 18:16 by Jim.Jewett, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
hammer_dict.py ncoghlan,2012-03-31 17:00 Attempt to trigger the 3.3 dict RuntimeError just with threaded code
hammer_dict_eq.py skrah,2012-03-31 21:34
hammer_dict_switchinterval.py skrah,2012-04-01 12:59
dictobject.c.patch gvanrossum,2012-04-01 23:51 review
Messages (31)
msg156843 - (view) Author: Jim Jewett (Jim.Jewett) * (Python triager) Date: 2012-03-26 18:16
Per the 3.3 WhatsNew: """issue 14205: A dict lookup now raises a RuntimeError if the dict is modified during the lookup. If you implement your own comparison function for objects used as dict keys and the dict is shared by multiple threads, access to the dict should be protected by a lock.""" This should be easier to do. My suggestion would be that the change to (general) lookdict (or possibly an additional transition, if you want a less-slow path for non-string non-container builtins) create the lock, and acquire/release that lock. At a minimum, there should be a dict subclass that does this. [Note that this is arguably a regression, since previous python versions would just retry, which would be enough to protect innocent but unlucky code.]
msg156846 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-03-26 18:35
I must admit to being concerned by the possible impact of this change as well.
msg157061 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-03-29 16:12
> I must admit to being concerned by the possible impact of this change as well. So am I.
msg157080 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2012-03-29 18:39
On Thu, Mar 29, 2012 at 9:12 AM, Antoine Pitrou <report@bugs.python.org> wrote: > > Antoine Pitrou <pitrou@free.fr> added the comment: > >> I must admit to being concerned by the possible impact of this change as well. > > So am I. I think it's time to bring this up in python-dev .
msg157207 - (view) Author: Alyssa Coghlan (ncoghlan) * (Python committer) Date: 2012-03-31 17:00
Attached script is a first cut at a multi-threading stress test for the new dict behaviour. It should be significantly worse than anything a real world app is likely to be doing to a dictionary that is shared between threads without any form of synchronisation. It's late here though, so it's very possible it's only passing because I missed something in the implementation of the stress test.
msg157212 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-03-31 17:20
Thanks, Nick, this is much better than speculation.
msg157229 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2012-03-31 21:34
By adding a slow __eq__() method to Nick's script I can trigger the RuntimeError reliably.
msg157254 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2012-04-01 04:00
But the sleep(0.1) forces a thread switch so I consider that still cheating -- nobody in their right mind would consider calling sleep() inside __hash__. You can probably get it to fail without this by just doing a bunch of random computation in the __hash__ (and using a low sys.setswitchinterval() value). Or consider creating a key that consists of e.g. a tuple of 100 or 1000 Key instances -- hashing that will call the __hash__ on each of those, wasting a lot of cycles.
msg157270 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-04-01 10:10
> But the sleep(0.1) forces a thread switch so I consider that still > cheating -- nobody in their right mind would consider calling sleep() > inside __hash__. Well, cheating is fair game when trying to test borderline cases, isn't it?
msg157283 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2012-04-01 12:59
OK, here's a version with a low switch interval. Of course it's also contrived, but it works. Generally I'd appreciate the RuntimeError, since it's a hint that something needs to be rewritten in an application. It might be a problem if this started to occur sporadically in a production app under a high system load. But I haven't managed to trigger the exception just by increasing the system load. I always had to use the same low switch interval.
msg157285 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-04-01 13:25
> OK, here's a version with a low switch interval. Of course it's also > contrived, but it works. The drawback of using setswitchinterval() is that it makes the test less reusable by other implementations (or perhaps it will succeed without actually checking the desired behaviour).
msg157313 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-04-01 16:27
Antoine: I don't think the point of this code is to come up with a unit (or other) test for the behavior, but to try to determine empirically whether or not this error is likely to be an issue in naive production code (whether it is existing 3.x code or stuff ported from Python2). Thus the mention of "cheating" (doing things production code would not be doing). The answer so far appears to be "no", which is good. Which in the context of this particular issue then raises the question of whether there is in fact any additional support beyond the normal threading lock facilities that we want to provide in the stdlib. And the answer to that is thus probably no as well, since code likely to run into the error is also likely to need locking around the dict in question *anyway*. Which is what Guido's intuition was to begin with.
msg157315 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-04-01 16:39
> Antoine: I don't think the point of this code is to come up with a > unit (or other) test for the behavior, but to try to determine > empirically whether or not this error is likely to be an issue in > naive production code (whether it is existing 3.x code or stuff ported > from Python2). Thus the mention of "cheating" (doing things production > code would not be doing). > > The answer so far appears to be "no", which is good. I find this a bit lacking. Production code is used in all kinds of settings that we didn't simulate here. Besides, a very sporadic bug is no better than an easily reproduced one. The tracker already has its share of people pointing at weird sporadic errors in their log files. > And the answer to that is thus probably no as well, since code likely > to run into the error is also likely to need locking around the dict > in question *anyway*. Depends on the application really.
msg157330 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2012-04-01 20:58
> Well, cheating is fair game when trying to test > borderline cases, isn't it? It is fair game (and necessary) when it comes to exposing bugs that are hard to reproduce. I'm wary of the original RuntimeError patch because * it modifies code that has been stable for over a decade * in order to "fix" a crasher that no one cares about * while possibly breaking code that currently works * and breaking it in a way that is hard to reproduce or diagnose.
msg157333 - (view) Author: Alyssa Coghlan (ncoghlan) * (Python committer) Date: 2012-04-01 23:01
A thought prompted by Raymond's comment: did we ever try just protecting the retry as a recursive call? If we can stop the stack blowing up, it seems to me we'd get the best of both worlds (that is, crashes become RuntimeError, but naive multi-threaded code is unaffected).
msg157335 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2012-04-01 23:44
On Sun, Apr 1, 2012 at 1:58 PM, Raymond Hettinger <report@bugs.python.org> wrote: [...] > I'm wary of the original RuntimeError patch because [...] I had retorts to most of what you wrote, but decided not to post them. Instead, I really want to have a serious discussion about this issue, without falling back on "if it ain't broke don't fix it". There was definitely *something* wrong with the original code: there was a known crasher and it had XXX comments in it, and some folks cared enough to attempt to fix it. In the end I see this more as a matter of clarifying the semantics and limitations of dict operations: *should* the language guarantee that the built-in dict implementation supports correct lookup even if the in the middle of the lookup the hash table is resized by another thread? Is this a reasonable guarantee to impose on all Python implementations? Because I don't see how this can be guaranteed without a lock; but I haven't tried very hard, so this is not a rhetorical question. - - - On Sun, Apr 1, 2012 at 4:01 PM, Nick Coghlan <report@bugs.python.org> wrote: > A thought prompted by Raymond's comment: did we ever try just protecting > the retry as a recursive call? If we can stop the stack blowing up, it > seems to me we'd get the best of both worlds (that is, crashes become > RuntimeError, but naive multi-threaded code is unaffected). Hm... We considered a variety of fixes in http://bugs.python.org/issue14205 but I don't think we tried that. (The nastiness is that it's strictly a C-level recursion -- lookdict() calls lookdict().) Another thing we didn't try was manually eliminating the tail recursion in lookdict() using a goto (and perhaps some variable assignments). I don't see the potential for an infinite loop as a problem, since one could be created just as easily by putting "while True: pass" inside a user-defined __hash__(). PS. I'm not surprised your originall hammer_dict.py didn't provoke the problem -- it only overrides __hash__(), but lookdict() never calls that. It only calls the compare function; the hash is computed once and passed into it. Finally, the guard in lookdict() looks fishy to me: if (ep0 == mp->ma_table && ep->me_key == startkey) { Can't this evaluate to true when the dict has shrunk dramatically in size? I think an extra test for "mask == mp->ma_mask" ought to be added.
msg157337 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2012-04-01 23:51
Here's a hack that uses goto instead of recursion to restore the original behavior, less the stack overflow. With this, hammer_dict_switchinterval.py loops forever (which is I think what it's supposed to do if RuntimeError is never raised).
msg157340 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2012-04-02 01:02
Why delete that? On Sunday, April 1, 2012, Raymond Hettinger wrote: > > Changes by Raymond Hettinger <raymond.hettinger@gmail.com javascript:;>: > > > ---------- > Removed message: http://bugs.python.org/msg157338 > > _______________________________________ > Python tracker <report@bugs.python.org javascript:;> > <http://bugs.python.org/issue14417> > _______________________________________ >
msg157562 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-04-05 11:25
A counter can be added to dictobject.c.patch to avoid an infinite loop.
msg157604 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2012-04-05 16:50
> A counter can be added to dictobject.c.patch to avoid an infinite loop. But why bother? There was no counter originally -- it would continue until it ran out of stack. Since this can only be triggered if there is Python code in the __eq__, that means that it is still interruptable with ^C. Does this mean I should just check it in? But I asked, and never got an answer, whether the original stress test had been converted into a unittest. I'd like that to happen before I check this in. Also there are probably docs I've missed. Somebody please help!
msg157605 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-04-05 16:59
> Does this mean I should just check it in? But I asked, and never got > an answer, whether the original stress test had been converted into a > unittest. I'd like that to happen before I check this in. Also there > are probably docs I've missed. Somebody please help! I think your approach is fine. As far as I can tell, the original crasher hasn't been converted into an unit test, since Victor's test in 934aaf2191d0 doesn't seem prone to triggering an infinite loop. As far as doc changes go, you should probably remove the what's new entry from a428f85de29c, and the doc addition from 0255bafbccf2. Besides, does the test suite pass with your patch? It looks like Victor's test, which checks that RuntimeError is raised, should now fail.
msg157616 - (view) Author: Jim Jewett (Jim.Jewett) * (Python triager) Date: 2012-04-05 20:17
>> A counter can be added to dictobject.c.patch to avoid >> an infinite loop. > But why bother? Without a termination condition, how is this different from just reverting the original change? Is it just the slightly improvement in efficiency?
msg157617 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-04-05 20:20
> Without a termination condition, how is this different from just > reverting the original change? The difference is that it's a (presumably interruptible) infinite loop, not a stack overflow. There's no crash and therefore no security issue.
msg157629 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-04-05 22:39
> The difference is that it's a (presumably interruptible) infinite loop, > not a stack overflow. There's no crash and therefore no security issue. Ok, it looks fair. The patch looks good, but as written by Antoine: tests may need to be fixed. Is anyone motivated to "convert" removed crashers to unit tests? Not me. Crashers were only written to crash Python. New tests should be written instead.
msg159523 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2012-04-28 14:20
Still no progress on this bug. Should I just check in my simple patch? But there's much more to do -- docs, and unittests. Volunteers? It's not hard, just work.
msg159524 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-04-28 14:32
> Still no progress on this bug. Should I just check in my simple patch? > But there's much more to do -- docs, and unittests. Volunteers? It's > not hard, just work. Well, in general the person writing the patch should also write the tests ;-) I have this bug in my bookmarks somewhere, so I might come to it and write a patch if nobody does it before, though.
msg159609 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2012-04-29 14:30
I could check it in, but I probably would mess up something (which branches are affected?). Let me know if you want me to. The priorities after that would be: 1) update docs (the warning about RuntimeError needs to be moderated) 2) convert stress tests to unittests
msg159612 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2012-04-29 14:33
Actually my patch doesn't even apply cleanly. I suspect the dict refactoring for shared keys interfered. Someone please help!
msg160544 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2012-05-13 18:50
New changeset 93748e2d64e3 by Antoine Pitrou in branch 'default': Issue #14417: Mutating a dict during lookup now restarts the lookup instead of raising a RuntimeError (undoes issue #14205). http://hg.python.org/cpython/rev/93748e2d64e3
msg160546 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-05-13 18:54
I have committed an updated patch, with a fix to Victor's test. I don't think anything else remains to be done.
msg160575 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2012-05-13 20:49
Thanks Antoine!
History
Date User Action Args
2022-04-11 14:57:28 admin set github: 58625
2012-05-13 20:49:02 gvanrossum set messages: +
2012-05-13 18:54:52 pitrou set status: open -> closedresolution: fixedmessages: + stage: resolved
2012-05-13 18:50:18 python-dev set nosy: + python-devmessages: +
2012-04-29 14:33:05 gvanrossum set messages: +
2012-04-29 14:30:15 gvanrossum set messages: +
2012-04-28 14:32:15 pitrou set messages: +
2012-04-28 14:20:36 gvanrossum set messages: +
2012-04-05 22:39:08 vstinner set messages: +
2012-04-05 20:20:23 pitrou set messages: +
2012-04-05 20:17:46 Jim.Jewett set messages: +
2012-04-05 16:59:25 pitrou set messages: +
2012-04-05 16:50:06 gvanrossum set messages: +
2012-04-05 11:25:02 vstinner set messages: +
2012-04-02 01:02:37 gvanrossum set messages: +
2012-04-01 23:58:29 rhettinger set messages: -
2012-04-01 23:56:54 rhettinger set messages: +
2012-04-01 23:51:27 gvanrossum set files: + dictobject.c.patchkeywords: + patchmessages: +
2012-04-01 23:44:30 gvanrossum set messages: +
2012-04-01 23:01:12 ncoghlan set messages: +
2012-04-01 20:58:10 rhettinger set nosy: + rhettingermessages: +
2012-04-01 16:39:06 pitrou set messages: +
2012-04-01 16:27:04 r.david.murray set messages: +
2012-04-01 13:25:40 pitrou set messages: +
2012-04-01 12:59:57 skrah set files: + hammer_dict_switchinterval.pymessages: +
2012-04-01 10:10:08 pitrou set messages: +
2012-04-01 04:00:01 gvanrossum set messages: +
2012-03-31 21:34:11 skrah set files: + hammer_dict_eq.pynosy: + skrahmessages: +
2012-03-31 20:40:27 Arfrever set nosy: + Arfrever
2012-03-31 17:20:57 r.david.murray set messages: +
2012-03-31 17:00:06 ncoghlan set files: + hammer_dict.pynosy: + ncoghlanmessages: +
2012-03-30 22:58:08 gregory.p.smith set nosy: + gregory.p.smith
2012-03-29 20:06:16 asvetlov set nosy: + asvetlov
2012-03-29 18:39:30 gvanrossum set messages: +
2012-03-29 16:12:11 pitrou set nosy: + pitroumessages: +
2012-03-27 00:56:25 vstinner set nosy: + vstinner
2012-03-26 19:52:06 vstinner set nosy: + gvanrossum
2012-03-26 18:35:48 r.david.murray set nosy: + r.david.murraymessages: +
2012-03-26 18:16:51 Jim.Jewett create