Issue 742911: Memory fault on complex weakref/weakkeydict delete (original) (raw)

Created on 2003-05-24 22:29 by mcfletch, last changed 2022-04-10 16:08 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
cache.py	mcfletch,2003-05-24 22:29	Cache implementation file.
test_cache_problem.py	mcfletch,2003-05-24 22:30	Driver script for test-case
test_cache_problem.py	nnorwitz,2003-05-24 23:31	updated test to be minimal
temp2.py	tim.peters,2003-05-28 23:26	an even smaller test case

Messages (15)
msg16138 - (view)	Author: Mike C. Fletcher (mcfletch)	Date: 2003-05-24 22:29
Attached find two modules which together form a test-case. The cache.py file is ripped out of a production system (OpenGLContext), and I am seeing memory faults under both Python 2.2.2 and 2.2.3 when I run the code. Under 2.2.2 while single-stepping through the code I was able to provoke an error-message: Fatal Python error: GC object already in linked list The error message doesn't show up under 2.2.3, but the memory-fault does. Modules here don't use any extension modules, so there shouldn't be any loose memory references or the like. Note, you'll likely need to make weakkeydictionary's __delitem__ use keys instead of iterkeys to even get to the crashing.
msg16139 - (view)	Author: Neal Norwitz (nnorwitz) *	Date: 2003-05-24 23:31
Logged In: YES user_id=33168 I cut out a lot of stuff from the test. The new file is much more minimal, but still crashes for me in a 2.3 debug build. You only need the one file too (not both files). There is an issue with new style classes. If Items doesn't derive from object, I don't get a crash.
msg16140 - (view)	Author: Tim Peters (tim.peters) *	Date: 2003-05-25 02:53
Logged In: YES user_id=31435 Outstanding, Neal -- thanks! I can confirm that this crashes in a current 2.3 debug build on Windows too. I'm feeling sick and won't pursue it now, though. When cleaning up from the call to makeSome() (the body of makeSome() has completed, and we're cleaning up its execution frame, decref'ing the locals), we're dying in _Py_ForgetReference with a NULL-pointer derefernce. The refcount on an Items object has just fallen to 0, and the code is trying to verify that the object is in the debug-build "list of all objects". But its prev and next pointers are both NULL -- it's not in the list, and simply trying to check that it is blows up. I don't have a theory for what's causing this, but it's probably not a good thing .
msg16141 - (view)	Author: Tim Peters (tim.peters) *	Date: 2003-05-25 04:49
Logged In: YES user_id=31435 Assigned to Guido, because I suspect the problem lies in the order subtype_dealloc does things. With reference to Neal's whittled-down case: when makeSome() ends, we decref i and then decref item. item's refcount hits 0 then. There's a weakref remaining to item (in CacheHolder.client), but subtype_dealloc doesn't clear the weakref at this point. First it clears item's instance dict. That contains the last strong reference to i. subtype_dealloc is inovked again, and clears i's instance dict, and then deals with the weak reference to i. The weakref to i has a callback associated with it, and CacheHolder.__call__() is invoked. That invokes self.client (), still a weakref to item, and because item's weakrefs still haven't been dealt with, self.client() returns item. Now we're hosed. item had a refcount of 0 at this point, and is still in the process of getting cleaned out by the first call to subtype_dealloc (it still thinks it's clearing item's instance dict). We already called _Py_ForgetReference on item when its refcount fell to 0. Its refcount gets boosted back to 1 by virtue of item getting returned by the self.client() weakref call. Cleaning out the frame for CacheHolder.__call__() knocks the refcount down to 0 again, and the second attempt to call _Py_ForgetReference on it blows up. In a release build, nothing visibly bad happens when I try it. It crashes if I add print client.items at the end of __call__ in a release-build run, though. Looks like that's just the luck of the draw after things have gone insane. I note that instance_dealloc deals with weakrefs much earlier in the process, so that when Neal makes Items a classic class instead, the eventual call to self.client() returns None (instead of items), and nothing bad happens..
msg16142 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2003-05-28 22:16
Logged In: YES user_id=6380 Tim, let's look at this when you're back in the office. My head spins from just reading the analysis below. Note that this is a 2.2 and 2.3 bug. I don't necessarily want to hold up the 2.2.3 release until this is fixed, unless we have a quick breakthrough.
msg16143 - (view)	Author: Tim Peters (tim.peters) *	Date: 2003-05-28 23:26
Logged In: YES user_id=31435 That's cool. The analysis is much easier to follow if you run Neal's little program, which crashes right away in a debug build. Then the analysis just follows the call stack, and it's instantly obvious (if you know to look for it ) that we're trying to deallocate a single object twice. One thing subtype_delloc does that instance_dealloc doesn't is clear the instance dict before clearing the weakrefs. Clearing the instance dict can end up executing anything, and in this case does, in particular materializing a strong reference to the object we're in the middle of deallocating (via a weakref that hasn't been cleared). That seems to be the key point. Mike found the problem in 2.2.2, I believe in his real-life OpenGL code. That's why I'm keen to see it fixed for 2.2.3. I'll be in the office Thursday, BTW. Ah, here, I'll attach a simpler program that has the same kind of problem (temp2.py).
msg16144 - (view)	Author: Mike C. Fletcher (mcfletch)	Date: 2003-05-29 01:16
Logged In: YES user_id=34901 To give timeline: This is from real-life code (PyOpenGL's OpenGLContext demo/teaching module). It's currently about to go beta, with a release date target of end-of-summer. I've worked around the problem for the most common case w/in the system (exiting from the VRML browser), but that work-around's not usable for end-user long-running projects until the users can deallocate the scenegraph to load another (i.e. go from world to world within a single run of the application).
msg16145 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2003-05-29 14:35
Logged In: YES user_id=6380 Thanks to Tim for the analysis, to Neal for the simplified test, and to Mike for the bug report! The fix was actually simple: clear the weak ref list before calling __del__ or clearing __dict__. This is the same order as used by classic classes, so can't be wrong. Applied to 2.2.3 branch and 2.3 trunk, with test case.
msg16146 - (view)	Author: Tim Peters (tim.peters) *	Date: 2003-05-29 14:53
Logged In: YES user_id=31435 Just changed Resolution to Fixed. Thanks everyone!
msg16147 - (view)	Author: Jeff Epler (jepler)	Date: 2003-05-29 20:47
Logged In: YES user_id=2772 Just one question: does the final test-case cover the original bug? I have a redhat 9 system with Python 2.2.2, and temp2.py runs just fine: $ python2 temp2.py <__main__.Oops object at 0x8162ecc> $ The Python version is: Python 2.2.2 (#1, Feb 24 2003, 19:13:11) [GCC 3.2.2 20030222 (Red Hat Linux 3.2.2-4)] on linux2 I have not tried any other version of the test-case. I also didn't try a non-redhat version, so it's barely possible that some patch they apply affected this issue.
msg16148 - (view)	Author: Neal Norwitz (nnorwitz) *	Date: 2003-05-29 21:35
Logged In: YES user_id=33168 2.3 works for me with the original, but didn't before. Redhat 9.
msg16149 - (view)	Author: Mike C. Fletcher (mcfletch)	Date: 2003-05-31 19:21
Logged In: YES user_id=34901 2.2.3 fixes the original problem on Win2k.
msg16150 - (view)	Author: Tim Peters (tim.peters) *	Date: 2003-06-01 05:21
Logged In: YES user_id=31435 Which service pack ? Thanks for reporting this in time for 2.2.3, Mike.
msg16151 - (view)	Author: Kevin Jacobs (jacobs99)	Date: 2003-06-10 17:05
Logged In: YES user_id=459565 The fix checked in to solve this problem causes a potentially more serious one in Bug 751998. In short, it seems that data descriptors become unavailable in __del__ methods, preventing finalization from completing. I have yet to figure out why, though I have isolated the problem to revision 2.234 of typeobject.c. I've not looked to see if this also affects Python 2.2.3, though I wouldn't be too suprised if it does.
msg16152 - (view)	Author: Kevin Jacobs (jacobs99)	Date: 2003-06-10 18:25
Logged In: YES user_id=459565 See patch attached to bug 751998 for an alternative, friendlier fix for this bug.

History
Date	User	Action	Args
2022-04-10 16:08:54	admin	set	github: 38547
2003-05-24 22:29:35	mcfletch	create