[Python-bugs-list] [ python-Bugs-471942 ] python2.1.1 SEGV in GC on Solaris 2.7

noreply@sourceforge.net noreply@sourceforge.net
Thu, 18 Oct 2001 03:46:13 -0700


Bugs item #471942, was opened at 2001-10-16 19:56
You can respond by visiting: http://sourceforge.net/tracker/?func=detail&atid=105470&aid=471942&group_id=5470

Category: Python Interpreter Core
Group: Python 2.1.1
Status: Open
Resolution: None
Priority: 5
Submitted By: Anthony Baxter (anthonybaxter)
Assigned to: Neil Schemenauer (nascheme)
Summary: python2.1.1 SEGV in GC on Solaris 2.7

Initial Comment: I've got a Zope installation where python2.1.1 is segfaulting on Solaris2.7 - it's running a largish ZEO server. The tail of the gdb output is:

#128 0x26164 in PyEval_CallObjectWithKeywords ()
#129 0x264c0 in PyEval_CallObjectWithKeywords ()
#130 0x26140 in PyEval_CallObjectWithKeywords ()
#131 0x25fc0 in PyEval_CallObjectWithKeywords ()
#132 0x517bc in PyInstance_New ()
#133 0x261a4 in PyEval_CallObjectWithKeywords ()
#134 0x25fc0 in PyEval_CallObjectWithKeywords ()
#135 0x42c90 in initgc ()

It's built with

<anthony@devhost1>$ gcc -v
Reading specs from /opt/local/lib/gcc-lib/sparc-sun-solaris2.7/2.95.2/specs
gcc version 2.95.2 19991024 (release)

which is a bit old.

I'm going to rebuild with gcc3.0 and also try turning off the GC. Unfortunately I can't get this to happen on a smaller test system - it's only under load that it plows into the ground.

I'll also leave symbols in this time... :/

Comment By: Martin v. Löwis (loewis) Date: 2001-10-18 03:46

Message: Logged In: YES user_id=21627

This is something to ask the pythonlabs folks; I believe Python has been purified once in a while - perhaps even with Zope extensions. I'd still suspect a bug in the Zope extensions instead of a Python bug, though: if malloc crashes, chances are high that somebody was overwriting free memory.

Comment By: Anthony Baxter (anthonybaxter) Date: 2001-10-18 03:31

Message: Logged In: YES user_id=29957

It's not a GC object. I'm positive all the extension objects are correct - I just recompiled, without the 1.5/2.0 headers around. It's a different pointer each time round, unfortunately. It also takes anything from 5 minutes to 2 hours to reproduce. I've got about 4 copies of it running now, and I've got a bunch of different core files. I've grabbed purify and an eval license, and I'm feeding it the binary.

The printf approach is probably not going to work - these are busy busy Zope servers. Instead, my plan, assuming that purify doesn't immediately spot a problem, is to change the code so that if it gets a dud GC object, it will just bust it out of the tree and let it leak, and print a message saying so. Then I can quit the program, and purify will tell me 'hey, you leaked!' and also tell me where it was allocated.
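The "bust it out and let it leak" plan above can be sketched roughly as follows. This is only an illustration under stated assumptions: the demo_* types and functions are hypothetical stand-ins, not the definitions in gcmodule.c, and the GC header and object body are kept in one struct purely to keep the example short. It walks a circular doubly linked list, unlinks any node whose object has a null ob_type, and prints a message instead of dereferencing it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for illustration; NOT the real Python 2.1 types. */
typedef struct demo_type { const char *tp_name; } demo_type;
typedef struct demo_object { int ob_refcnt; demo_type *ob_type; } demo_object;

typedef struct demo_gc_head {
    struct demo_gc_head *gc_next;
    struct demo_gc_head *gc_prev;
    int gc_refs;
} demo_gc_head;

/* Header followed by the object body (assumed layout for this sketch). */
typedef struct demo_node {
    demo_gc_head head;
    demo_object obj;
} demo_node;

/* An empty circular list points at itself. */
static void demo_list_init(demo_gc_head *list)
{
    list->gc_next = list;
    list->gc_prev = list;
}

/* Link node in at the tail of the list. */
static void demo_list_append(demo_gc_head *list, demo_gc_head *node)
{
    node->gc_next = list;
    node->gc_prev = list->gc_prev;
    list->gc_prev->gc_next = node;
    list->gc_prev = node;
}

/* Unlink node; its memory is deliberately not freed (it leaks). */
static void demo_list_remove(demo_gc_head *node)
{
    assert(node->gc_next != NULL && node->gc_prev != NULL);
    node->gc_prev->gc_next = node->gc_next;
    node->gc_next->gc_prev = node->gc_prev;
    node->gc_next = NULL;
    node->gc_prev = NULL;
}

/* Drop every object with a null ob_type; return how many were dropped. */
static int demo_drop_duds(demo_gc_head *list, FILE *log)
{
    int dropped = 0;
    demo_gc_head *gc = list->gc_next;

    while (gc != list) {
        demo_gc_head *next = gc->gc_next;   /* save before unlinking */
        demo_object *op = &((demo_node *)gc)->obj;

        if (op->ob_type == NULL) {
            fprintf(log, "gc: dropping object %p with null ob_type (leaked)\n",
                    (void *)op);
            demo_list_remove(gc);
            dropped++;
        }
        gc = next;
    }
    return dropped;
}
```

Note that this only helps while gc_next and gc_prev of the dud object are still intact, which matches what the core file showed here. If the links themselves were overwritten, the walk would crash anyway.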

More concerning, about half the segfaults are not from the GC at all, but from realloc in PyFrame_New (line 161 of frameobject). These are the only two I'm getting - it's split 50-50 amongst the 10 coredumps I have now. I'm not sure whether to open a separate bug for this.

Has python2.1.1 been purified? With Zope and zope's extensions?

Wow - it's amazing how this SF bug thing is so painful for conversations :)

Comment By: Martin v. Löwis (loewis) Date: 2001-10-18 03:11

Message: Logged In: YES user_id=21627

There are two options:

a) the object isn't really a GC object, i.e. has no GC header. In gdb, you can try to cast gc to PyObject*, then see if the resulting pointer has a better ob_type (this is unlikely, though, since the logic entering the object was already using fromgc/togc)

b) somebody has cleared the ob_type field.

Can you guarantee that all extension modules have been compiled with the 2.1.1 header files?

Is the problem repeatable in the sense that gc will have the same pointer value on each crash? If so, it is relatively easy to track down: just set a gdb change watchpoint on the address of the ob_type field of that object (note that setting watchpoints is not possible until there is really mapped memory at that address).

If you can't analyse it through change breakpoints, I recommend annotating the interpreter in the following way: in pyobject_init, put a printf that prints the address and the tp_name of the type. In subtract_refs, if the ob_type slot is null, print the address of the object and abort. Then analyse the log to see whether an object really has been allocated at that address, and what its type was (make sure you consider the possibility that addresses are off by the delta that FROM_GC adds).
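A rough sketch of that annotation, and of why the logged addresses can differ by a fixed delta, might look like the following. The sketch_* names and struct layouts are hypothetical stand-ins chosen for illustration, not the real Python 2.1 definitions. The only assumption carried over from the discussion is that the GC header sits directly in front of the object, so converting between the two is pointer arithmetic by the header size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-ins for illustration; NOT the real Python 2.1 types. */
typedef struct sketch_type { const char *tp_name; } sketch_type;
typedef struct sketch_object { int ob_refcnt; sketch_type *ob_type; } sketch_object;

typedef struct sketch_gc_head {
    struct sketch_gc_head *gc_next;
    struct sketch_gc_head *gc_prev;
    int gc_refs;
} sketch_gc_head;

/* The object body starts right after its GC header. */
static sketch_object *sketch_from_gc(sketch_gc_head *g)
{
    assert(g != NULL);
    return (sketch_object *)(g + 1);
}

/* Inverse conversion: step back by one header. */
static sketch_gc_head *sketch_as_gc(sketch_object *op)
{
    assert(op != NULL);
    return (sketch_gc_head *)op - 1;
}

/* Allocate header plus object and log both addresses with the type name. */
static sketch_object *sketch_new(FILE *log, sketch_type *tp)
{
    sketch_gc_head *g = malloc(sizeof(sketch_gc_head) + sizeof(sketch_object));
    sketch_object *op;

    if (g == NULL)
        return NULL;
    g->gc_next = NULL;
    g->gc_prev = NULL;
    g->gc_refs = 0;
    op = sketch_from_gc(g);
    op->ob_refcnt = 1;
    op->ob_type = tp;
    fprintf(log, "init obj=%p gc=%p type=%s\n",
            (void *)op, (void *)g, tp ? tp->tp_name : "(null)");
    return op;
}

/* Check done before traversing: returns 1 if usable, 0 if ob_type is null.
   A debugging build would abort() at this point instead of returning. */
static int sketch_check(FILE *log, sketch_gc_head *g)
{
    sketch_object *op = sketch_from_gc(g);

    if (op->ob_type == NULL) {
        fprintf(log, "null ob_type: obj=%p gc=%p\n", (void *)op, (void *)g);
        return 0;
    }
    return 1;
}
```

Logging both the object address and the header address on every line means the crash address can be matched against the log directly, whichever of the two the debugger happens to show.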

Comment By: Anthony Baxter (anthonybaxter) Date: 2001-10-17 21:58

Message: Logged In: YES user_id=29957

Ok, I have an intact core file, and a matching binary, no optimisations, nothing. The core shows the crash at line 166 of gcmodule.c:

traverse = PyObject_FROM_GC(gc)->ob_type->tp_traverse;

PyObject_FROM_GC(gc)->ob_type in this case is

$24 = {ob_refcnt = 1, ob_type = 0x0}

To check my logic, I checked gc_next and gc_prev using the same GDB magic, and they correctly show up as a tuple and an instance method.

Some fiddling around seems to rule out stack space as the problem, as well. We're going to try and see if purify helps here, but the problem looks to be a junk object - I have no idea how to track this down further. Help? Would taking the horrible horrible hack of removing the object from the gc linked list if ob_type is null help? Well, it'd stop the crashes, anyway.

Comment By: Martin v. Löwis (loewis) Date: 2001-10-17 13:44

Message: Logged In: YES user_id=21627

It would be interesting to know what the value of "gc" is at the time of the crash. It looks like you got an object that claims to support GC but has a null tp_traverse.

Comment By: Anthony Baxter (anthonybaxter) Date: 2001-10-17 06:08

Message: Logged In: YES user_id=29957

I'm a doofus who read the gdb trace from the wrong end - too much python lately :) Nonetheless, the other end of the trace failed in gc as well - and building without GC enabled worked.

Here's the trace with debugging enabled:

#0 0xff00 in ?? ()
#1 0x402f0 in collect (young=0x9b538, old=0x9b544) at ./Modules/gcmodule.c:379
#2 0x405a8 in collect_generations () at ./Modules/gcmodule.c:484
#3 0x40624 in _PyGC_Insert (op=0xbc1f24) at ./Modules/gcmodule.c:507
#4 0x5a224 in PyList_New (size=0) at Objects/listobject.c:61
#5 0x21bc8 in eval_code2 (co=0x1cb370, globals=0x21bc0, locals=0x67, args=0x0, argcount=1, kws=0xf89b24, kwcount=0, defs=0x0, defcount=0, closure=0xbc1f24) at Python/ceval.c:1741

Next trick is to rebuild without any optimisation (sigh) as I suspect that it's inlined subtract_refs().
