Issue 31426: [3.5] crash in gen_traverse(): gi_frame.ob_type=NULL, called by subtract_refs() during a GC collection
Created on 2017-09-12 08:35 by iwienand, last changed 2022-04-11 14:58 by admin. This issue is now closed.
Messages (6)
Author: Ian Wienand (iwienand) *
Date: 2017-09-12 08:34
Using 3.5.2-2ubuntu0~16.04.3 (Xenial) we see an occasional segfault during garbage collection of a generator object.
A full backtrace is attached, but the crash appears to be triggered inside gen_traverse during GC:
(gdb) info args
gen = 0x7f22385f0150
visit = 0x50eaa0
arg = 0x0
(gdb) print *gen
$109 = {ob_base = {ob_refcnt = 1, ob_type = 0xa35760 }, gi_frame = 0x386aed8, gi_running = 1 '\001', gi_code = <code at remote 0x7f223bb42f60>, gi_weakreflist = 0x0, gi_name = 'linesplit', gi_qualname = 'linesplit'}
I believe gen_traverse is doing the following:
static int
gen_traverse(PyGenObject *gen, visitproc visit, void *arg)
{
    Py_VISIT((PyObject *)gen->gi_frame);
    Py_VISIT(gen->gi_code);
    Py_VISIT(gen->gi_name);
    Py_VISIT(gen->gi_qualname);
    return 0;
}
The problem here is that this generator's gen->gi_frame has managed to acquire a NULL object type but still has references:
(gdb) print *gen->gi_frame
$112 = {ob_base = {ob_base = {ob_refcnt = 2, ob_type = 0x0}, ob_size = 0}, f_back = 0x0, f_code = 0xca3e4fd8950fef91, ...
Thus it gets visited and it doesn't go well.
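To see from Python which objects a generator's traverse function hands to the visit callback, gc.get_referents() can be used, since it calls the object's C-level tp_traverse method. This is only an illustrative sketch: linesplit_like is a hypothetical stand-in for the generator in this report, and the exact list of referents depends on the Python version (newer releases store the frame differently than 3.5 does).

```python
import gc


def linesplit_like():
    # Hypothetical stand-in for the linesplit generator from the report.
    yield "line\n"


gen = linesplit_like()

# gc.get_referents() invokes the object's tp_traverse slot (gen_traverse for
# generators) and returns every object that was passed to the visit callback.
referents = gc.get_referents(gen)

for obj in referents:
    # Print only the type names; the concrete objects vary by Python version.
    print(type(obj).__name__)
```

During a real collection, the GC runs the same traversal with its own callbacks, which is why a corrupted object among those referents can crash the collector.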
I have attached the py-bt as well. It's very deep, with ansible, multiprocessing forking, imp.load_source() importing ... basically a nightmare. Unfortunately, I have not managed to get it down to any sort of minimal test case. This happens fairly infrequently, which suggests a race. The generator in question has a socket involved:
def linesplit(socket):
    buff = socket.recv(4096).decode("utf-8")
    buffering = True
    while buffering:
        if "\n" in buff:
            (line, buff) = buff.split("\n", 1)
            yield line + "\n"
        else:
            more = socket.recv(4096).decode("utf-8")
            if not more:
                buffering = False
            else:
                buff += more
    if buff:
        yield buff
Wild speculation but maybe something to do with finalizing generators with file-descriptors across fork()?
At this point we are trying a work-around of not having the above socket reading routine in a generator but just a "regular" loop. As it triggers as part of a production roll-out I'm not sure we can do too much more debugging. Unless this rings any immediate bells for people, we can probably just have this for tracking at this point. [1] is the original upstream issue
[1] https://storyboard.openstack.org/#!/story/2001186#comment-17441
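For illustration, the work-around described above (reading the socket in a "regular" loop instead of a generator) could look roughly like the following. This is a minimal sketch, not the code that was actually deployed: read_lines is a hypothetical name, and unlike the generator it only returns once the peer has closed the connection.

```python
import socket


def read_lines(sock):
    """Eagerly read all lines from sock; no generator object is created."""
    lines = []
    buff = ""
    while True:
        # Same chunked read and decode as linesplit() in the report.
        more = sock.recv(4096).decode("utf-8")
        if not more:
            break  # peer closed the connection
        buff += more
        # Split off every complete line currently in the buffer.
        while "\n" in buff:
            line, buff = buff.split("\n", 1)
            lines.append(line + "\n")
    if buff:
        lines.append(buff)  # trailing data without a newline
    return lines


if __name__ == "__main__":
    # Small self-contained demo over a local socket pair.
    a, b = socket.socketpair()
    a.sendall(b"one\ntwo\nthree")
    a.close()
    print(read_lines(b))
    b.close()
```

Like the original generator, this sketch decodes each chunk separately, so a multi-byte UTF-8 character split across two recv() calls would raise a decode error.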
Author: STINNER Victor (vstinner) *
Date: 2017-09-12 09:22
Python 3.5 moved to security-only fixes recently; it doesn't accept bug fixes anymore: https://devguide.python.org/#status-of-python-branches
It would be nice to test with Python 3.5.4 at least, or better, Python 3.6.x.
(gdb) print *gen->gi_frame
$112 = {ob_base = {ob_base = {ob_refcnt = 2, ob_type = 0x0}, ob_size = 0}, f_back = 0x0, f_code = 0xca3e4fd8950fef91, ...
ob_type should never be NULL for an object that is still reachable and has a non-zero reference count. It looks like a bug in a C extension. It would help to test your application on a Python compiled in debug mode.
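One quick way to confirm that the interpreter under test really is a debug build is sketched below. It relies on sys.gettotalrefcount(), which exists only in debug builds, and on the Py_DEBUG build variable; is_debug_build is a hypothetical helper name, not something from this report.

```python
import sys
import sysconfig


def is_debug_build():
    """Return True if this interpreter was compiled in debug mode."""
    # sys.gettotalrefcount() is only defined when Python is built with
    # debugging enabled (for example ./configure --with-pydebug).
    return hasattr(sys, "gettotalrefcount")


if __name__ == "__main__":
    print("debug build:", is_debug_build())
    # Py_DEBUG is the corresponding build-time variable; it may be None
    # on platforms that do not record it.
    print("Py_DEBUG:", sysconfig.get_config_var("Py_DEBUG"))
```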
Author: STINNER Victor (vstinner) *
Date: 2017-09-12 09:31
I pointed Ian to bpo-26617 since Python 3.5.2 contains this GC crash, but it seems like it's not the same bug.
Author: STINNER Victor (vstinner) *
Date: 2017-09-12 09:36
I pointed Ian to bpo-26617 since Python 3.5.2 contains this GC crash, but it seems like it's not the same bug.
Ah, I found a report where bpo-26617 crashed in subtract_refs(): https://stackoverflow.com/questions/39990934/debugging-python-segmentation-faults-in-garbage-collection
So the crash is not limited to update_refs() called during a GC collection.
Author: Victor Zhestkov (vzhestkov)
Date: 2021-07-20 07:34
It seems I have the same segfault, but with Python 3.6.13 shipped with SLE15SP2. It's a salt-api process under intensive usage. I'm able to reproduce it, but can't isolate it due to the service's complexity. In some cases it takes about 5 minutes to crash, but in others it can run without crashing for an hour or more (I keep the workload on this service with a kind of stress test).
Author: STINNER Victor (vstinner) *
Date: 2021-08-04 15:32
This bug report mentions Python 3.5 and 3.6, which no longer accept bugfixes. Since nobody has reported this issue on Python 3.9 and newer (which still accept bugfixes), I am closing the issue as out of date.
Victor Zhestkov:
It seems I have the same segfault, but with Python 3.6.13 shipped with SLE15SP2. It's a salt-api process under intensive usage. I'm able to reproduce it, but can't isolate it due to the service's complexity. In some cases it takes about 5 minutes to crash, but in others it can run without crashing for an hour or more (I keep the workload on this service with a kind of stress test).
See my notes to debug crashes happening during GC collections: https://pythondev.readthedocs.io/debug_tools.html#debug-crash-in-garbage-collection-visit-decref
You can try a much smaller GC threshold: call gc.set_threshold(5) at the very beginning of your application.
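A minimal sketch of that suggestion follows. The helper name enable_aggressive_gc is hypothetical; only gc.get_threshold() and gc.set_threshold() come from the standard library. Collections then run far more often, so a corrupted object is more likely to be detected close to the code that damaged it, at the cost of slower execution.

```python
import gc


def enable_aggressive_gc(threshold0=5):
    """Lower the first GC threshold and return the previous settings."""
    previous = gc.get_threshold()
    # Passing a single argument changes only threshold0; the thresholds
    # of the older generations are left as they were.
    gc.set_threshold(threshold0)
    return previous


if __name__ == "__main__":
    # Call this before the rest of the application starts.
    old = enable_aggressive_gc(5)
    print("previous thresholds:", old)
    print("current thresholds:", gc.get_threshold())
```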
I strongly advise you to use a debug build of Python, since it includes many more debug checks.
I also strongly advise you to upgrade Python. I added many debug checks for object consistency in the GC in recent Python releases (3.8, 3.9, 3.10), and when a bug arises, Python dumps much more information about the faulty Python object.
Good luck debugging it. But please don't comment on this closed issue. Python 3.6 is no longer supported.
History
Date | User | Action | Args
2022-04-11 14:58:52 | admin | set | github: 75607
2021-08-04 15:32:05 | vstinner | set | status: open -> closed; resolution: out of date; messages: +; stage: resolved
2021-07-20 08:50:43 | mcepl | set | nosy: + mcepl; versions: + Python 3.6
2021-07-20 07:35:35 | vzhestkov | set | files: + gbd-bt-brief.txt
2021-07-20 07:35:18 | vzhestkov | set | files: + py-bt.txt
2021-07-20 07:35:00 | vzhestkov | set | files: + gdb-bt-full.txt; nosy: + vzhestkov; messages: +
2017-09-12 09:42:55 | vstinner | set | title: [3.5] gen_traverse(): gi_frame.ob_type=NULL when called by subtract_refs() during a GC collection -> [3.5] crash in gen_traverse(): gi_frame.ob_type=NULL, called by subtract_refs() during a GC collection
2017-09-12 09:36:19 | vstinner | set | messages: +
2017-09-12 09:34:02 | vstinner | set | nosy: + yselivanov
2017-09-12 09:33:35 | vstinner | set | title: Segfault during GC of generator object; invalid gi_frame? -> [3.5] gen_traverse(): gi_frame.ob_type=NULL when called by subtract_refs() during a GC collection
2017-09-12 09:31:19 | vstinner | set | nosy: + pitrou, serhiy.storchaka; messages: +
2017-09-12 09:22:31 | vstinner | set | nosy: + vstinner; messages: +
2017-09-12 08:35:51 | iwienand | set | files: + crash-py-bt.txt
2017-09-12 08:35:09 | iwienand | create |