Issue 28673: pyro4 with more than 15 threads often crashes 2.7.12 (original) (raw)

Issue28673

Created on 2016-11-12 08:31 by pwp333, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
core_python2.7.10.gz pwp333,2016-11-15 22:07
28673-reproduce.py methane,2016-11-24 11:58
28673-reproduce2.py methane,2016-11-25 03:07
finish-deadlock.py methane,2016-11-25 13:39
28673-fix-deadlock.patch methane,2016-11-27 07:38 review
Messages (19)
msg280640 - (view) Author: Michael Hu (pwp333) Date: 2016-11-12 08:31
When using pyro4 with more than 15 threads, python 2.7.12 cores frequently (>60% time) Note "v" (op in frame 1) in frame 2 is NULL which has some value in frame 3. So some other thread cleans it. === gdb === Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". Core was generated by `python /etc/remoting/remoting_agent.zip run --system'. Program terminated with signal SIGSEGV, Segmentation fault. #0 PyDict_SetItem (op=op@entry=0x0, key=key@entry=0x7f0aa11cd420, value=value@entry=0x920ce0 <_PyExc_AttributeError.12478>) at ../Objects/dictobject.c:832 warning: Source file is more recent than executable. 832 if (!PyDict_Check(op)) { (gdb) bt #0 PyDict_SetItem (op=op@entry=0x0, key=key@entry='exc_type', value=value@entry=<type at remote 0x920ce0>) at ../Objects/dictobject.c:832 #1 0x0000000000529e86 in PyDict_SetItemString (item=<type at remote 0x920ce0>, key=0x61d02a "exc_type", v=0x0) at ../Objects/dictobject.c:2469 #2 PySys_SetObject (name=name@entry=0x61d02a "exc_type", v=v@entry=<type at remote 0x920ce0>) at ../Python/sysmodule.c:83 #3 0x000000000055acc0 in set_exc_info (tb=<traceback at remote 0x7f0aa1232878>, value=exceptions.AttributeError("'NoneType' object has no attribute 'SHUT_RDWR'",), type=<type at remote 0x920ce0>, tstate=0x95ad10) at ../Python/ceval.c:3736 #4 PyEval_EvalFrameEx (f=f@entry=Frame 0x7f0a9e929b60, for file /etc/remoting/remoting_agent.zip/Pyro4/socketutil.py, line 463, in close (self=<SocketConnection at remote 0x7f0a9e924510>), throwflag=throwflag@entry=0) at ../Python/ceval.c:3251 #5 0x0000000000559c27 in fast_function (nk=, na=, n=1, pp_stack=0x7ffce435ab00, func=<function at remote 0x7f0a9f6c0230>) at ../Python/ceval.c:4435 #6 call_function (oparg=, pp_stack=0x7ffce435ab00) at ../Python/ceval.c:4370 #7 PyEval_EvalFrameEx (f=f@entry=Frame 0x7f0a40000b50, for file /etc/remoting/remoting_agent.zip/Pyro4/socketutil.py, line 453, in __del__ (self=<SocketConnection at remote 0x7f0a9e924510>), throwflag=throwflag@entry=0) at ../Python/ceval.c:2987 #8 0x000000000056aa8a in PyEval_EvalCodeEx (closure=, defcount=, defs=0x0, kwcount=, kws=, argcount=1073744720, args=, locals=0x0, globals=, co=) at ../Python/ceval.c:3582 #9 function_call (func=func@entry=<function at remote 0x7f0a9f6c00c8>, arg=arg@entry=(<SocketConnection at remote 0x7f0a9e924510>,), kw=kw@entry=0x0) at ../Objects/funcobject.c:523 #10 0x00000000004be724 in PyObject_Call (kw=0x0, arg=(<SocketConnection at remote 0x7f0a9e924510>,), func=<function at remote 0x7f0a9f6c00c8>) at ../Objects/abstract.c:2546 #11 instancemethod_call.8803 (func=<function at remote 0x7f0a9f6c00c8>, arg=(<SocketConnection at remote 0x7f0a9e924510>,), kw=0x0) at ../Objects/classobject.c:2602 #12 0x00000000004c4683 in PyObject_Call (kw=0x0, arg=(), func=<instancemethod at remote 0x7f0a9f7b5b90>) at ../Objects/abstract.c:2546 #13 PyEval_CallObjectWithKeywords (kw=0x0, arg=(), func=<instancemethod at remote 0x7f0a9f7b5b90>) at ../Python/ceval.c:4219 #14 slot_tp_del.25647 (self=self@entry=<SocketConnection at remote 0x7f0a9e924510>) at ../Objects/typeobject.c:5844 #15 0x00000000004c5880 in subtype_dealloc.25650 (self=<SocketConnection at remote 0x7f0a9e924510>) at ../Objects/typeobject.c:1002 #16 0x00000000005336cf in dict_dealloc.18423 (mp=0x7f0a9e93c398) at ../Objects/dictobject.c:1040 #17 0x00000000004c56e7 in subtype_dealloc.25650 ( self=<ClientConnectionJob(daemon=<Daemon(_Daemon__mustshutdown=<_Event(_Verbose__verbose=False, _Event__flag=True, _Event__cond=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f0a9f759790>, acquire=<built-in method acquire of thread.lock object at remote 0x7f0a9f759790>, _Condition__waiters=[], release=<built-in method release of thread.lock object at remote 0x7f0a9f759790>) at remote 0x7f0a9e9241d0>) at remote 0x7f0a9e924190>, objectsById={'root': <RootService(_scheduler=<Scheduler(_lock=<thread.lock at remote 0x7f0aa1211d50>, _operations={'cookie_9de0e8d93ed84452a1ce8cc92268979e': <Operation(_exception=None, _result='obj_450f0972022e41b2b7c1f29f49bdca83', _action=<function at remote 0x7f0a9f7b89b0>, _lock=<_RLock(_Verbose__verbose=False, _RLock__owner=None, _RLock__block=<thread.lock at remote 0x7f0a9f759370>, _RLock__count=0) at remote 0x7f0a9e936190>, _started_at=<float at remote 0xab90f8>, cookie='cookie_9de0e8d93ed84452a1ce8cc92268979e', _complete=<_Condition(_Condition__lock=...(truncated)) at ../Objects/typeobject.c:1035 #18 0x0000000000534b93 in frame_dealloc.14921 (f=Frame 0x7f0a9f6d49d8, for file /etc/remoting/remoting_agent.zip/Pyro4/socketserver/threadpool.py, line 39, in run ()) at ../Objects/frameobject.c:458 #19 0x00000000004d86c6 in tb_dealloc.46270 (tb=0x7f0aa1278f38) at ../Python/traceback.c:28 #20 0x00000000004d87e1 in tb_dealloc.46270 (tb=0x7f0aa1278ea8) at ../Python/traceback.c:27 #21 0x00000000005336cf in dict_dealloc.18423 (mp=0x7f0aa1277280) at ../Objects/dictobject.c:1040 #22 0x00000000005a52db in PyInterpreterState_Clear (interp=0x95ac80) at ../Python/pystate.c:111 #23 0x0000000000423aea in Py_Finalize () at ../Python/pythonrun.c:500 #24 0x00000000004672c5 in Py_Main (argc=, argv=0x7ffce435b2d8) at ../Modules/main.c:665 #25 0x00007f0aa0aa7f45 in __libc_start_main (main=0x4672f4
, argc=4, argv=0x7ffce435b2d8, init=, fini=, rtld_fini=, stack_end=0x7ffce435b2c8) at libc-start.c:287 #26 0x000000000057c581 in _start () ==== Google shows similar trace has been reported back in 2009 for python 2.6 https://bugs.launchpad.net/ubuntu/+source/system-config-printer/+bug/478071
msg280664 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2016-11-12 17:30
Can you provide a short sample that causes this error? Without a way to reproduce it, there's not a lot we can do.
msg280750 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2016-11-14 11:04
Traceback tells: #22 0x00000000005a52db in PyInterpreterState_Clear (interp=0x95ac80) at ../Python/pystate.c:111 Py_CLEAR(interp->sysdict); While cleaning up interp->sysdict... #19 0x00000000004d86c6 in tb_dealloc.46270 (tb=0x7f0aa1278f38) at ../Python/traceback.c:28 #20 0x00000000004d87e1 in tb_dealloc.46270 (tb=0x7f0aa1278ea8) at ../Python/traceback.c:27 Removing some traceback from sysdict. Maybe, sys.exc_traceback. #17 0x00000000004c56e7 in subtype_dealloc.25650 ( self=<ClientConnectionJob(daemon=<Daemon(_Daemon__mustshutdown=<_Event(_Verbose__verbose=False, _Event__flag=True, _Event__cond=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f0a9f759790>, acquire=<built-in method acquire of thread.lock object at remote 0x7f0a9f759790>, _Condition__waiters=[], release=<built-in method release of thread.lock object at remote 0x7f0a9f759790>) at remote 0x7f0a9e9241d0>) at remote 0x7f0a9e924190>, objectsById={'root': <RootService(_scheduler=<Scheduler(_lock=<thread.lock at remote 0x7f0aa1211d50>, _operations={'cookie_9de0e8d93ed84452a1ce8cc92268979e': <Operation(_exception=None, _result='obj_450f0972022e41b2b7c1f29f49bdca83', _action=<function at remote 0x7f0a9f7b89b0>, _lock=<_RLock(_Verbose__verbose=False, _RLock__owner=None, _RLock__block=<thread.lock at remote 0x7f0a9f759370>, _RLock__count=0) at remote 0x7f0a9e936190>, _started_at=<float at remote 0xab90f8>, cookie='cookie_9de0e8d93ed84452a1ce8cc92268979e', _complete=<_Condition(_Condition__lock=...(truncated)) at ../Objects/typeobject.c:1035 #18 0x0000000000534b93 in frame_dealloc.14921 (f=Frame 0x7f0a9f6d49d8, for file /etc/remoting/remoting_agent.zip/Pyro4/socketserver/threadpool.py, line 39, in run ()) at ../Objects/frameobject.c:458 Removing frame object owned by the traceback. ... #7 PyEval_EvalFrameEx (f=f@entry=Frame 0x7f0a40000b50, for file /etc/remoting/remoting_agent.zip/Pyro4/socketutil.py, line 453, in __del__ (self=<SocketConnection at remote 0x7f0a9e924510>), throwflag=throwflag@entry=0) at ../Python/ceval.c:2987 #4 PyEval_EvalFrameEx (f=f@entry=Frame 0x7f0a9e929b60, for file /etc/remoting/remoting_agent.zip/Pyro4/socketutil.py, line 463, in close (self=<SocketConnection at remote 0x7f0a9e924510>), throwflag=throwflag@entry=0) at ../Python/ceval.c:3251 Calling SocketConnection.__del__. It calls SocketConnection.close. #3 0x000000000055acc0 in set_exc_info (tb=<traceback at remote 0x7f0aa1232878>, value=exceptions.AttributeError("'NoneType' object has no attribute 'SHUT_RDWR'",), type=<type at remote 0x920ce0>, tstate=0x95ad10) at ../Python/ceval.c:3736 SocketConnection.close raise AttributeError. #2 PySys_SetObject (name=name@entry=0x61d02a "exc_type", v=v@entry=<type at remote 0x920ce0>) at ../Python/sysmodule.c:83 Trying to register the exception to sys. #0 PyDict_SetItem (op=op@entry=0x0, key=key@entry='exc_type', value=value@entry=<type at remote 0x920ce0>) at ../Objects/dictobject.c:832 But interp->sysdict is NULL already.
msg280896 - (view) Author: Michael Hu (pwp333) Date: 2016-11-15 22:07
Core is uploaded for python 2.7.10 to assist debugging.
msg280897 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-11-15 22:11
Try to get the Python traceback of thread on the crash: try the faulthandler module.
msg280909 - (view) Author: Michael Hu (pwp333) Date: 2016-11-16 02:31
(gdb) py-bt #4 Frame 0x7f0ab7a2db60, for file /etc/remoting/remoting_agent.zip/Pyro4/socketutil.py, line 463, in close (self=<SocketConnection at remote 0x7f0ab7a24190>) self.sock.shutdown(socket.SHUT_RDWR) #7 Frame 0x7f0ab0001760, for file /etc/remoting/remoting_agent.zip/Pyro4/socketutil.py, line 453, in __del__ (self=<SocketConnection at remote 0x7f0ab7a24190>) self.close() (gdb) py-list 458 def recv(self, size): 459 return receiveData(self.sock, size) 460 461 def close(self): 462 try: >463 self.sock.shutdown(socket.SHUT_RDWR) 464 except (OSError, socket.error): 465 pass 466 try: 467 self.sock.close() 468 except AttributeError: (gdb) f 7 #7 PyEval_EvalFrameEx ( f=f@entry=Frame 0x7f0ab0001760, for file /etc/remoting/remoting_agent.zip/Pyro4/socketutil.py, line 453, in __del__ (self=<SocketConnection at remote 0x7f0ab7a24190>), throwflag=throwflag@entry=0) at ../Python/ceval.c:2681 2681 in ../Python/ceval.c (gdb) py-list 448 def __init__(self, sock, objectId=None): 449 self.sock = sock 450 self.objectId = objectId 451 452 def __del__(self): >453 self.close() 454 455 def send(self, data): 456 sendData(self.sock, data) 457 458 def recv(self, size): (gdb)
msg280911 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2016-11-16 04:51
Can you check backtrace of main thread?
msg280921 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-11-16 07:13
Ok but can you also try faulthandler to see all frames of all threads?
msg280922 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-11-16 07:15
By the way, you can try your application on Python 3.6 using PYTHONMALLOC=debug env var. Or retry your application on Python 2.7 recompiled using ./configure --with-pydebug. I would like to know if Python sees a buffer overflow.
msg281067 - (view) Author: Michael Hu (pwp333) Date: 2016-11-17 23:39
Could you please look into the core (attached into this bug) to figure out? This issue does not happen always. We are tight on resource to reproduce this.
msg281579 - (view) Author: Irmen de Jong (irmen) (Python triager) Date: 2016-11-23 19:00
Due to lack of example code to reproduce the issue, and because I'm mildly interested in this bug because it mentions Pyro4 (because I'm the author of that) I've tried to crash my system myself using Pyro4 and a simple torture test but it trucked on just fine. Pyro4 is not doing any "strange" things as far as I am aware. It does have its own (simple) thread pool of regular Python threads that are handling incoming proxy connections. (Had Pyro4 been doing weird things I suppose Python itself still should never core dump on the user but rather raise a regular exception if something was wrong)
msg281630 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2016-11-24 11:58
I can't get same C traceback yet. But attached script segfaults by different C traceback. I think this issue is dup of #1856
msg281665 - (view) Author: Irmen de Jong (irmen) (Python triager) Date: 2016-11-24 22:28
The 28673-reproduce.py didn't crash on any of the systems I've tried it on. Are you sure it is complete? It looks like a part is missing.
msg281676 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2016-11-25 03:07
I'm sorry, I had uploaded script with wrong parameter. Additionally, this script fails by chance. $ while true; do python2 28673-reproduce2.py | break; echo -n .; done .....Segmentation fault (core dumped)
msg281680 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2016-11-25 06:52
Python 3 would have #21963 too. If we can fix it, we can fix this issue too.
msg281718 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2016-11-25 13:39
This script can cause deadlock on Python 3.6. 1. Other threads can register traceback to sys module after sys module dict is cleaned up. 2. In Py_Finalize, __del__ method of arbitrary object in the traceback can be called. 3. The __del__ method calls threading.Thread.start() 4. Thread.start() waits self.__started event which is set by Thread.__bootstrap 5. While finalizing, threads other than main thread exits when getting GIL. So Thread.__bootstrap() won't be called. 6. Thread.start() in main thread waits self.__started event forever. I suggest thread_PyThread_start_new_thread() raise RuntimeError("can't start thread: finalizing")
msg281721 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-11-25 14:25
> I suggest thread_PyThread_start_new_thread() raise RuntimeError("can't start thread: finalizing") Where *exactly*? Py_FinalizeEx() first calls wait_for_thread_shutdown() and *then* sets _Py_Finalizing to tstate. Does wait_for_thread_shutdown() complete on this bug? Note: It's not the first time that I noticed a deadlock around this code :-/ I recall a silly bug when you press CTRL+c while wait_for_thread_shutdown() is running. See for example threading_shutdown_interrupted.py attached to the issue #20526. Run it and quickly press CTRL+c. Sometime, it does crash Python with a segfault... --- haypo@selma$ ./python threading_shutdown_interrupted.py ..........................................................................................................................................................................................................Exception ignored in: <module 'threading' from '/home/haypo/prog/python/default/Lib/threading.py'> Traceback (most recent call last): File "/home/haypo/prog/python/default/Lib/threading.py", line 1290, in _shutdown t.join() File "/home/haypo/prog/python/default/Lib/threading.py", line 1056, in join . self._wait_for_tstate_lock() File "/home/haypo/prog/python/default/Lib/threading.py", line 1072, in _wait_for_tstate_lock ... elif lock.acquire(block, timeout): File "threading_shutdown_interrupted.py", line 29, in killer .. raise KeyboardInterrupt() KeyboardInterrupt: ^CSegmentation fault (core dumped) --- Python finalization is a very fragile and complex task :-/
msg281811 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2016-11-27 07:38
> Where *exactly*? See attached patch. But python with this patch can deadlock in other state. Main thread wait GIL but no other living threads have GIL. It seems caused by other issue around finalization and multithreading. > Py_FinalizeEx() first calls wait_for_thread_shutdown() and *then* sets _Py_Finalizing to tstate. > Does wait_for_thread_shutdown() complete on this bug? Yes. wait_for_thread_shutdown() waits only daemon threads. After _Py_Finalizing = tstate, main thread calls many __del__ methods while shutdown. If one of them starts new thread, this deadlock happens.
msg335594 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2019-02-15 09:56
I think this issue is fixed by #21963.
History
Date User Action Args
2022-04-11 14:58:39 admin set github: 72859
2019-02-15 09:56:24 methane set status: open -> closedresolution: fixedmessages: + stage: test needed -> resolved
2016-11-27 07:38:24 methane set files: + 28673-fix-deadlock.patchkeywords: + patchmessages: +
2016-11-25 14:25:22 vstinner set messages: +
2016-11-25 13:39:14 methane set files: + finish-deadlock.pymessages: +
2016-11-25 06:52:40 methane set messages: +
2016-11-25 03:07:22 methane set files: + 28673-reproduce2.pymessages: +
2016-11-24 22:28:24 irmen set messages: +
2016-11-24 11:58:59 methane set files: + 28673-reproduce.pymessages: +
2016-11-23 19:00:53 irmen set nosy: + irmenmessages: +
2016-11-18 17:35:55 terry.reedy set title: When using pyro4 with more than 15 threads, python 2.7.12 cores frequently -> pyro4 with more than 15 threads often crashes 2.7.12stage: test needed
2016-11-17 23:39:05 pwp333 set messages: +
2016-11-16 07:15:23 vstinner set messages: +
2016-11-16 07:13:30 vstinner set messages: +
2016-11-16 04:51:16 methane set messages: +
2016-11-16 02:31:06 pwp333 set messages: +
2016-11-15 22:11:00 vstinner set nosy: + vstinnermessages: +
2016-11-15 22:07:44 pwp333 set files: + core_python2.7.10.gzmessages: +
2016-11-14 11:04:36 methane set nosy: + methanemessages: +
2016-11-12 17:30:22 eric.smith set nosy: + eric.smithmessages: +
2016-11-12 08:31:38 pwp333 create