Skip to content

The main thread hang at drop_gil due to FORCE_SWITCHING after daemon thread exited #96387

@milizao

Description

@milizao

Bug report

Main thread hang at drop_gil after daemon thread exited, my Python version is 3.9.2, but I think the same issue is exists in the Python 3.10 newest version.

(gdb) info threads Id Target Id Frame * 1 Thread 0xffff909dd010 (LWP 226769) "python" 0x0000ffff909355c0 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0xaaaadf99e8d4 <_PyRuntime+524>) at ../sysdeps/unix/sysv/linux/futex-internal.h:88 (gdb) bt #0 0x0000ffff909355c0 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0xaaaadf99e8d4 <_PyRuntime+524>) at ../sysdeps/unix/sysv/linux/futex-internal.h:88 #1 __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0xaaaadf99e8d8 <_PyRuntime+528>, cond=0xaaaadf99e8a8 <_PyRuntime+480>) at pthread_cond_wait.c:508 #2 __pthread_cond_wait (cond=cond@entry=0xaaaadf99e8a8 <_PyRuntime+480>, mutex=mutex@entry=0xaaaadf99e8d8 <_PyRuntime+528>) at pthread_cond_wait.c:638 #3 0x0000aaaadf35d184 in drop_gil (ceval=0xaaaadf99e820 <_PyRuntime+344>, ceval@entry=0xaaaadf300274 <PyObject_Hash+116>, ceval2=ceval2@entry=0xaaaafbc61ff0, tstate=tstate@entry=0xaaaafbc64a00) at Python/ceval_gil.h:178 #4 0x0000aaaadf2c05c4 in eval_frame_handle_pending (tstate=0xaaaafbc64a00) at Python/ceval.c:882 #5 _PyEval_EvalFrameDefault (tstate=<optimized out>, f=0xffff83929d40, throwflag=<optimized out>) at Python/ceval.c:1402 #6 0x0000aaaadf2bee1c in _PyEval_EvalFrame (throwflag=0, f=0xffff83929d40, tstate=0xaaaafbc64a00) at ./Include/internal/pycore_ceval.h:40 #7 function_code_fastcall (tstate=0xaaaafbc64a00, co=<optimized out>, args=<optimized out>, nargs=1, globals=<optimized out>) at Objects/call.c:330 #8 0x0000aaaadf31ac04 in _PyObject_VectorcallTstate (kwnames=<optimized out>, nargsf=<optimized out>, args=<optimized out>, callable=<optimized out>, tstate=<optimized out>) at ./Include/cpython/abstract.h:118 #9 _PyObject_CallNoArg (func=<optimized out>) at ./Include/cpython/abstract.h:172 #10 call_unbound_noarg (self=<optimized out>, func=0xffff83691ee0, unbound=<optimized out>) at Objects/typeobject.c:1515 #11 slot_tp_finalize (self=<optimized out>) at Objects/typeobject.c:6992 #12 0x0000aaaadf30011c in PyObject_CallFinalizer (self=self@entry=0xffff837ea6d0) at Objects/object.c:195 #13 0x0000aaaadf302694 in PyObject_CallFinalizerFromDealloc (self=self@entry=0xffff837ea6d0) at Objects/object.c:213 #14 0x0000aaaadf314b1c in subtype_dealloc (self=0xffff837ea6d0) at Objects/typeobject.c:1265 #15 0x0000aaaadf2f3f4c in _Py_DECREF (op=<optimized out>) at ./Include/object.h:430 #16 _Py_XDECREF (op=<optimized out>) at ./Include/object.h:497 #17 insertdict (mp=mp@entry=0xffff8383de80, key=0xffff837450b0, hash=-5519671433175608536, value=value@entry=0xaaaadf94be68 <_Py_NoneStruct>) at Objects/dictobject.c:1123 #18 0x0000aaaadf2f52d4 in PyDict_SetItem (op=op@entry=0xffff8383de80, key=<optimized out>, value=value@entry=0xaaaadf94be68 <_Py_NoneStruct>) at Objects/dictobject.c:1573 #19 0x0000aaaadf2ffa48 in _PyModule_ClearDict (d=0xffff8383de80) at Objects/moduleobject.c:612 #20 0x0000aaaadf2ffc84 in _PyModule_Clear (m=m@entry=0xffff837f1a40) at Objects/moduleobject.c:560 #21 0x0000aaaadf3798e8 in _PyImport_Cleanup (tstate=tstate@entry=0xaaaafbc64a00) at Python/import.c:606 #22 0x0000aaaadf38e63c in Py_FinalizeEx () at Python/pylifecycle.c:1426 #23 0x0000aaaadf2c9110 in Py_RunMain () at Modules/main.c:679 #24 0x0000aaaadf2c99f8 in pymain_main (args=args@entry=0xffffd3754d88) at Modules/main.c:707 #25 0x0000aaaadf2c9ad4 in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:731 #26 0x0000ffff906c1f80 in __libc_start_main (main=0x0, argc=0, argv=0x0, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=<optimized out>) at ../csu/libc-start.c:308 #27 0x0000aaaadf2c87bc in _start () Backtrace stopped: previous frame identical to this frame (corrupt stack?) (gdb) py-bt Traceback (most recent call first): File "/home/ossadm/zhao/runtime.py", line 8, in __del__ def __del__(self): (gdb) 

Rough analysis

In Python version 3.9.2, the take_gil is implemented as following:

static void take_gil(PyThreadState *tstate) { ... while (_Py_atomic_load_relaxed(&gil->locked)) { unsigned long saved_switchnum = gil->switch_number; unsigned long interval = (gil->interval >= 1 ? gil->interval : 1); int timed_out = 0; COND_TIMED_WAIT(gil->cond, gil->mutex, interval, timed_out); /* If we timed out and no switch occurred in the meantime, it is time  to ask the GIL-holding thread to drop it. */ if (timed_out && _Py_atomic_load_relaxed(&gil->locked) && gil->switch_number == saved_switchnum) { if (tstate_must_exit(tstate)) { MUTEX_UNLOCK(gil->mutex); PyThread_exit_thread(); // [2] } assert(is_tstate_valid(tstate)); SET_GIL_DROP_REQUEST(interp); // [1] } } ... }

A daemon thread may call SET_GIL_DROP_REQUEST and then exits from PyThread_exit_thread, so this thread will not get GIL really. This happens if the main thread is on the finalizing procedure and it calls _PyRuntimeState_SetFinalizing in Py_FinalizeEx before the daemon thread gets GIL (The daemon thread wait 5ms at most before exit).

After that, the flag gil_drop_request is set, but only the main thread is left. Then, if the main thread runs into drop_gil again, it will hang at FORCE_SWITCHING:

COND_WAIT(gil->switch_cond, gil->switch_mutex);

Reproducible example

This issue could be reproduced by:

$ python -m http.server 5678 & $ for ((i=0;;i++)); do echo $i; python -W all service.py; done

And the scripts used are:

runtime.py

import socket class TestUnclosedSocket: def __init__(self): self.s = socket.socket() def __del__(self): self.s.connect(('127.0.0.1', 5678)) ins = TestUnclosedSocket()

service.py

import threading import runtime import sys sys.setswitchinterval(0.001) def calc(): sum = 0 for i in range(100000): sum += i * i def daemon_func(): while True: calc() if __name__ == '__main__': threading.Thread(target=daemon_func, name="daemon", daemon=True).start() calc()

I change GIL switch interval to 1ms to increase the probability here.

Metadata

Metadata

Assignees

No one assigned

    Labels

    interpreter-core(Objects, Python, Grammar, and Parser dirs)type-bugAn unexpected behavior, bug, or error

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions