[lldb-dap] Race condition in the event handler thread · Issue #131242 · llvm/llvm-project (original) (raw)
I discovered that TestDAP_breakpointEvents.py
(macOS only) was failing nondeterministically. I tracked the problem down to the event handler thread (EventThreadFunction
) where the check that a breakpoint has the DAP label would sometimes return false, even though the breakpoint in question was set through DAP.
I confirmed it's a race condition with the following assert:
bool is_dap = bp.MatchesName(BreakpointBase::GetBreakpointLabel());
assert(is_dap == bp.MatchesName(BreakpointBase::GetBreakpointLabel()));
The problem is that the operation of creating a breakpoint and adding the label are not atomic. Both SB API operations are protected by the target API lock. Here's what's going on:
- The main thread receives a request to create a breakpoint with
setBreakpoints
. This callsSBTarget::BreakpointCreateByLocation
which acquires the target API lock, creates the breakpoint and releases it. - The event thread receives an
eBreakpointEventTypeLocationsResolved
event and prepares to send abreakpoint
event, which involves checking if the breakpoint has the DAP label. This callsSBBreakpoint::MatchesName
which acquires the target API lock, doesn't find the label, and releases it. - Meanwhile on the main thread, we're trying to add the label in
dap::Breakpoint::SetBreakpoint
. This callsSBBreakpoint::AddName
which waits on the API lock that's currently held by the event thread checking if the label is set.
In other words, creating the breakpoint and adding the label are not atomic operations. Individually they're protected by the target API mutex, but that doesn't help us since they're separate calls.
I see two ways to solve this problem:
- Add a mutex to the
DAP
instance to synchronize between the main thread and the events thread. - Expose the target API mutex as part of the SB API. It's a recursive mutex so there's no risk to locking it before calling SB APIs that are protected by it. I would want this to be an RAII object so I would not expose it through SWIG.
I'm leaning towards the second options because I think the use care here matches the idea behind the API lock.