[lldb-dap] Race condition in the event handler thread · Issue #131242 · llvm/llvm-project (original) (raw)

I discovered that TestDAP_breakpointEvents.py (macOS only) was failing nondeterministically. I tracked the problem down to the event handler thread (EventThreadFunction) where the check that a breakpoint has the DAP label would sometimes return false, even though the breakpoint in question was set through DAP.

I confirmed it's a race condition with the following assert:

bool is_dap = bp.MatchesName(BreakpointBase::GetBreakpointLabel());
assert(is_dap == bp.MatchesName(BreakpointBase::GetBreakpointLabel()));

The problem is that the operation of creating a breakpoint and adding the label are not atomic. Both SB API operations are protected by the target API lock. Here's what's going on:

  1. The main thread receives a request to create a breakpoint with setBreakpoints. This calls SBTarget::BreakpointCreateByLocation which acquires the target API lock, creates the breakpoint and releases it.
  2. The event thread receives an eBreakpointEventTypeLocationsResolved event and prepares to send a breakpoint event, which involves checking if the breakpoint has the DAP label. This calls SBBreakpoint::MatchesName which acquires the target API lock, doesn't find the label, and releases it.
  3. Meanwhile on the main thread, we're trying to add the label in dap::Breakpoint::SetBreakpoint. This calls SBBreakpoint::AddName which waits on the API lock that's currently held by the event thread checking if the label is set.

In other words, creating the breakpoint and adding the label are not atomic operations. Individually they're protected by the target API mutex, but that doesn't help us since they're separate calls.

I see two ways to solve this problem:

  1. Add a mutex to the DAP instance to synchronize between the main thread and the events thread.
  2. Expose the target API mutex as part of the SB API. It's a recursive mutex so there's no risk to locking it before calling SB APIs that are protected by it. I would want this to be an RAII object so I would not expose it through SWIG.

I'm leaning towards the second options because I think the use care here matches the idea behind the API lock.