[ty] Abort process if worker thread panics by MichaReiser · Pull Request #18211 · astral-sh/ruff (original) (raw)

Conversation

Summary

This PR changes how ty's LSP handles panics in background worker threads.

Today, a panic in the worker thread pool gets logged (with tracing) but it tears down the worker thread on which the background task ran.
The panic will only get surfaced once the thread pool shuts down (when JoinHandle::join is called). Eventually, job_sender.send panics
because the thread pool ran out of worker threads and the job queue overflows.

This PR aligns the behavior with rayon by aborting the entire process when any background task unexpectedly panics.
My reasoning is that containing errors shouldn't be the responsibility of the thread pool. Instead, the
request dispatching should be wrapped in a catch_unwind and handle any potential recovery there. This also
reveals that we don't have the same recovery for tasks running locally (on the main thread).

I plan on adding such recovery in the server dispatch logic as a follow up (which also adds retry logic).

Test Plan

I added a panic to the hover request handler and it aborted the process (which VS code then restarts up to 5 times).


// Channel buffer capacity is between 2 and 4, depending on the pool size.
let (job_sender, job_receiver) = crossbeam::channel::bounded(std::cmp::min(threads * 2, 4));
let (job_sender, job_receiver) = crossbeam::channel::bounded(std::cmp::max(threads * 2, 4));

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main motivation of the limit is to apply some form of back pressure. However, limiting the queue to 4 on e.g. a 12 core system feels overly strict because it means we'll drop messages as soon as 4 out of 12 threads have one message queued. We should at least allow a backlog of 2 tasks per thread.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my own edification, can you say more about the relationship between the channel buffer size and dropping messages? Does that mean that if a channel send would block (i.e., there's no receiving ready and waiting to synchronize) then that message is ~~blocked~~ dropped?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. I was wrong here.

It's a crossbeam bounded channel that starts blocking the sender if the thread pool falls behind.

I thought that this wouldn't be the case because the version on main starts to fail with Disconnected sender when all threads panicked (which drops all channel receivers). Let me revert this change.

`mypy_primer` results

No ecosystem changes detected ✅

`ruff-ecosystem` results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me!


// Channel buffer capacity is between 2 and 4, depending on the pool size.
let (job_sender, job_receiver) = crossbeam::channel::bounded(std::cmp::min(threads * 2, 4));
let (job_sender, job_receiver) = crossbeam::channel::bounded(std::cmp::max(threads * 2, 4));

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Today, a panic in the worker thread pool gets logged (with tracing) but it tears down the worker thread on which the background task ran.
The panic will only get surfaced once the thread pool shuts down (when JoinHandle::join is called). Eventually, job_sender.send panics
because the thread pool ran out of worker threads and the job queue overflows.

I'm a bit unsure of what this means in practice specifically the "panic will only get surfaced once the thread pool shuts down". I tried adding a panic to the hover handler on main and it does surface the panic in the logs. Or, am I misunderstanding?

I added a panic to the hover request handler and it aborted the process (which VS code then restarts up to 5 times).

I'm still not sure why should we abort the process if there's a panic in a specific handler. Wouldn't that degrade the user experience? Like, today even if there's a panic the server keeps running and users can keep using other capabilities.

Is there a way to handle it gracefully? I might need to spend some time understanding the scheduler but I don't want to block this PR for that. Happy to go ahead with this.

I'm a bit unsure of what this means in practice specifically the "panic will only get surfaced once the thread pool shuts down". I tried adding a panic to the hover handler on main and it does surface the panic in the logs. Or, am I misunderstanding?

Thanks to our global panic handler, it does surface the panic in the logs, but it also aborts the thread and we'll eventually run out. The eror value of the panic will not be dropped until we join the threads (which can be problematic if it needs to release any resources).

For that reason, I think it's the right decision to abort the process.

I'm still not sure why should we abort the process if there's a panic in a specific handler. Wouldn't that degrade the user experience? Like, today even if there's a panic the server keeps running and users can keep using other capabilities.

Sort of. It works for as long as there are still enough worker threads. Ruff/ty will abort once all threads are used up. But I agree that the experience is worse. I plan to add specific catch_panic handlers to the request / notification handlers which will give us the old behavior (except that we never run out of threads). The last step is then to also implement a retry logic if a thread unwinds due to a salsa::Cancelled, which also needs the catch_unwind in the request handler.

This is actually a more sever problem than I thought. Threads panicking has the result that the server never responds to that client request. The client might decide to not send any new request for the same method and parameters because there's already a pending request.

I think we should backport my changes to ruff

Thank you for the explanation. I think that makes sense and we should do the same for Ruff as well, it might be useful to do that after your planned follow-up work?

I plan to back port all changes of this stack to ruff

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

Labels

server

Related to the LSP server

Multi-file analysis & type inference

[ty] Abort process if worker thread panics by MichaReiser · Pull Request #18211 · astral-sh/ruff (original) (raw)

Conversation

Summary

Test Plan

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mypy_primer results

ruff-ecosystem results

Linter (stable)

Linter (preview)

Formatter (stable)

Formatter (preview)

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Labels

`mypy_primer` results

`ruff-ecosystem` results