fix for mutex slim being released too early during flush operation by arsnyder16 · Pull Request #1585 · StackExchange/StackExchange.Redis
Conversation
Fixes #1438
When flushing during FlushSync we specify a timeout on the flush Task, but when that timeout is hit the underlying task is still running. The thread that invoked FlushSync continues and releases the MutexSlim protecting the I/O pipe, allowing another thread to acquire the MutexSlim and run operations on the pipe while the flush is still in flight.
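A minimal sketch of the race (not the library's actual code; SemaphoreSlim and SlowFlushAsync stand in for MutexSlim and the real pipe flush). The key point is that timing out the *wait* does not stop the *work*:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class FlushRaceSketch
{
    static async Task Main()
    {
        var gate = new SemaphoreSlim(1, 1); // stand-in for MutexSlim

        await gate.WaitAsync();
        Task flush = SlowFlushAsync();      // stand-in for PipeWriter.FlushAsync()

        // Task.WhenAny only stops *waiting* after 100 ms; SlowFlushAsync is
        // still running and still touching the pipe.
        if (await Task.WhenAny(flush, Task.Delay(100)) != flush)
            Console.WriteLine("timed out, but the flush is still in flight");

        gate.Release(); // BUG: another thread can now take the lock and
                        // operate on the pipe concurrently with the flush.
    }

    static async Task SlowFlushAsync()
    {
        await Task.Delay(1000); // simulate a slow network flush
        Console.WriteLine("flush finished after the lock was released");
    }
}
```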
The fix is to use the cancellation token that can be passed to PipeWriter.FlushAsync.
An alternative would be to not specify a timeout at all. There are two additional locations that flush the pipe, and neither of them specifies a timeout; they simply await the Task. Generally the timeouts are driven by how long an operation waits to acquire the MutexSlim.
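A rough sketch of the approach (the helper name and timeout handling are illustrative, not the PR's exact code): by handing the token to PipeWriter.FlushAsync, the timeout cancels the flush itself, so the awaiting thread only resumes once the flush has actually finished or been cancelled, and releasing the mutex afterwards is safe:

```csharp
using System;
using System.IO.Pipelines;
using System.Threading;
using System.Threading.Tasks;

static class FlushFixSketch
{
    public static async Task FlushWithTimeoutAsync(PipeWriter writer, int timeoutMs)
    {
        // The CTS cancels itself after timeoutMs.
        using var cts = new CancellationTokenSource(timeoutMs);
        try
        {
            // If the timeout fires, FlushAsync observes the token and stops,
            // rather than continuing to run unobserved in the background.
            await writer.FlushAsync(cts.Token);
        }
        catch (OperationCanceledException)
        {
            // Surface as a timeout in whatever way the caller expects.
            throw new TimeoutException("Flush did not complete within the timeout.");
        }
    }
}
```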
Looks like the CI checks are failing for some unrelated issue; the other PRs seem to have the same problem.
Normally, time constraints are limiting how much I can get done on PRs etc, but this is very interesting. I'm going to carve out some of tomorrow morning to look at this, because if you're right (and I'm assuming you are, haven't looked deep at the code yet) it could be really useful! Thanks.
Thanks, I would be interested in hearing your thoughts, since I am not overly familiar with the library. Feel free to adjust it in another way; this seemed to make the most sense to me.
I can't speak to the exact fix, but the described behavior matches exactly the symptoms we've seen provoke the issue.
Okay, I've walked through the precise fixes. Normally I'd like to see a continuation bound for the on-cancel path, but that's not really the model or style of the lib, I think?
I agree the timeout can cause the lock to be released (incorrectly) while the task is actually making progress in the background, so I think this change is correct.
Merging main in here for updated build checks
Do we have a timeline for accepting the PR? This issue continues to block us from rolling our version forward.
Sorry, I had planned to already look at it, but things got ... busy. I hope to look tomorrow (although I appreciate that I've said that before)
No worries, we're in a stable-ish state on an older version, and life is pretty complicated for all of us right now.
I concur with the problem, but I'm very dubious of allocating a CTS with a scheduled timeout on every call; I wonder instead if we could allocate and reuse a CTS, and only schedule the timeout (dooming it for reuse) when we get a non-sync response; something like this: e1e7189
(I recommend viewing without whitespace deltas: e1e7189?w=1)
Thoughts? If you agree that this minor tweak makes sense, feel free to cherry-pick or merge it into your branch, and we can get this merged.
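A sketch of the reuse idea, assumptions mine rather than the exact commit (the class and field names are hypothetical): keep one CTS around, hand its token to FlushAsync, and only call CancelAfter — which dooms that CTS for reuse — when the flush doesn't complete synchronously:

```csharp
using System.IO.Pipelines;
using System.Threading;
using System.Threading.Tasks;

class FlushHelper
{
    private CancellationTokenSource? _reusableCts;

    public async ValueTask FlushWithTimeoutAsync(PipeWriter writer, int timeoutMs)
    {
        var cts = _reusableCts ??= new CancellationTokenSource();
        ValueTask<FlushResult> flush = writer.FlushAsync(cts.Token);

        if (flush.IsCompletedSuccessfully)
        {
            flush.GetAwaiter().GetResult(); // sync completion: no timer was
            return;                         // scheduled, so the CTS stays reusable
        }

        // Async path: schedule the timeout now. Once CancelAfter is called
        // this CTS can't be trusted again, so drop it for a fresh one next time.
        _reusableCts = null;
        cts.CancelAfter(timeoutMs);
        try
        {
            await flush;
        }
        finally
        {
            cts.Dispose();
        }
    }
}
```

The appeal of this shape is that the common case (a flush that completes synchronously) allocates nothing and schedules no timer; the CTS cost is only paid on the slow path.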
A very nuanced spot - congrats for finding this and offering a fix.
Merged with the CTS tweak as proposed.