Add fast WaitOnAddress-based thread parker for Windows. by m-ou-se · Pull Request #77618 · rust-lang/rust (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation29 Commits8 Checks0 Files changed

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

This adds a fast futex-based thread parker for Windows. It either uses WaitOnAddress+WakeByAddressSingle or NT Keyed Events (NtWaitForKeyedEvent+NtReleaseKeyedEvent), depending on which is available. Together, this makes this thread parker work for Windows XP and up. Before this change, park()/unpark() did not work on Windows XP: it needs condition variables, which only exist since Windows Vista.

Unfortunately, NT Keyed Events are an undocumented Windows API. However:

This API is relatively simple with obvious behaviour, and there are several (unofficial) articles documenting the details. [1]
parking_lot has been using this API for years (on Windows versions before Windows 8). [2] Many big projects extensively use parking_lot, such as servo and the Rust compiler itself.
It is the underlying API used by Windows SRW locks and Windows critical sections. [3] [4]
The source code of the implementations of Wine, ReactOs, and Windows XP are available and match the expected behaviour.
The main risk with an undocumented API is that it might change in the future. But since we only use it for older versions of Windows, that's not a problem.
Even if these functions do not block or wake as we expect (which is unlikely, see all previous points), this implementation would still be memory safe. The NT Keyed Events API is only used to sleep/block in the right place.

[1]: http://www.locklessinc.com/articles/keyed_events/
[2]: Amanieu/parking_lot@43abbc964e
[3]: https://docs.microsoft.com/en-us/archive/msdn-magazine/2012/november/windows-with-c-the-evolution-of-synchronization-in-windows-and-c
[4]: Windows Internals, Part 1, ISBN 9780735671300

The choice of fallback API is inspired by parking_lot(_core), but the implementation of this thread parker is different. While parking_lot has no use for a fast path (park() directly returning if unpark() was already called), this implementation has a fast path that returns without even checking which waiting/waking API to use, as the same atomic variable with compatible states is used in all cases.

Clarify memory ordering and spurious wakeups.

r? @kennytm

(rust_highfive has picked a reviewer for you, use r? to override)

@rustbot modify labels: +T-libs +O-windows +A-concurrency

@retep998 If I understood correctly, you're the right person to ask about Windows APIs in Rust?

AFAICT this looks correct. Note that an acquire operation only synchronizes with the most recent release operation on a particular atomic variable (see LLVM Lang spec), so two unparks followed by a park all on different threads would only have the park thread synchronize with whichever of the two unparks came later, making reading any data written by the earlier unparking thread technically a data race. This is technically allowed by park's documentation, but could be unexpected.

I really cannot condone the use of undocumented APIs. I would highly recommend dropping support for Windows XP. There are so many great concurrency APIs that are available starting with Windows 7. This would dramatically simplify the code and avoid the need to dynamically load anything.

I did however want to comment on the compat_fn macro - I was curious how that was implemented. It uses GetModuleHandleW to find the module to pass to GetProcAddress when it should instead be using LoadLibraryW. Using GetModuleHandleW assumes the module has already been loaded by the calling process. That may be a safe bet for certain functions that you want to “delay load” but certainly not all. LoadLibraryW will just go the extra mile of also loading the module if it has not already been loaded. There's also no need to call FreeLibrary for such APIs so it should be a pretty simple fix.

I really cannot condone the use of undocumented APIs. I would highly recommend dropping support for Windows XP. There are so many great concurrency APIs that are available starting with Windows 7. This would dramatically simplify the code and avoid the need to dynamically load anything.

Windows 7 does not have anything futex-like (like WaitOnAddress for Windows 8+) though.

Do you think there's a significant difference between fully dropping support for XP (and Vista, etc.) versus a best-effort implementation using this undocumented API? Lots of Rust software already uses this API, including Rustc itself, through parking_lot.

I did however want to comment on the compat_fn macro - I was curious how that was implemented. It uses GetModuleHandleW to find the module to pass to GetProcAddress when it should instead be using LoadLibraryW. Using GetModuleHandleW assumes the module has already been loaded by the calling process. That may be a safe bet for certain functions that you want to “delay load” but certainly not all. LoadLibraryW will just go the extra mile of also loading the module if it has not already been loaded. There's also no need to call FreeLibrary for such APIs so it should be a pretty simple fix.

Thanks! Opened #78444 for that.

I think WaitOnAddress and friends are some of the few that are new to Windows 8. Most of the slim locks, condition variables, thread pool primitives - all of which are incredibly fast and powerful - are available on Windows 7 (some Vista) and later. Those are indispensable for high-performance concurrency and completely missing from Windows XP.

I haven't looked too closely at the Rust thread parker requirements, but why not you just use an event e.g. CreateEventW/SetEvent/WaitForSingleObject? Events may sound "heavyweight" but they are very efficient in practice (and they've been around forever).

@oldnewthing may also have some suggestions for WaitOnAddress alternatives.

For Vista+ there's SlimRW/ConditionVariable? I think they're basically wrappers around keyed events anyway, no?

That's correct. I wrote about that here.

For Vista+ there's SlimRW/ConditionVariable? I think they're basically wrappers around keyed events anyway, no?

Yeah, that's what Rust's Mutex and Condvar use when available.

WaitOnAddress/keyed events are be a much nicer building block, especially for things that are not exactly mutexes or condition variables, because it allows inlining the atomic compare-exchange fast path and only requires a call to the API when waiting is necessary. A thread parker based on a condition variable also needs a mutex and a separate state variable too, making it less efficient and more complicated than just WaitOnAddress on a single byte.

The std code does not know at compile time on which Windows version it is going to run, so if it has an implementation based on WaitOnAddress, it still needs a fallback that works on Windows 7. The keyed events api as a fallback fits nicely, and doesn't even require checking availablility of the new api in the fast path (as in this PR). Any other fallback will need this check on every park()/unpark() (just like Mutex already does to decide between the slim lock and a critical section), or always use the older API.

Replacing keyed events with the classic CreateEvent/SetEvent/WaitForSingleObject is fairly straightforward. I haven't tried to compile this but the general idea is hopefully clear https://gist.github.com/a40c879e6bdcfe3ca4b67948302dd337 (though note there is some error handling missing there).

Edit: Updated for per-parker event handles

Replacing keyed events with the classic CreateEvent/SetEvent/WaitForSingleObject is fairly straightforward. I haven't tried to compile this but the general idea is hopefully clear https://gist.github.com/a40c879e6bdcfe3ca4b67948302dd337 (though note there is some error handling missing there).

Uh, that uses a single event object for all thread parkers, without using anything as key. How would that work? How would unpark() cause the right park()'ed thread to wake up?

Replacing keyed events with the classic CreateEvent/SetEvent/WaitForSingleObject is fairly straightforward. I haven't tried to compile this but the general idea is hopefully clear https://gist.github.com/a40c879e6bdcfe3ca4b67948302dd337 (though note there is some error handling missing there).

Uh, that uses a single event object for all thread parkers, without using anything as key. How would that work? How would unpark() cause the right park()'ed thread to wake up?

Ah, true. Hmm. What's the cost of an extra Handle in Parker? Not quite as slim as the futex impls but still better than the generic one.

To summarise some assorted thoughts based on discussions here and elsewhere:

One of the issues with the current compatibility shim is that it hard codes legacy dll names. Loading of the new sync APIs doesn't have this issue as it uses the API set ("api-ms-win-core-synch-l1-2-0").

Another issue is that loading a module restricts where Rust code can be used. However, if this dynamic loading becomes limited to park/unpark then it at least reduces the chances of this being an issue in practice (though park is used by sync::once).

Using undocumented ntdll functions is more controversial. Perhaps it would help if this was gated by an explicit runtime check of the OS version? I mean, people still aren't going to be comfortable with this but at least it would be explicitly limited.

These issues may be solved more satisfactorily in the future by either a target_os_version cfg option or by more fine grained separation of targets. Either would allow for different implementations at compile time. But until then there doesn't seem to be a way to square this circle without using undocumented APIs?

They're not really legacy DLL names - they're the actual DLL names. The API sets are just forwarders but the OS loader can figure it out either way. The actual DLL names are preferred IMO because they work down-level before API sets were introduced. The official onecore and onecoreuap libs we use internally and provide through the SDK redirect to the actual DLL names where possible.

Dynamic loading is a normal part of Windows programming. Version checks on the other hand are extremely bad for reliability and compat. I'd sooner recommend the undocumented APIs than have you start checking version numbers. 😉

Ah, elsewhere it was mentioned that hard coding dll names like "kernel32" made things more difficult in the lab test when the OS being used doesn't have kernel32.

I am not confident enough about the tradeoffs here — it would be good to get some additional sets of critical eyes on this implementation from the libs team.

@rust-lang/libs
@rfcbot fcp merge

Team member @dtolnay has proposed to merge this. The next step is review by the rest of the tagged team members:

No concerns currently listed.

Once a majority of reviewers approve (and at most 2 approvals are outstanding), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

See this document for info about what commands tagged team members can give me.

I like the fact that this matches Wine and ReactOS. If we're doing what they're doing, then I think we're in good company.

However, I am a bit skeptical of adding support for Windows XP where it seems like none existed before? I don't understand everything involved here, but it sounds like the XP support is incidental? That is, even if we didn't endeavor to add XP support, we would still get it by virtue of supporting Windows 7. Am I understanding that correctly?

it sounds like the XP support is incidental? That is, even if we didn't endeavor to add XP support, we would still get it by virtue of supporting Windows 7. Am I understanding that correctly?

Yes.

I don't think an FCP is needed here since this doesn't change any user-visible APIs. I reviewed both code paths and they look correct to me, so I'm just going to merge this as it is.

@bors r+

📌 Commit 03fb61c has been approved by Amanieu

bors added S-waiting-on-bors

Status: Waiting on bors to run and complete tests. Bors will change the label on completion.

and removed S-waiting-on-review

Status: Awaiting review from the assignee but also interested parties.

labels

Dec 14, 2020

m-ou-se changed the title~~Add fast futex-based thread parker for Windows.~~ Add fast WaitOnAddress-based thread parker for Windows.

Aug 26, 2021

fn keyed_event_handle() -> c::HANDLE {
const INVALID: usize = !0;
static HANDLE: AtomicUsize = AtomicUsize::new(INVALID);
match HANDLE.load(Relaxed) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May need use atomic init_once to guard this, stop calling NtCreateKeyedEvent multiple times

Labels

A-concurrency

Area: Concurrency

merged-by-bors

This PR was explicitly merged by bors.

O-windows

Operating system: Windows

S-waiting-on-bors

Status: Waiting on bors to run and complete tests. Bors will change the label on completion.

T-libs-api

Relevant to the library API team, which will review and decide on the PR/issue.