Free-threaded Python toy: pgo_task with threads

I’ll get this discussion started with a little toy people might want to play with. I’ve created an alternative PGO task for Python, using benchmarks from pyperformance. Since it is designed to run all of those benchmarks within a single Python process, it is a handy test-bed to try out multi-threading with free-threaded Python.

The repo for the code is here:

You need to check out the “threads” branch. To run, use:

python3 -m pgo_task -t <N>

If you run with the non-threaded Python, you get essentially no scaling from multiple cores (no surprise). With free-threaded Python, you do get some speedup. It is not too impressive at this point, at least based on my testing: I get something like a 4X speedup with 12 cores.

I haven’t had time yet to dig into what’s keeping it from scaling better. One issue is certainly that any time the random module is used without creating a separate random-state object, it’s going to be a source of contention. An interesting next step would be to use perf record and look at where most of the time is being spent.
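To make that contention concrete, here is a minimal sketch (the names `shared_worker`, `private_worker`, and `run` are mine, not from the pgo_task repo) that compares threads hammering the module-level `random` functions against threads that each own a private `random.Random` instance. On a free-threaded build you would expect the private version to scale noticeably better; on a GIL build the two timings mostly reflect the GIL rather than the lock.

```python
import random
import threading
import time

def shared_worker(n):
    # Every call goes through the single module-level Random instance,
    # whose state is shared (and lock-protected) across all threads.
    for _ in range(n):
        random.random()

def private_worker(n):
    # Each thread owns an independent generator: no shared state to fight over.
    rng = random.Random()
    for _ in range(n):
        rng.random()

def run(worker, n_threads=4, n=200_000):
    threads = [threading.Thread(target=worker, args=(n,))
               for _ in range(n_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

print(f"shared:  {run(shared_worker):.3f}s")
print(f"private: {run(private_worker):.3f}s")
```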

Wombat (Zeke W) April 17, 2025, 4:52pm 2

It’s possible that this isn’t a real world problem and only shows up in a toy benchmark that runs random selections in a loop in multiple threads at the same time.

jamestwebber (James Webber) April 17, 2025, 4:59pm 3

My understanding is that it’s not a great idea to use a global random state with threads in general, because it can violate assumptions about the independence of random numbers generated in each thread[1].

So, even if this isn’t a real-world performance issue, it’d be good practice to have independent PRNG in different threads.


  1. not that this is always important, but people will forget about it when it is ↩︎

nas (Neil Schemenauer) April 17, 2025, 5:00pm 4

Well, it’s a real-world problem if you use the random module and don’t create your own state object for each thread. The random state has locks protecting it, so if you have multiple threads generating random numbers from the same state, it’s not going to scale well. There is a CPython issue somewhere (I think) that suggests implementing a work-around for the case where random.seed() hasn’t been called. This is similar to what Go does to avoid this contention issue. For the pyperformance benchmarks, I think that trick mostly wouldn’t help because they call random.seed().

The general design for a well-performing (and easier to understand) free-threaded program is that each thread operates mostly on its own data. You use queues or similar message passing to move data between threads. Any time you have state that is shared between threads, it will be somewhat sub-optimal. At least, that’s my experience.
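A minimal sketch of that queue-based design (the worker/sentinel shape here is mine, not taken from pgo_task): each worker touches only its own local data, and the two `queue.Queue` objects are the only shared structures, handling their own locking internally.

```python
import queue
import threading

def worker(inbox, outbox):
    # Pull work items, operate on purely local data, push results.
    while True:
        item = inbox.get()
        if item is None:  # sentinel: shut this worker down
            break
        outbox.put(item * item)

inbox, outbox = queue.Queue(), queue.Queue()
workers = [threading.Thread(target=worker, args=(inbox, outbox))
           for _ in range(4)]
for t in workers:
    t.start()
for i in range(10):
    inbox.put(i)
for _ in workers:
    inbox.put(None)  # one sentinel per worker
for t in workers:
    t.join()

results = sorted(outbox.get() for _ in range(10))
print(results)
```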

Wombat (Zeke W) April 17, 2025, 5:34pm 5

I’m just saying that it is only a bottleneck if all you’re doing is generating random numbers in multiple threads and you’re unwilling to create distinct instances. I don’t think anyone even noticed this until a cute threading demo was made that did something like estimating pi with Buffon’s Needle or some such.

When people write actual multithreaded simulations, I think they almost always use separate instances, because it is the only way to make the simulation reproducible and because methods like random.gauss aren’t thread-safe to begin with.

You may be right that this is a real-world problem, but I’m dubious. It will be interesting to see if we find deployed code (not toy demos) that show evidence of the issue. Even then, the likely best solution is to recommend using different instances. For multiprocessing, I think that is already the default (the prng gets reseeded for each fork).

Wombat (Zeke W) April 17, 2025, 6:22pm 6

My understanding, which might be incorrect or out of date, is that CPython uses a very cheap form of optimistic locking. A first access to shared state is guarded by an in-use bit, which is set atomically; only if a second concurrent access occurs will an actual lock be used. The idea is that state locks are very cheap unless there is an actual simultaneous access.

The calls to _random.Random.random or _random.Random.getrandbits have such a guard. Also, those methods are very fast (less than 20 ns on my machine), so they don’t hold the guard bit for very long. In contrast, the pure-Python code for the Random class doesn’t hold the lock, and it dominates the running time (a call to normalvariate takes over 200 ns for me). The time the lock is held will be a small fraction of the running time, so there won’t be much lock contention. That said, it should still be a best practice to create distinct instances.
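For anyone who wants to repeat the measurement, here is a rough timeit sketch of the two per-call costs being compared (the exact numbers will differ per machine and build):

```python
import random
import timeit

rng = random.Random()

# Per-call cost of the C-level method that holds the guard bit briefly...
c_call = timeit.timeit(rng.random, number=1_000_000) / 1_000_000
# ...versus a pure-Python method that does much more work per call.
py_call = timeit.timeit(lambda: rng.normalvariate(0.0, 1.0),
                        number=100_000) / 100_000

print(f"Random.random():        {c_call * 1e9:.0f} ns/call")
print(f"Random.normalvariate(): {py_call * 1e9:.0f} ns/call")
```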

nas (Neil Schemenauer) April 17, 2025, 7:14pm 7

The locking that free-threaded Python uses was heavily influenced by WebKit’s WTF locks (at least, as I understand it; Sam would be the authority). You can read a nice description of how they work here: https://webkit.org/blog/6161/locking-in-webkit/. A lot of those details also apply to CPython. If you want to look at the CPython implementation, I would start by looking at Python/lock.c.

tim.one (Tim Peters) April 17, 2025, 8:22pm 8

PRNGs for multiple threads/processes is a large and involved topic. Python has treated it with “benign neglect”, leaving everything up to the user. numpy has – and appropriately so for it – taken it far more seriously.

At a bare minimum, yes, different threads should absolutely have their own PRNG states. Zero cross-talk/leakage/correlation with other threads’/processes’ PRNG states.

How to seed different threads to ensure this is a large & involved topic of its own, rarely justified beyond hand-waving intuitive probability arguments. An exception is the Philox generator, which builds on crypto theory. It’s one of the methods numpy supports, and it is very popular in massively parallel real-world applications (it’s hard for a user to misuse it, even if they just use consecutive small ints as seeds, and it gives very high quality pseudo-randomness - but it “is slow” unless there’s special HW support for block cipher algorithms).

Best “simple” thing we can do with the Twister is to create a new Random instance per thread and seed it with, say, os.urandom(16). See? It gets complicated right away :wink:. Now we also need a way to tell the user which seed was picked per thread, so they can reproduce the run later.

Let users pick the seeds themselves, and they’re likely to get into subtle troubles. That’s one thing Philox solves. Seeding different threads with 0, 1, 2, 3, … works fine. Does that also work fine with the Twister? Well, who knows? See " hand-waving intuitive probability arguments" above.

When the Twister was first released, seeding with 0, 1, 2, 3, … was an obvious and utter disaster. Not two weeks went by before they released a new version. It runs some rounds of “pseudo-random scrambling” on the initial seed state to try to spray the 1 bits all over the large internal state. That repaired the obvious disasters - but do others remain?

People wrote papers about that, with no “QED” in sight, but as time went on, people in this world moved away from the Twister. Another part of the problem is the Twister’s very large state. The full state for competitive generators can live in 2 cache lines.

Bottom line: the Twister isn’t the PRNG of choice anymore for massively parallel apps, but it’s the one we have, and its massive state isn’t really a problem for “just dozens” of threads. Any sane way of using it in a parallel world, though, requires that each thread have its own copy of the state, preferably seeded with the highest-quality external approximation to “truly random” bits we can get (meaning os.urandom() on most, if not all, boxes).

hugovk (Hugo van Kemenade) April 18, 2025, 4:31am 9