[Python-Dev] Reworking the GIL (original) (raw)
Antoine Pitrou solipsis at pitrou.net
Sun Oct 25 21:22:20 CET 2009
- Previous message: [Python-Dev] Buildbot alternate renderings
- Next message: [Python-Dev] Reworking the GIL
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Hello there,
The last couple of days I've been working on an experimental rewrite of the GIL. Since the work has been turning out rather successful (or, at least, not totally useless and crashing!) I thought I'd announce it here.
First I want to stress this is not about removing the GIL. There still is a Global Interpreter Lock which serializes access to most parts of the interpreter. These protected parts haven't changed either, so Python doesn't become really better at extracting computational parallelism out of several cores.
Goals
The new GIL (which is also the name of the sandbox area I've committed it in, "newgil") addresses the following issues :
- Switching by opcode counting. Counting opcodes is a very crude way of estimating times, since the time spent executing a single opcode can very wildly. Litterally, an opcode can be as short as a handful of nanoseconds (think something like "... is not None") or as long as a fraction of second, or even longer (think calling a heavy non-GIL releasing C function, such as re.search()). Therefore, releasing the GIL every 100 opcodes, regardless of their length, is a very poor policy.
The new GIL does away with this by ditching _Py_Ticker entirely and instead using a fixed interval (by default 5 milliseconds, but settable) after which we ask the main thread to release the GIL and let another thread be scheduled.
- GIL overhead and efficiency in contended situations. Apparently, some OSes (OS X mainly) have problems with lock performance when the lock is already taken: the system calls are heavy. This is the "Dave Beazley effect", where he took a very trivial loop, therefore made of very short opcodes and therefore releasing the GIL very often (probably 100000 times a second), and runs it in one or two threads on an OS with poor lock performance (OS X). He sees a 50% increase in runtime when using two threads rather than one, in what is admittedly a pathological case.
Even on better platforms such as Linux, eliminating the overhead of many GIL acquires and releases (since the new GIL is released on a fixed time basis rather than on an opcode counting basis) yields slightly better performance (read: a smaller performance degradation :-)) when there are several pure Python computation threads running.
- Thread switching latency. The traditional scheme merely releases the GIL for a couple of CPU cycles, and reacquires it immediately. Unfortunately, this doesn't mean the OS will automatically switch to another, GIL-awaiting thread. In many situations, the same thread will continue running. This, with the opcode counting scheme, is the reason why some people have been complaining about latency problems when an I/O thread competes with a computational thread (the I/O thread wouldn't be scheduled right away when e.g. a packet arrives; or rather, it would be scheduled by the OS, but unscheduled immediately when trying to acquire the GIL, and it would be scheduled again only much later).
The new GIL improves on this by combinating two mechanisms:
- forced thread switching, which means that when the switching interval is terminated (mentioned in 1) and the GIL is released, we will force any of the threads waiting on the GIL to be scheduled instead of the formerly GIL-holding thread. Which thread exactly is an OS decision, however: the goal here is not to have our own scheduler (this could be discussed but I wanted the design to remain simple :-) After all, man-years of work have been invested in scheduling algorithms by kernel programming teams).
- priority requests, which is an option for a thread requesting the GIL to be scheduled as soon as possible, and forcibly (rather than any other threads). This is meant to be used by GIL-releasing methods such as read() on files and sockets. The scheme, again, is very simple: when a priority request is done by a thread, the GIL is released as soon as possible by the thread holding it (including in the eval loop), and then the thread making the priority request is forcibly scheduled (by making all other GIL-awaiting threads wait in the meantime).
Implementation
The new GIL is implemented using a couple of mutexes and condition
variables. A {mutex, condition} pair is used to protect the GIL itself,
which is a mere variable named gil_locked
(there are a couple of other
variables for bookkeeping). Another {mutex, condition} pair is used for
forced thread switching (described above). Finally, a separate mutex is
used for priority requests (described above).
The code is in the sandbox: http://svn.python.org/view/sandbox/trunk/newgil/
The file of interest is Python/ceval_gil.h. Changes in other files are very minimal, except for priority requests which have been added at strategic places (some methods of I/O modules). Also, the code remains rather short, while of course being less trivial than the old one.
NB : this is a branch of py3k. There should be no real difficulty porting it back to trunk, provided someone wants to do the job.
Platforms
I've implemented the new GIL for POSIX and Windows (tested under Linux and Windows XP (running in a VM)). Judging by what I can read in the online MSDN docs, the Windows support should include everything from Windows 2000, and probably recent versions of Windows CE.
Other platforms aren't implemented, because I don't have access to the necessary hardware. Besides, I must admit I'm not very motivated in working on niche/obsolete systems. I've e-mailed Andrew MacIntyre in private to ask him if he'd like to do the OS/2 support.
Supporting a new platform is not very difficult: it's a matter of writing the 50-or-so lines of necessary platform-specific macros at the beginning of Python/ceval_gil.h.
The reason I couldn't use the existing thread support (Python/thread_*.h) is that these abstractions are too poor. Mainly, they don't provide:
- events, conditions or an equivalent thereof
- the ability to acquire a resource with a timeout
Measurements
Before starting this work, I wrote ccbench (*), a little benchmark script ("ccbench" being a shorthand for "concurrency benchmark") which measures two things:
- computation throughput with one or several concurrent threads
- latency to external events (I use an UDP socket) when there is zero, one, or several background computation threads running
(*) http://svn.python.org/view/sandbox/trunk/ccbench/
The benchmark involves several computation workloads with different GIL characteristics. By default there are 3 of them: A- one pure Python workload (computation of a number of digits of pi): that is, something which spends its time in the eval loop B- one mostly C workload where the C implementation doesn't release the GIL (regular expression matching) C- one mostly C workload where the implementation does release the GIL (bz2 compression)
In the ccbench directory you will find benchmark results, under Linux, for two different systems I have here. The new GIL shows roughly similar but slightly better throughput results than the old one. And it is much better in the latency tests, especially in workload B (going down from almost a second of average latency with the old GIL, to a couple of milliseconds with the new GIL). This is the combined result of using a time-based scheme (rather than opcode-based) and of forced thread switching (rather than relying on the OS to actually switch threads when we speculatively release the GIL).
As a sidenote, I might mention that single-threaded performance is not degraded at all. It is, actually, theoretically a bit better because the old ticker check in the eval loop becomes simpler; however, this goes mostly unnoticed.
Now what remains to be done?
Having other people test it would be fine. Even better if you have an actual multi-threaded py3k application. But ccbench results for other OSes would be nice too :-) (I get good results under the Windows XP VM but I feel that a VM is not an ideal setup for a concurrency benchmark)
Of course, studying and reviewing the code is welcome. As for integrating it into the mainline py3k branch, I guess we have to answer these questions:
- is the approach interesting? (we could decide that it's just not worth it, and that a good GIL can only be a dead (removed) GIL)
- is the patch good, mature and debugged enough?
- how do we deal with the unsupported platforms (POSIX and Windows support should cover most bases, but the fate of OS/2 support depends on Andrew)?
Regards
Antoine.
- Previous message: [Python-Dev] Buildbot alternate renderings
- Next message: [Python-Dev] Reworking the GIL
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]