The lightest lightweight threads, Protothreads

Last week, I used the Lwt cooperative lightweight thread library to implement a benchmark that measures context-switch performance, determined that it was GC-bound, and timed it against comparable programs (i.e., the fastest implementations from the Computer Language Benchmarks Game, which are all based on lightweight threads) and a C version that uses POSIX threads, obtaining these results:

implementation                          memory usage   time (s)
Haskell GHC 6.8.2                       2680KB         1.22
OCaml ocamlopt (1024Kword minor heap)   5178KB         1.85
Haskell GHC 6.8.2 -threaded, -N1        2760KB         1.9
Erlang R12B-1 HiPE                      5996KB         3.96
Mozart/Oz                               3788KB         4.10
OCaml ocamlopt (32Kword minor heap)     970KB          4.24
Haskell GHC 6.8.2 -threaded, -N2        3300KB         15.27
GCC C (POSIX threads)                   4520KB         28.7

Here are the figures I get for the C version I made with Protothreads:

GCC C (Protothreads, optimum scheduling)    220KB   0.076s
GCC C (Protothreads, pessimum scheduling)   220KB   18.6s

(Read below for an explanation of the difference between the optimum and pessimum scheduling policies.)

Addendum: this post wouldn't exist had rektide not asked for these results.

It is nearly 400 times faster than the C version with POSIX threads, and more than an order of magnitude faster than the other lightweight thread implementations. It also needs less memory. The performance is almost unbelievable. Where's the catch, ugly code maybe? Judge for yourself; here are the Protothreads and the POSIX threads C implementations head to head:

/* Protothreads, 0.076s */

#include <stdio.h>
#include <stdlib.h>
#include "pt.h"
#include "pt-sem.h"

#define NUM_THREADS 503

static struct pt thread_context[NUM_THREADS];
static int data[NUM_THREADS];
static struct pt_sem mutex[NUM_THREADS];

static
PT_THREAD(thread(struct pt *pt, int id))
{
  static int token;
  static int r;

  PT_BEGIN(pt);
  while(1) {
    PT_SEM_WAIT(pt, &mutex[id]);
    token = data[id];
    r = (id + 1) % NUM_THREADS;
    if(token) {
      data[r] = token - 1;
      PT_SEM_SIGNAL(pt, &mutex[r]);
    } else {
      printf("%d\n", id + 1);
      exit(0);
    }
  }
  PT_END(pt);
}

int main(int argc, char *argv[])
{
  int i;

  if(argc != 2) exit(255);
  data[0] = atoi(argv[1]);

  for(i = 0; i < NUM_THREADS; i++) {
    PT_SEM_INIT(&mutex[i], 0);
    PT_INIT(&thread_context[i]);
  }

  PT_SEM_INIT(&mutex[0], 1);

  while(1) {
    for(i = 0; i < NUM_THREADS; i++)
      thread(&thread_context[i], i);
  }
}

/* POSIX threads, 28.7s */

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <limits.h>

#define THREADS 503

struct stack {
  char x[PTHREAD_STACK_MIN];
};

static pthread_mutex_t mutex[THREADS];
static int data[THREADS];
static struct stack stacks[THREADS];

static void *
thread(void *num)
{
  int l = (int)num;
  int r = (l+1) % THREADS;
  int token;

  while(1) {
    pthread_mutex_lock(mutex + l);
    token = data[l];
    if (token) {
      data[r] = token - 1;
      pthread_mutex_unlock(mutex + r);
    } else {
      printf("%i\n", l+1);
      exit(0);
    }
  }
}

int main(int argc, char **argv)
{
  int i;
  pthread_t cthread;
  pthread_attr_t stack_attr;

  if (argc != 2) exit(255);
  data[0] = atoi(argv[1]);

  pthread_attr_init(&stack_attr);

  for (i = 0; i < THREADS; i++) {
    pthread_mutex_init(mutex + i, NULL);
    pthread_mutex_lock(mutex + i);
    pthread_attr_setstack(&stack_attr, &stacks[i],
                          sizeof(struct stack));
    pthread_create(&cthread, &stack_attr, thread, (void*)i);
  }

  pthread_mutex_unlock(mutex + 0);
  pthread_join(cthread, NULL);
}
Even though the Protothreads code is similar to the one with pthreads, there are two important differences:

  1. local variables that must survive a blocking call (token and r) are declared static, because a protothread's automatic variables are not preserved across context switches
  2. there is no system scheduler: main drives the threads itself, repeatedly calling each thread function with its saved context

These dissimilarities give some insight into the nature of Protothreads. They are little more than small state machines whose state is stored in an opaque pt structure. Protothreads' implementation consists of a few preprocessor macros in headers, so the best way to see how they work is to examine the output of the preprocessor (gcc -E, slightly abridged and reformatted here):

typedef unsigned short lc_t;

struct pt { lc_t lc; };
struct pt_sem { unsigned int count; };

static struct pt thread_context[503];
static int data[503];
static struct pt_sem mutex[503];

static char thread(struct pt *pt, int id)
{
  static int token;
  static int r;
  {
    char PT_YIELD_FLAG = 1;
    switch((pt)->lc) {
    case 0:
      while(1) {
        do {
          do {
            (pt)->lc = 20;
    case 20:;
            if(!((&mutex[id])->count > 0)) {
              return 0;
            }
          } while(0);
          --(&mutex[id])->count;
        } while(0);
        token = data[id];
        r = (id + 1) % 503;
        if(token) {
          data[r] = token - 1;
          ++(&mutex[r])->count;
        } else {
          printf("%d\n", id + 1);
          exit(0);
        }
      }
    };
    PT_YIELD_FLAG = 0;
    (pt)->lc = 0;
    return 3;
  };
}

int main(int argc, char *argv[])
{
  int i;

  if(argc != 2) exit(255);
  data[0] = atoi(argv[1]);

  for(i = 0; i < 503; i++) {
    (&mutex[i])->count = 0;
    (&thread_context[i])->lc = 0;
  }

  (&mutex[0])->count = 1;

  while(1) {
    for(i = 0; i < 503; i++)
      thread(&thread_context[i], i);
  }
}

Protothreads use a little-known trick: switch statements do not have to be structured, in the sense that they can jump directly into the body of nested statements. For instance, you can place a case label in the body of an if statement.
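
A contrived but legal illustration (hypothetical code, not from Protothreads): calling demo(1) jumps straight into the body of the if, bypassing its never-true condition:

#include <stdio.h>

static void demo(int state)
{
  switch(state) {
  case 0:
    printf("entered at the top\n");
    if(0) {    /* the condition is never true...  */
  case 1:      /* ...but switch can jump in here  */
      printf("resumed in the middle\n");
    }
  }
}

int main(void)
{
  demo(0);   /* prints "entered at the top" */
  demo(1);   /* prints "resumed in the middle" */
  return 0;
}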

Essentially, switch is being used to perform arbitrary gotos determined by the value of the pt thread state. This sounds very much like computed gotos, and, indeed, there's another implementation of Protothreads' LC (as in "Local Continuations") core based on that GCC extension. (I have to confess that I have never used switch in such an "unstructured" way; I've employed computed gotos, though.)
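
The computed-goto variant stores a label address instead of a line number. Here is a simplified sketch, assuming GCC's &&label and goto * extensions (the real lc-addrlabels.h differs in detail):

/* Simplified sketch of computed-goto local continuations. */
typedef void *lc_t;

#define LC_INIT(s)       (s) = NULL
#define LC_RESUME(s)     do { if((s) != NULL) goto *(s); } while(0)
#define LC_CONCAT2(a, b) a##b
#define LC_CONCAT(a, b)  LC_CONCAT2(a, b)
#define LC_SET(s)                               \
  do {                                          \
    LC_CONCAT(LC_LABEL, __LINE__):              \
    (s) = &&LC_CONCAT(LC_LABEL, __LINE__);      \
  } while(0)
#define LC_END(s)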

Context switch points are marked using macros like PT_SEM_WAIT, which expand into code that:

  1. sets the thread state to a value representing the current instruction pointer
  2. adds a label for that value

The generated label allows the protothread to jump to the saved IP. The thread state is encoded in an unsigned short and takes only 2 bytes; there is no thread-local state by default, since local variables aren't restored across runs.
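
In the switch-based implementation, those two steps are performed by the LC macros, which look essentially like this (slightly simplified from lc-switch.h; the saved state is just the line number of the suspension point):

typedef unsigned short lc_t;

#define LC_INIT(s)   (s) = 0
#define LC_RESUME(s) switch(s) { case 0:
#define LC_SET(s)    (s) = __LINE__; case __LINE__:
#define LC_END(s)    }

PT_WAIT_UNTIL then wraps LC_SET in a conditional return, which is the if(!(...)) { return 0; } visible in the preprocessor output above.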

Scheduling

Threads are scheduled by calling the thread function with the appropriate thread state. Protothreads blocked in PT_SEM_WAIT will quickly return without doing anything. The reason Protothreads is so good at the threadring benchmark is that the threads are scheduled optimally, with no scheduling overhead: in threadring, the Nth thread unblocks the (N+1)th one, which matches exactly the order in which the threads are executed in the above code. This means that PT_SEM_WAIT will never actually "block" (i.e., return without doing anything) and the program progresses steadily.

Things look very different if the protothreads are scheduled in reverse order, by doing

while(1) {
  for(i = NUM_THREADS - 1; i >= 0; i--)
    thread(&thread_context[i], i);
}

This is the worst-case scenario for threadring, as 501 threads in the ring will be visited before the only runnable one is found:

GCC C (Protothreads, pessimum scheduling)   220KB   18.6s

The pessimum schedule is about 250 times slower than the optimum one (18.6s / 0.076s ≈ 245). We don't get a ratio of 500 because the benchmark involves some minimal computation in addition to context switching, and that overhead is only observable in the optimally scheduled Protothreads version (remember that the fastest general-purpose lightweight thread implementation, GHC's, took no less than 1.2s to perform 10 million context switches).

Generalization

Protothreads are exceptionally fast in the threadring benchmark because it does little more than context switching, and the optimum scheduling policy is found trivially. In practice, when you use (lightweight) threads, you will more often than not want at least:

  1. thread-local state, i.e., local variables that survive across context switches
  2. a real scheduler, which runs only the runnable threads instead of polling all of them

These features can be built on top of Protothreads' core, but they will make the resulting code both slower and more memory-hungry than the optimum case shown above. Scheduling, in particular, will require at least that a queue of runnable threads be maintained. This will be fast (a straightforward implementation will be logarithmic in the number of threads), but still considerably slower than the above code.
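
As a rough illustration (hypothetical code, not part of Protothreads), a minimal FIFO run queue over the core could look like the following; make_runnable would be called by whatever unblocks a thread (e.g., a semaphore signal), so blocked threads are never polled:

#include "pt.h"

#define MAX_TASKS 503

struct task {
  struct pt pt;
  char (*fn)(struct pt *, int);   /* a PT_THREAD function */
  int id;
};

/* circular FIFO of runnable tasks */
static struct task *run_queue[MAX_TASKS];
static int head, tail, queued;

static void make_runnable(struct task *t)
{
  run_queue[tail] = t;
  tail = (tail + 1) % MAX_TASKS;
  queued++;
}

static void scheduler(void)
{
  while(queued > 0) {
    struct task *t = run_queue[head];
    head = (head + 1) % MAX_TASKS;
    queued--;
    t->fn(&t->pt, t->id);   /* run until it blocks or yields */
  }
}

(A plain FIFO like this is constant-time per operation; priorities or timeouts would call for an ordered structure, hence the logarithmic estimate above.)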

The author of Protothreads recommends them for applications like "memory constrained systems, event-driven protocol stacks, small embedded systems, (or) sensor network nodes" and, indeed, nothing comes close in terms of lightness for these kinds of tasks.



No Title - Anonymous (2008-04-01 (Tue) 21:27:27)

But how do they scale?

mfp 2008-04-02 (Wed) 05:17:04

I assume this is tongue-in-cheek, but here's an answer anyway. It depends on the intended meaning of "scale":


yw - rektide (2008-04-01 (Tue) 22:57:32)

you're welcome for the reference to protothreads. ;) i'll trade you a lack of name drop for a detailed follow up on a minimalist lightweight threading library? i kid i kid, glad to see a good post on what protothreads actually does, which is considerably more interesting than just adding its benchmark.

thread-local state should be borderline trivial, thread scheduling on the other hand has no straightforward solution.

mfp 2008-04-02 (Wed) 05:06:47

Indeed, this post exists only because you asked for the benchmark (I heard about Protothreads some time ago, via anarchaia IIRC, but didn't have it in mind when I compared the initial implementations); I'm adding a note to that effect.

Thread-local state is certainly easy, as it can be passed to the PT_THREAD as an argument (it'd still be less convenient than lexically scoped, automatic variables). Scheduling isn't hard either, but it's bound to be slower than the perfect, constant-time, optimum "null scheduler" :)
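
For instance (hypothetical sketch, not code from the library), the per-thread state can live in a structure passed to the protothread:

#include "pt.h"

/* Hypothetical per-thread state: everything that must survive a
   blocking call lives here instead of in automatic variables. */
struct worker_state {
  struct pt pt;
  int id;
  int counter;
};

static
PT_THREAD(worker(struct worker_state *st))
{
  PT_BEGIN(&st->pt);
  while(1) {
    st->counter++;        /* preserved across context switches */
    PT_YIELD(&st->pt);    /* context switch point */
  }
  PT_END(&st->pt);
}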


GNU portable threads - Kyku (2008-04-02 (Wed) 02:13:06)

Hello, would you benchmark them?

http://www.gnu.org/software/pth/

mfp 2008-04-02 (Wed) 05:10:06

This should be interesting:

each thread has its own individual program-counter, run-time stack, signal mask and errno variable

I'm taking note of that; I'll eventually add the results to this page (or on a separate post if I find something interesting to say about GNU Pth).


PuTTY - Tony Finch (2008-04-02 (Wed) 08:14:29)

Simon Tatham has used switch-based coroutines in PuTTY and has an article about them on his web page.

http://www.chiark.greenend.org.uk/~sgtatham/coroutines.html

mfp 2008-04-04 (Fri) 08:36:04

It's exactly the same mechanism as Protothreads' LCs. This is a great link; it allows us to reconstruct the genealogy of this technique. It turns out the trick was already known to Tom Duff (as in "Duff's device") in 1983(!) (a "revolting way to use switches to implement interrupt driven state machines").


Efficient scheduling - MenTaLguY (2008-04-03 (Thr) 14:58:10)

Efficient scheduling is mostly just a matter of establishing separate wait queues for blocking protothreads (e.g. condition variables rather than simple condition polling), and using a better data structure than an array.

MenTaLguY 2008-04-03 (Thr) 15:02:03

And, I fail for reading comprehension. You said that at the end, didn't you?

mfp 2008-04-04 (Fri) 08:38:22

What do they say about impatience and programmers? ;)



State Threads - Gabriel (2008-06-09 (Mon) 03:49:55)

I benchmarked thread libraries recently. Basically, I compared toy webservers. The results are not yet available (but will be soon). Anyway, if you want something blazingly fast, you should definitely try State Threads:

http://state-threads.sourceforge.net/

It outperforms GNU Pth, NPTL and a few other libraries, while providing high-level functions (including a scheduler, which protothreads really lacks). Moreover, it performs quite well on microbenchmarks too, so I guess it could be a good challenger in your experiment.

green threads considered harmful - rarecactus (2008-07-07 (Mon) 01:05:01)

What you're talking about here is "green threads"-- i.e., threads that the kernel does not know about, implemented entirely in userspace.

Green threads can do well in toy benchmarks like these because they don't do any kernel-level context switching. That means, among other things, that the TLB doesn't need to be flushed.

Unfortunately, green threads have a lot of serious drawbacks. Because the kernel does not know about them, it cannot schedule them intelligently.

Example: imagine kernel thread X is implementing green threads G1 and G2. G1 is merrily executing along. Then it hits a page fault and must fetch data from disk. The kernel puts thread X to sleep until the disk finishes fetching that page. This may take on the order of milliseconds. During that time, G2 will not be executed. Even though G2 actually doesn't care about the page in question, there is no way for the kernel to know that.

Kernel threads can block for an arbitrary amount of time. For example, when a disk is in the process of going bad, reads may take longer and longer-- even on the order of several seconds. Or if you are waiting for data on a blocking read from a socket, there is no telling how long it could take. During that time, all of the green threads which share the same "real" kernel thread are being held hostage.

Basically, the problems result from the fact that green threads are not integrated with the rest of the scheduler. One implication is that all of the green threads implemented by the same kernel thread must reside on the same CPU. The kernel cannot transfer them to other CPUs because it doesn't know about them. This will become a heavier and heavier burden with the rise of multi-core systems. And, of course, using tools like strace on them is out of the question.

All in all, the longer you think about them, the less clever green threads look.