[Python-Dev] Slides from today's parallel/async Python talk
Trent Nelson trent at snakebite.org
Thu Mar 14 19:45:20 CET 2013
On Wed, Mar 13, 2013 at 07:05:41PM -0700, Trent Nelson wrote:
> Just posted the slides for those that didn't have the benefit of attending the language summit today:
> https://speakerdeck.com/trent/parallelizing-the-python-interpreter-an-alternate-approach-to-async
Someone on /r/python asked if I could elaborate on the "do Y" part
of "if we're in a parallel thread, do Y, if not, do X", which I
(inadvertently) ended up replying to in detail. I've included the
response below. (I'll work on converting this into a TL;DR set of
slides soon.)
> Can you go into a bit of depth about "X" here?
That's a huge topic that I'm hoping to tackle ASAP. The basic premise is that parallel 'Context' objects (well, structs) are allocated for each parallel thread callback. The context persists for the lifetime of the "parallel work".
The "lifetime of the parallel work" depends on what you're doing. For a
simple async.submit_work(foo)
, the context is considered complete
once foo()
has been called (presuming no exceptions were raised).
For an async client/server, the context will persist for the entirety of the connection.
The context is responsible for encapsulating all resources related to the parallel thread. So, it has its own heap, and all memory allocations are taken from that heap.
For any given parallel thread, only one context can be executing at a
time, and this can be accessed via the __declspec(thread) Context *ctx
global (which is primed by some glue code as soon as the parallel
thread starts executing a callback).
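To make that per-thread lookup concrete, here's a tiny illustrative model of the pattern in Python; the real thing is a C thread-local global, and the names below are invented for the sketch:

import threading

# Models the "one active context per parallel thread" lookup. In the real
# code this is a __declspec(thread) C global primed by glue code; a
# threading.local() slot expresses the same idea.
_active = threading.local()

def set_active_context(ctx):
    # Primed as soon as the parallel thread starts executing a callback.
    _active.ctx = ctx

def active_context():
    return getattr(_active, "ctx", None)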
No reference counting or garbage collection is done during parallel thread execution. Instead, once the context is finished, it is scheduled to be released, which means it'll be "processed" by the main thread as part of its housekeeping work (during async.run(); technically, async.run_once()).
The main thread simply destroys the entire heap in one fell swoop, releasing all memory that was associated with that context.
There are a few side effects to this. First, the heap allocator (basically, the thing that answers malloc() calls) is incredibly simple. It allocates LARGE_PAGE_SIZE chunks of memory at a time (2MB on x64), and simply returns pointers to that chunk for each memory request (adjusting h->next and allocation stats as it goes along, obviously). Once the 2MB has been exhausted, another 2MB is allocated.
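Here's a rough Python sketch of that bump-allocation scheme, just to make the mechanics concrete; the class and field names are invented for illustration and aren't the actual C implementation:

LARGE_PAGE_SIZE = 2 * 1024 * 1024  # 2MB on x64

class Heap:
    def __init__(self):
        self.chunks = []        # LARGE_PAGE_SIZE chunks owned by this context
        self.next = 0           # bump pointer into the current chunk
        self.remaining = 0
        self.allocations = 0    # allocation stats

    def malloc(self, size):
        # (Sketch ignores requests larger than a single chunk.)
        if size > self.remaining:
            # Current chunk exhausted: allocate another 2MB chunk.
            self.chunks.append(bytearray(LARGE_PAGE_SIZE))
            self.next = 0
            self.remaining = LARGE_PAGE_SIZE
        offset = self.next
        self.next += size       # just bump the pointer; there is no free()
        self.remaining -= size
        self.allocations += 1
        return memoryview(self.chunks[-1])[offset:offset + size]

    def destroy(self):
        # What the main thread does during housekeeping: release everything
        # associated with the context in one fell swoop.
        self.chunks.clear()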
That approach is fine for the submit_(work|timer|wait) callbacks, which basically provide a way to run a presumably-finite-length function in a parallel thread (invoking callbacks/errbacks as required). However, it breaks down when dealing with client/server stuff. Each invocation of a callback (say, data_received(...)) may only consume, say, 500 bytes, but it might be called a million times before the connection is terminated. You can't have cumulative memory usage with possibly-infinite-length client/server callbacks like you can with the once-off submit_(work|wait|timer) stuff.
So, enter heap snapshots. The logic that handles all client/server connections is instrumented such that it takes a snapshot of the heap (and all associated stats) prior to invoking a Python method (via PyObject_Call(), for example; i.e. the invocation of data_received).
When the method completes, we can simply roll back the snapshot. The heap's stats and next pointers et al all get reset back to what they were before the callback was invoked.
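A snapshot in this model is nothing more than a copy of the allocator's bookkeeping, and rolling back just restores it. Continuing the illustrative Heap sketch from above (again, invented names, not the real code):

class HeapSnapshot:
    def __init__(self, heap):
        self.num_chunks = len(heap.chunks)
        self.next = heap.next
        self.remaining = heap.remaining
        self.allocations = heap.allocations

def snapshot(heap):
    return HeapSnapshot(heap)

def rollback(heap, snap):
    # Everything allocated since the snapshot is simply forgotten; the
    # bump pointer and stats go back to where they were before the
    # callback (e.g. data_received) was invoked.
    del heap.chunks[snap.num_chunks:]
    heap.next = snap.next
    heap.remaining = snap.remaining
    heap.allocations = snap.allocations

The server glue would then do something like: take a snapshot, call data_received(...), and roll back as soon as it returns.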
That's how the chargen server is able to pump out endless streams of data for every client whilst keeping memory usage static. (Well, every new client currently consumes a minimum of 2MB, but down the track that can be tweaked back down to SMALL_PAGE_SIZE, 4096, for servers that need to handle hundreds of thousands of clients simultaneously.)
The only issue with this approach is detecting when the callback has done the unthinkable (from a shared-nothing perspective) and persisted some random object outside of the parallel context it was created in.
That's actually a huge separate technical issue to tackle -- and it applies just as much to the normal submit_(wait|work|timer) callbacks as well. I've got a somewhat-temporary solution in place for that currently:
d = async.dict()

def foo():
    # async.rdtsc() is a helper method that basically wraps the result
    # of the assembly RDTSC (read time-stamp counter) instruction into
    # a PyLong object. So, it's handy when I need to test the very
    # functionality being demonstrated here (creating an object within
    # a parallel context and persisting it elsewhere).
    d['foo'] = async.rdtsc()

def bar():
    d['bar'] = async.rdtsc()

async.submit_work(foo)
async.submit_work(bar)
That'll result in two contexts being created, one for each callback invocation. async.dict() is a "parallel safe" wrapper around a normal PyDict. This is referred to as "protection".
In fact, the code above could have been written as follows:
d = async.protect(dict())
What protect() does is instrument the object such that we intercept __getitem__, __setitem__, __getattr__ and __setattr__.
We replace these methods with counterparts that serve two purposes:
1. The read-only methods are wrapped in a read lock, and the write methods are wrapped in a write lock (using underlying system slim read/write locks, which are uber fast). Basically, you can have unlimited readers holding the read lock, but only one writer can hold the write lock (excluding all the readers and other writers).
2. Detecting when parallel objects (objects created from within a parallel thread, and thus backed by the parallel context's heap) have been assigned outside the context (in this case, to a "protected" dict object that was created from the main thread).
The first point is important as it ensures concurrent access doesn't corrupt the data structure.
The second point is important because it allows us to prevent the persisted object's context from automatically transitioning into the complete->release->heapdestroy lifecycle when the callback completes.
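As a rough illustration of what that wrapping might look like (a plain lock stands in for the slim read/write locks, and is_parallel_object()/persist() are hypothetical stand-ins for the real parallel-heap checks):

import threading

def is_parallel_object(obj):
    # Hypothetical check: does this object's memory live in a parallel
    # context's heap?
    return getattr(obj, "_parallel_ctx", None) is not None

class ProtectedDict:
    # Illustrative stand-in for async.protect(dict()); not the real API.
    def __init__(self, wrapped=None):
        self._d = wrapped if wrapped is not None else {}
        self._lock = threading.Lock()   # slim read/write locks in reality

    def __getitem__(self, key):
        with self._lock:                # read lock in the real thing
            return self._d[key]

    def __setitem__(self, key, value):
        with self._lock:                # write lock in the real thing
            if is_parallel_object(value):
                # Keep the owning context (and its heap) alive; see the
                # persistence discussion below.
                value._parallel_ctx.persist()
            self._d[key] = value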
This is known as "persistence", as in, a context has been persisted. All sorts of things happen to the object when we detect that it's been persisted. The biggest thing is that reference counting is enabled again for the object (from the perspective of the main thread; ref counting is still a no-op within the parallel thread) -- however, once the refcount hits zero, instead of free()ing the memory like we'd normally do in the main thread (or garbage collecting it), we decref the reference count of the owning context.
Once the owning context's refcount goes to zero, we know that no more references exist to objects created from that parallel thread's execution, and we're free to release the context (and thus, destroy the heap -> free the memory).
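In other words, a persisted context is itself reference counted, one reference per escaped object. A minimal sketch of that lifecycle, with persist/decref/release as invented names for illustration (building on the Heap sketch above):

class Context:
    def __init__(self, heap):
        self.heap = heap
        self.refcnt = 0
        self.persisted = False

    def persist(self):
        # Called when an object created in this context escapes to the
        # main thread.
        self.persisted = True
        self.refcnt += 1

    def decref(self):
        # Called from the main thread when an escaped object's own
        # refcount hits zero.
        self.refcnt -= 1
        if self.refcnt == 0:
            self.release()

    def release(self):
        # No references remain to anything created during this parallel
        # thread's execution: destroy the heap, free the memory.
        self.heap.destroy()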
That's currently implemented and works very well. There are a few drawbacks. First, the user must only assign to an "async protected" object. Use a normal dict and you're going to segfault or corrupt things (or worse) pretty quickly.
Second, we're persisting the entire context potentially for a single object. The context may be huge; think of some data processing callback that ran for ages, racked up a 100MB footprint, but only generated a PyLong with the value 42 at the end, which consumes, like, 50 bytes (or whatever the size of a PyLong is these days).
It's crazy keeping a 100MB context around indefinitely until that PyLong object goes away, so, we need another option. The idea I have for that is "promotion". Rather than persist the context, the object is "promoted"; basically, the parallel thread palms it off to the main thread, which proceeds to deep-copy the object, and take over ownership. This removes the need for the context to be persisted.
Now, I probably shouldn't have said "deep-copy" there. Promotion is a terrible option for anything other than simple objects (scalars). If you've got a huge list that consumes 98% of your 100MB heap footprint, well, persistence is perfect. If it's a 50 byte scalar, promotion is perfect. (Also, deep-copy implies collection interrogation, which has all sorts of complexities, so, err, I'll probably end up supporting promotion if the object is a scalar that can be shallow-copied. Any form of collection or non-scalar type will get persisted by default.)
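The decision could end up being as simple as something like this (illustrative only: promotion isn't implemented yet, and the scalar set and names here are assumptions):

import copy

SCALAR_TYPES = (int, float, complex, bool, bytes, str, type(None))

def escape_to_main_thread(obj, ctx):
    if isinstance(obj, SCALAR_TYPES):
        # Promotion: the main thread takes over ownership of its own
        # shallow copy, and the parallel context can die normally.
        return copy.copy(obj)
    # Persistence: keep the whole owning context (and its heap) alive
    # until the main thread drops its last reference to obj.
    ctx.persist()
    return obj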
I haven't implemented promotion yet (persistence works well enough for now). And none of this is integrated into the heap snapshot/rollback logic -- i.e. we don't detect if a client/server callback assigned an object created in the parallel context to a main-thread object -- we just roll back blindly as soon as the callback completes.
Before this ever has a chance of being eligible for adoption into CPython, those problems will need to be addressed. As much as I'd like to ignore those corner cases that violate the shared-nothing approach -- it's inevitable someone, somewhere, will be assigning parallel objects outside of the context, maybe for good reason, maybe by accident, maybe because they don't know any better. Whatever the reason, the result shouldn't be corruption.
So, the remaining challenge is preventing the use case alluded to earlier where someone tries to modify an object that hasn't been "async protected". That's a bit harder. The idea I've got in mind is to instrument the main CPython ceval loop, such that we do these checks as part of opcode processing. That allows us to keep all the logic in the one spot and not have to go hacking the internals of every single object's C backend to ensure correctness.
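Conceptually, the check attached to a store opcode would look something like this, sketched in Python rather than the C it would actually live in (the helper predicates are invented for the sketch):

def is_parallel_object(obj):
    return getattr(obj, "_parallel_ctx", None) is not None

def is_protected(obj):
    return getattr(obj, "_async_protected", False)

def check_store(target, value, in_parallel_thread):
    # Imagine this running as part of STORE_SUBSCR / STORE_ATTR /
    # STORE_GLOBAL handling while executing in a parallel thread.
    if not in_parallel_thread:
        return
    if not is_parallel_object(target) and is_parallel_object(value):
        if not is_protected(target):
            raise RuntimeError("parallel object assigned to an "
                               "unprotected main-thread object")
        value._parallel_ctx.persist()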
Now, that'll probably work to an extent. I mean, after all, there are opcodes for all the things we'd be interested in instrumenting: LOAD_GLOBAL, STORE_GLOBAL, STORE_SUBSCR, etc. What becomes challenging is detecting arbitrary mutations via object calls, i.e. how do we know, during the ceval loop, that foo.append(x) needs to be treated specially if foo is a main-thread object and x is a parallel-thread object?
There may be no way to handle that other than hacking the internals of each object, unfortunately. So, the viability of this whole approach may rest on whether or not that's deemed an acceptable tradeoff (a necessary evil, even) by the Python developer community.
If it's not, then it's unlikely this approach will ever see the light of day in CPython. If that turns out to be the case, then I see this project taking the path that Stackless took (forking off and becoming a separate interpreter).
There's nothing wrong with that; I am really excited about the possibilities afforded by this approach, and I'm sure it will pique the interest of commercial entities out there that have problems perfectly suited to where this pattern excels (shared-nothing, highly concurrent), much like the relationship that developed between Stackless and Eve Online.
So, it'd be great if it eventually saw the light of day in CPython, but that'll be a long way down the track (at least 4.x I'd say), and all these issues that allow you to instantly segfault or corrupt the interpreter will need to be addressed before it's even eligible for discussion about inclusion.
Regards,
Trent.