[Python-Dev] bpo-34837: Multiprocessing.Pool API Extension - Pass Data to Workers w/o Globals
Sean Harrington seanharr11 at gmail.com
Thu Oct 18 11:35:16 EDT 2018
You have correctly summarized my intentions, and I agree with your reasoning & concern - however, there is a somewhat reasonable answer as to why this optimization has never been implemented:
In Pool, the task tuple consists of (result_job, func, (x,), {}). This is the object that is serialized/deserialized between processes. The only thing we really care about here is the tuple (x,), confusingly, not func (func is ACTUALLY either mapstar() or starmapstar(), which is called with (x,) as its *args). Our element of interest is (x,) - a tuple of (func, iterable). Because we need to temper the size of the iterable bundled in each task, to avoid de/serialization slowness, we usually end up with multiple tasks per worker, and thus multiple funcs per worker. Thus, this is really only an optimization in the case of really big functions/closures/partials (or REALLY big iterables with an unreasonably small chunksize passed to map()). The most common use case comes up when passing instance methods (of really big objects!) to Pool.map().
This post <https://thelaziestprogrammer.com/python/a-multiprocessing-pool-pickle#stuck-in-a-pickle> may color in the above with more details.
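To make the chunking/pickling cost concrete, here is a rough sketch (not the actual Pool internals; make_chunks and BigObject are purely illustrative names) of how an iterable gets sliced into per-task chunks, and how passing a bound method of a large object drags that object into every pickled task:

    import itertools
    import pickle

    def make_chunks(func, iterable, chunksize):
        # Roughly what Pool.map() does before dispatch: slice the iterable
        # into chunks and pair each chunk with the user's func. Each task
        # is pickled independently, so func travels with every chunk.
        it = iter(iterable)
        while True:
            chunk = tuple(itertools.islice(it, chunksize))
            if not chunk:
                return
            yield (func, chunk)

    class BigObject:
        def __init__(self):
            self.blob = b"x" * 10_000_000  # ~10 MB of state

        def work(self, item):
            return item + len(self.blob)

    big = BigObject()
    tasks = list(make_chunks(big.work, range(100), chunksize=10))
    # Pickling a bound method pickles the whole instance, once per task.
    print(len(tasks), "tasks,", len(pickle.dumps(tasks[0])), "bytes each")

Ten tasks here means the ~10 MB instance crosses the pipe ten times even though func never changes - which is exactly the pattern that motivates the optimization.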
Further, let me pivot on my idea of qualname... we can use the id of func as the cache key to address your concern, and store this id on the task tuple (i.e. an integer in lieu of the func previously stored there).
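A minimal sketch of what I have in mind, assuming a hypothetical worker-side cache (none of these names exist in multiprocessing today; make_task, run_task, _sent_funcs and _func_cache are purely illustrative):

    import pickle

    # Parent side (sketch): remember which function objects have already been
    # shipped. Holding a reference to func also keeps it alive, so its id()
    # cannot be recycled for a different object.
    _sent_funcs = {}  # id(func) -> func

    def make_task(job, func, args):
        """Build a task tuple that carries the pickled func only once per id."""
        key = id(func)
        if key in _sent_funcs:
            payload = None              # workers already hold it
        else:
            payload = pickle.dumps(func)
            _sent_funcs[key] = func
        return (job, key, payload, args)

    # Worker side (sketch): map the parent-process id to the deserialized callable.
    _func_cache = {}

    def run_task(task):
        job, key, payload, args = task
        if payload is not None:
            _func_cache[key] = pickle.loads(payload)
        return job, _func_cache[key](*args)

One wrinkle this glosses over: the parent can't know which worker will pick up a given task, so "send once" really has to mean "send once per worker" (or keep shipping the payload until every worker has it); the id-keyed cache shape stays the same either way.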
On Thu, Oct 18, 2018 at 12:49 AM Michael Selik <michael.selik at gmail.com> wrote:
If imap_unordered is currently re-pickling and sending func each time it's called on the worker, I have to suspect there was some reason to do that and not cache it after the first call. Rather than assuming that's an opportunity for an optimization, I'd want to be certain it won't have edge case negative effects.
On Tue, Oct 16, 2018 at 2:53 PM Sean Harrington <seanharr11 at gmail.com> wrote:

Is your concern something like the following?

    with Pool(8) as p:
        gen = p.imap_unordered(func, ls)
        first_elem = next(gen)
        p.apply_async(long_func, x)
        remaining_elems = [elem for elem in gen]

My concern was passing the same function (or a function with the same qualname). You're suggesting caching functions and identifying them by qualname to avoid re-pickling a large stateful object that's shoved into the function's defaults or closure. Is that a correct summary?

If so, how would the function cache distinguish between two functions with the same name? Would it need to examine the defaults and closure as well? If so, that means it's pickling the second one anyway, so there's no efficiency gain.

    In [1]: def foo(a):
       ...:     def bar():
       ...:         print(a)
       ...:     return bar

    In [2]: f = foo(1)

    In [3]: g = foo(2)

    In [4]: f
    Out[4]: <function __main__.foo.<locals>.bar()>

    In [5]: g
    Out[5]: <function __main__.foo.<locals>.bar()>

If we say pool.apply_async(f) and pool.apply_async(g), would you want the latter one to avoid serialization, letting the worker make a second call with the first function object?