
[Python-Dev] bpo-34837: Multiprocessing.Pool API Extension - Pass Data to Workers w/o Globals

Sean Harrington seanharr11 at gmail.com
Thu Oct 18 11:35:16 EDT 2018


You have correctly summarized my intentions, and I agree with your reasoning and concern; however, there is a reasonable answer as to why this optimization has never been implemented:

In Pool, the task tuple consists of (result_job, func, (x,), {}). This is the object that is serialized/deserialized between processes. The only thing we really care about here is the tuple (x,) and, confusingly, not func (func is actually either mapstar() or starmapstar(), which is called with (x,) as its *args). Our element of interest, (x,), is itself a tuple of (func, iterable). Because we need to temper the size of the iterable bundled in each task, to avoid de/serialization slowness, we usually end up with multiple tasks per worker, and thus multiple funcs per worker. So this is really only an optimization in the case of really big functions/closures/partials (or REALLY big iterables with an unreasonably small chunksize passed to map()). The most common use case comes up when passing instance methods (of really big objects!) to Pool.map().
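For concreteness, here is a minimal sketch of the chunking behavior described above, loosely modeled on Pool._get_tasks() and mapstar() (simplified, not the exact CPython code); the point is that func rides along with every chunk and is therefore serialized once per task:

    import itertools

    def get_task_chunks(func, iterable, chunksize):
        # Yield (func, chunk) pairs, roughly what Pool._get_tasks() produces.
        it = iter(iterable)
        while True:
            chunk = tuple(itertools.islice(it, chunksize))
            if not chunk:
                return
            yield (func, chunk)

    def mapstar(args):
        # What the worker ultimately calls; args is one (func, chunk) pair.
        return list(map(*args))

    # Each task below carries its own reference to the target function, so a
    # large function/closure would be pickled once per chunk.
    tasks = list(get_task_chunks(len, ["a", "bb", "ccc", "dddd"], chunksize=2))
    results = [mapstar(task) for task in tasks]   # [[1, 2], [3, 4]]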

This post <https://thelaziestprogrammer.com/python/a-multiprocessing-pool-pickle#stuck-in-a-pickle> fills in the above with more detail.

Further, let me pivot on my idea of qualname: we can instead use the id of func as the cache key to address your concern, and store this id on the task tuple (i.e. an integer in lieu of the func previously stored there).
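To sketch what I mean (purely illustrative; these helper names and the single-process dicts are not an existing multiprocessing API, and in a real Pool the two caches would live in different processes): the parent ships the pickled func only the first time a given id is seen, and each worker keeps a per-process cache keyed by that id.

    import pickle

    _sent_ids = set()      # parent side: ids already shipped to workers
    _func_cache = {}       # worker side: id(func) -> callable

    def make_task(func, chunk):
        # Parent: include the pickled func only the first time its id is seen.
        key = id(func)
        payload = None if key in _sent_ids else pickle.dumps(func)
        _sent_ids.add(key)
        return (key, payload, chunk)

    def run_task(task):
        # Worker: resolve func from the cache, filling it on a cache miss.
        key, payload, chunk = task
        if payload is not None:
            _func_cache[key] = pickle.loads(payload)
        return [_func_cache[key](x) for x in chunk]

    tasks = [make_task(len, ("a", "bb")), make_task(len, ("ccc",))]
    results = [run_task(t) for t in tasks]   # [[1, 2], [3]]

One caveat worth noting: id() values can be reused once the original func is garbage-collected, so the parent would also need to hold a reference to func for as long as its id is used as a cache key.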

On Thu, Oct 18, 2018 at 12:49 AM Michael Selik <michael.selik at gmail.com> wrote:

If imap_unordered is currently re-pickling and sending func each time it's called on the worker, I have to suspect there was some reason to do that and not cache it after the first call. Rather than assuming that's an opportunity for an optimization, I'd want to be certain it won't have edge case negative effects.

On Tue, Oct 16, 2018 at 2:53 PM Sean Harrington <seanharr11 at gmail.com> wrote:

Is your concern something like the following?

    with Pool(8) as p:
        gen = p.imap_unordered(func, ls)
        first_elem = next(gen)
        p.apply_async(long_func, x)
        remaining_elems = [elem for elem in gen]

My concern was passing the same function (or a function with the same qualname). You're suggesting caching functions and identifying them by qualname to avoid re-pickling a large stateful object that's shoved into the function's defaults or closure. Is that a correct summary?

If so, how would the function cache distinguish between two functions with the same name? Would it need to examine the defaults and closure as well? If so, that means it's pickling the second one anyway, so there's no efficiency gain.

    In [1]: def foo(a):
       ...:     def bar():
       ...:         print(a)
       ...:     return bar

    In [2]: f = foo(1)

    In [3]: g = foo(2)

    In [4]: f
    Out[4]: <function __main__.foo.<locals>.bar()>

    In [5]: g
    Out[5]: <function __main__.foo.<locals>.bar()>

If we say pool.apply_async(f) and pool.apply_async(g), would you want the latter one to avoid serialization, letting the worker make a second call with the first function object?
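A small sketch of the collision described above (illustrative, not part of the original exchange): both closures share a qualname even though their captured state differs, so qualname alone cannot serve as the cache key.

    def foo(a):
        def bar():
            return a
        return bar

    f, g = foo(1), foo(2)
    assert f.__qualname__ == g.__qualname__ == "foo.<locals>.bar"
    assert f() == 1 and g() == 2   # same qualname, different captured state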


