Expected performance characteristics of subinterpreters

I’m testing the subinterpreters interface because I’ll likely be taking advantage of it to improve the scaling of this library across multiple threads. Subinterpreters are attractive because they promise multi-core parallelism within a single process, with much cheaper communication than separate processes, so they seem like a perfect fit. I’ve read PEP 554 and PEP 734 and have been eagerly awaiting the beta release. (Queues/channels didn’t work in the alpha release.)

I just tried it out and learned two things:

  1. Launching new subinterpreters is slower than launching new processes. This is a surprise to me. (I’ll show code below.)
  2. Sending data to and from subinterpreters is a lot faster than sending data to external processes. This is not a surprise.

Here’s some code and some timing numbers from a 3 GHz, 16-core computer running Linux. All scripts have the same imports:

import time
import multiprocessing
import threading

# in 3.13.0b1, the PEP 734 interfaces live under test.support, not in a public module
from test import support
from test.support import import_helper

_interpreters = import_helper.import_module("_interpreters")
from test.support import interpreters
from test.support.interpreters import queues

First, to compare launching times of subinterpreters and processes:

def in_subinterp():
    2 + 2

def in_thread():
    subinterp = interpreters.create()
    subinterp.call(in_subinterp)
    subinterp.close()

starttime = time.perf_counter()

so_many = []
for _ in range(10000):
    so_many.append(threading.Thread(target=in_thread))

for x in so_many:
    x.start()

for x in so_many:
    x.join()

print(time.perf_counter() - starttime)

and

def in_process():
    2 + 2

starttime = time.perf_counter()

so_many = []
for _ in range(10000):
    so_many.append(multiprocessing.Process(target=in_process))

for x in so_many:
    x.start()

for x in so_many:
    x.join()

print(time.perf_counter() - starttime)

Launching 10 thousand subinterpreters took 11.1 seconds, while starting 10 thousand processes took 7.3 seconds: a factor of 1.5. It was a lot worse with call_in_thread (74.4 seconds for the subinterpreters), but I think that might have been, at least partially, blocking between calls. Above, both scripts start a suite of 10 thousand threads/processes; an interpreter starts independently in each, calculates 2 + 2 (to be sure it has really started), and then shuts down. If thread start-up times are equal to process start-up times (it’s Linux, so fork is fast), then starting each subinterpreter is doing something that costs roughly 6 ms more than forking ((11.1 - 7.3) / (10000 / 16)… something like that).
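Spelling that back-of-envelope estimate out (the numbers are the measurements above; the assumption is that all 16 cores stay busy throughout):

subinterp_total = 11.1   # seconds for 10,000 subinterpreters (in threads)
process_total = 7.3      # seconds for 10,000 forked processes
n_tasks = 10000
n_cores = 16

# extra wall-clock time per subinterpreter, if the 10,000 tasks ran as
# 10,000 / 16 sequential "waves" of 16 concurrent tasks
extra_per_interp = (subinterp_total - process_total) / (n_tasks / n_cores)
print(f"{extra_per_interp * 1e3:.2f} ms")   # ~6.08 ms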

As for upper limits, the number of processes is constrained by ulimit, which is problematic for a Python library because ulimit is configured outside of Python. Subinterpreters don’t seem to have a limit, though on one of the tests of 10 thousand subinterpreters, I got this non-reproducible error:

  File "/home/jpivarski/tmp/subinterpreters/many-subinterps.py", line 17, in in_thread
    subinterp = interpreters.create()
  File "/home/jpivarski/tmp/subinterpreters/Python-3.13.0b1/Lib/test/support/interpreters/__init__.py", line 76, in create
    id = _interpreters.create(reqrefs=True)
interpreters.InterpreterError: interpreter creation failed
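As an aside, the ulimit cap that constrains the multiprocessing approach can at least be inspected from inside Python with the standard resource module (a library can raise the soft limit, but only up to the hard limit):

import resource

# what `ulimit -u` reports: the per-user cap on processes/threads
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print(f"RLIMIT_NPROC: soft={soft}, hard={hard}")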

Next, to measure communication times:

def in_subinterp():
    # queues must be imported inside the subinterpreter; to_id and from_id
    # arrive as globals via prepare_main() below
    from test.support.interpreters import queues

    to_subinterp = queues.Queue(to_id)
    from_subinterp = queues.Queue(from_id)

    total = 0
    while True:
        obj = to_subinterp.get()
        if obj is None:   # sentinel: no more data
            break
        total += obj

    from_subinterp.put(total, syncobj=True)


to_subinterp = queues.create()
from_subinterp = queues.create()

starttime = time.perf_counter()

subinterp = interpreters.create()
subinterp.prepare_main({"to_id": to_subinterp.id, "from_id": from_subinterp.id})
subinterp.call_in_thread(in_subinterp)

for x in range(10000000):
    to_subinterp.put(x, syncobj=True)

to_subinterp.put(None, syncobj=True)

total = from_subinterp.get()

print(time.perf_counter() - starttime)

and

def in_process(to_process, from_process):
    total = 0
    while True:
        obj = to_process.get()
        if obj is None:
            break
        total += obj

    from_process.put(total)


to_process = multiprocessing.Queue()
from_process = multiprocessing.Queue()

starttime = time.perf_counter()

process = multiprocessing.Process(
    target=in_process, args=(to_process, from_process)
)
process.start()

for x in range(10000000):
    to_process.put(x)

to_process.put(None)

total = from_process.get()
print(time.perf_counter() - starttime)

Sending 10 million integers to one subinterpreter using a Queue took 6.1 seconds, whereas sending 10 million integers to one process using a Queue took 43.0 seconds. That’s a factor of 7 in the subinterpreters’ favor, and about what I expected.

For completeness, here’s a script to get a baseline (single-threaded, performing the same computation):

starttime = time.perf_counter()

total = 0
for x in range(10000000):
    total += x

print(f"main {total = }")

print(time.perf_counter() - starttime)

It took 0.8 seconds, so the whole queue-based version is roughly 8× slower than just adding the integers, which works out to about 0.5 µs of overhead per queue item ((6.1 - 0.8) / 10000000 ≈ 0.53 µs). Not bad!
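If that per-item overhead ever matters, the obvious mitigation would be batching: pack many integers into a single bytes object (bytes are shareable) so the queue pays its cost once per chunk rather than once per integer. A minimal sketch of the idea, dropped into the script above; the chunk size and typecode are arbitrary choices of mine, and in_subinterp would have to unpack the same way:

import array

CHUNK = 100000   # arbitrary batch size

# producer side: replace the per-integer put() loop with chunked puts
chunk = array.array("q")
for x in range(10000000):
    chunk.append(x)
    if len(chunk) == CHUNK:
        to_subinterp.put(chunk.tobytes(), syncobj=True)
        chunk = array.array("q")
if chunk:
    to_subinterp.put(chunk.tobytes(), syncobj=True)
to_subinterp.put(None, syncobj=True)


def consume_batched(to_subinterp):
    # consumer side: the loop inside in_subinterp, adapted to unpack chunks
    total = 0
    while True:
        obj = to_subinterp.get()
        if obj is None:
            break
        total += sum(array.array("q", obj))
    return total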

The next thing that would be interesting to test is the scaling of Python code that updates a shared array across subinterpreters. I personally believe that the scaling would be close to perfect (only running into trouble at ~1 Gbps, when all of the threads are trying to pull data over the same memory bus, just like a C program), but I couldn’t test it because I don’t know how to install a package like NumPy against a manually compiled Python (I normally let Conda manage my environments), and this workaround using ctypes:

import array

# in the main interpreter: load the data and get the address of its buffer
big_array = array.array("i")
big_array.fromfile(open("/tmp/numbers.int32", "rb"), 16*10000000)
pointer, _ = big_array.buffer_info()

# pass the (integer) pointer to the subinterpreter and, inside it, wrap the
# same memory without copying:

import ctypes   # this is the import that fails inside a subinterpreter
big_array = (ctypes.c_int32 * (16*10000000)).from_address(pointer)

didn’t work because

Traceback (most recent call last):
  File "/home/jpivarski/tmp/subinterpreters/subinterp-multithread.py", line 24, in in_subinterp
    import ctypes
  File "/home/jpivarski/tmp/subinterpreters/Python-3.13.0b1/Lib/ctypes/__init__.py", line 8, in <module>
    from _ctypes import Union, Structure, Array
ImportError: module _ctypes does not support loading in subinterpreters
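For what it’s worth, here is roughly what a pure-Python version of the scaling test would look like, using only the APIs already shown working above (same imports as the other scripts). It doesn’t touch a shared array, so it wouldn’t probe the memory-bus hypothesis, only how CPU-bound Python code scales across subinterpreters; the worker count and problem size are arbitrary:

def in_subinterp():
    from test.support.interpreters import queues

    results = queues.Queue(result_id)

    # CPU-bound, pure-Python work on this interpreter's slice of the range
    total = 0
    for x in range(start, stop):
        total += x

    results.put(total, syncobj=True)


N_WORKERS = 16   # one per core
N = N_WORKERS * 10000000

result_queue = queues.create()

starttime = time.perf_counter()

interps = []
for i in range(N_WORKERS):
    subinterp = interpreters.create()
    subinterp.prepare_main({
        "result_id": result_queue.id,
        "start": i * (N // N_WORKERS),
        "stop": (i + 1) * (N // N_WORKERS),
    })
    subinterp.call_in_thread(in_subinterp)
    interps.append(subinterp)   # keep references alive until we're done

total = sum(result_queue.get() for _ in range(N_WORKERS))

print(total, time.perf_counter() - starttime)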

If all of the performance results above are as expected, then they suggest a usage strategy for subinterpreters: pay the creation cost rarely (e.g., keep a long-lived pool of subinterpreters, rather than creating one per task) and communicate with them freely, since queue traffic is cheap compared to inter-process queues.

Is that right? This is a question for the subinterpreter developers: are these rough performance numbers and interpretations in line with what you expect?
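To make that strategy concrete, here is roughly the shape I have in mind. It is only a sketch reusing the API calls demonstrated above; pool_worker, the worker count, and the toy workload (squaring integers) are all placeholders:

def pool_worker():
    from test.support.interpreters import queues

    tasks = queues.Queue(task_id)
    results = queues.Queue(result_id)

    while True:
        item = tasks.get()
        if item is None:   # shutdown sentinel
            break
        results.put(item * item, syncobj=True)   # placeholder "work"


N_WORKERS = 16

task_queue = queues.create()
result_queue = queues.create()

# pay the (relatively slow) interpreter creation cost once, up front
pool = []
for _ in range(N_WORKERS):
    subinterp = interpreters.create()
    subinterp.prepare_main({"task_id": task_queue.id, "result_id": result_queue.id})
    subinterp.call_in_thread(pool_worker)
    pool.append(subinterp)

# then feed it work over the (relatively fast) queues for the life of the program
for x in range(1000):
    task_queue.put(x, syncobj=True)
results = [result_queue.get() for _ in range(1000)]

# shut the pool down
for _ in range(N_WORKERS):
    task_queue.put(None, syncobj=True)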