Expected performance characteristics of subinterpreters

I’m testing the subinterpreters interface because I’ll likely be taking advantage of it to improve the scaling of this library across multiple threads. Subinterpreters are attractive because they promise multi-core parallelism within a single process, with much cheaper communication than separate processes, so they seem like a perfect fit. I’ve read PEP 554 and PEP 734 and have been eagerly awaiting the beta release. (Queues/channels didn’t work in the alpha release.)

I just tried it out and learned two things:

  1. Launching new subinterpreters is slower than launching new processes. This is a surprise to me. (I’ll show code below.)
  2. Sending data to and from subinterpreters is a lot faster than sending data to external processes. This is not a surprise.

Here’s some code and some timing numbers from a 3 GHz, 16-core computer running Linux. All scripts have the same imports:

import time
import multiprocessing
import threading

# in 3.13.0b1, the PEP 734 interfaces live under test.support, not in a public module
from test import support
from test.support import import_helper

_interpreters = import_helper.import_module("_interpreters")
from test.support import interpreters
from test.support.interpreters import queues

First, to compare launching times of subinterpreters and processes:

def in_subinterp():
    2 + 2

def in_thread():
    subinterp = interpreters.create()
    subinterp.call(in_subinterp)
    subinterp.close()

starttime = time.perf_counter()

so_many = []
for _ in range(10000):
    so_many.append(threading.Thread(target=in_thread))

for x in so_many:
    x.start()

for x in so_many:
    x.join()

print(time.perf_counter() - starttime)

and

def in_process():
    2 + 2

starttime = time.perf_counter()

so_many = []
for _ in range(10000):
    so_many.append(multiprocessing.Process(target=in_process))

for x in so_many:
    x.start()

for x in so_many:
    x.join()

print(time.perf_counter() - starttime)

Launching 10 thousand subinterpreters took 11.1 seconds, while starting 10 thousand processes took 7.3 seconds: a factor of 1.5. It was a lot worse with call_in_thread (74.4 seconds for the subinterpreters), but I think that might have been, at least partially, blocking between calls. Above, both scripts start a suite of 10 thousand threads/processes; an interpreter starts independently in each, calculates 2 + 2 (to be sure it has really started), and then shuts down. If thread start-up times are equal to process start-up times (it’s Linux, so fork is fast), then starting each subinterpreter is doing something that costs roughly 6 ms more than forking ((11.1 - 7.3) / (10000 / 16)… something like that).
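Spelling that back-of-envelope estimate out (the numbers are the measurements above; the assumption is that all 16 cores stay busy throughout):

subinterp_total = 11.1   # seconds for 10,000 subinterpreters (in threads)
process_total = 7.3      # seconds for 10,000 forked processes
n_tasks = 10000
n_cores = 16

# extra wall-clock time per subinterpreter, if the 10,000 tasks ran as
# 10,000 / 16 sequential "waves" of 16 concurrent tasks
extra_per_interp = (subinterp_total - process_total) / (n_tasks / n_cores)
print(f"{extra_per_interp * 1e3:.2f} ms")   # ~6.08 ms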

As for upper limits, the number of processes is constrained by ulimit, which is problematic for a Python library because ulimit is configured outside of Python. Subinterpreters don’t seem to have a limit, though on one of the tests of 10 thousand subinterpreters, I got this non-reproducible error:

  File "/home/jpivarski/tmp/subinterpreters/many-subinterps.py", line 17, in in_thread
    subinterp = interpreters.create()
  File "/home/jpivarski/tmp/subinterpreters/Python-3.13.0b1/Lib/test/support/interpreters/__init__.py", line 76, in create
    id = _interpreters.create(reqrefs=True)
interpreters.InterpreterError: interpreter creation failed
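As an aside, the ulimit cap that constrains the multiprocessing approach can at least be inspected from inside Python with the standard resource module (a library can raise the soft limit, but only up to the hard limit):

import resource

# what `ulimit -u` reports: the per-user cap on processes/threads
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print(f"RLIMIT_NPROC: soft={soft}, hard={hard}")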

Next, to measure communication times:

def in_subinterp():
    # queues must be imported inside the subinterpreter; to_id and from_id
    # arrive as globals via prepare_main() below
    from test.support.interpreters import queues

    to_subinterp = queues.Queue(to_id)
    from_subinterp = queues.Queue(from_id)

    total = 0
    while True:
        obj = to_subinterp.get()
        if obj is None:   # sentinel: no more data
            break
        total += obj

    from_subinterp.put(total, syncobj=True)


to_subinterp = queues.create()
from_subinterp = queues.create()

starttime = time.perf_counter()

subinterp = interpreters.create()
subinterp.prepare_main({"to_id": to_subinterp.id, "from_id": from_subinterp.id})
subinterp.call_in_thread(in_subinterp)

for x in range(10000000):
    to_subinterp.put(x, syncobj=True)

to_subinterp.put(None, syncobj=True)

total = from_subinterp.get()

print(time.perf_counter() - starttime)

and

def in_process(to_process, from_process):
    total = 0
    while True:
        obj = to_process.get()
        if obj is None:
            break
        total += obj

    from_process.put(total)


to_process = multiprocessing.Queue()
from_process = multiprocessing.Queue()

starttime = time.perf_counter()

process = multiprocessing.Process(
    target=in_process, args=(to_process, from_process)
)
process.start()

for x in range(10000000):
    to_process.put(x)

to_process.put(None)

total = from_process.get()
print(time.perf_counter() - starttime)

Sending 10 million integers to one subinterpreter using a Queue took 6.1 seconds, whereas sending 10 million integers to one process using a Queue took 43.0 seconds. That’s a factor of 7 in the subinterpreters’ favor, and about what I expected.

For completeness, here’s a script to get a baseline (single-threaded, performing the same computation):

starttime = time.perf_counter()

total = 0
for x in range(10000000):
    total += x

print(f"main {total = }")

print(time.perf_counter() - starttime)

It took 0.8 seconds, so the whole queue-based version is roughly 8× slower than just adding the integers, which works out to about 0.5 µs of overhead per queue item ((6.1 - 0.8) / 10000000 ≈ 0.53 µs). Not bad!
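If that per-item overhead ever matters, the obvious mitigation would be batching: pack many integers into a single bytes object (bytes are shareable) so the queue pays its cost once per chunk rather than once per integer. A minimal sketch of the idea, dropped into the script above; the chunk size and typecode are arbitrary choices of mine, and in_subinterp would have to unpack the same way:

import array

CHUNK = 100000   # arbitrary batch size

# producer side: replace the per-integer put() loop with chunked puts
chunk = array.array("q")
for x in range(10000000):
    chunk.append(x)
    if len(chunk) == CHUNK:
        to_subinterp.put(chunk.tobytes(), syncobj=True)
        chunk = array.array("q")
if chunk:
    to_subinterp.put(chunk.tobytes(), syncobj=True)
to_subinterp.put(None, syncobj=True)


def consume_batched(to_subinterp):
    # consumer side: the loop inside in_subinterp, adapted to unpack chunks
    total = 0
    while True:
        obj = to_subinterp.get()
        if obj is None:
            break
        total += sum(array.array("q", obj))
    return total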

The next thing that would be interesting to test is the scaling of Python code that updates a shared array across subinterpreters. I personally believe that the scaling would be close to perfect (only running into trouble at ~1 Gbps, when all of the threads are trying to pull data over the same memory bus, just like a C program), but I couldn’t test it because I don’t know how to install a package like NumPy against a manually compiled Python (I normally let Conda manage my environments), and this workaround using ctypes:

import array

# in the main interpreter: load the data and get the address of its buffer
big_array = array.array("i")
big_array.fromfile(open("/tmp/numbers.int32", "rb"), 16*10000000)
pointer, _ = big_array.buffer_info()

# pass the (integer) pointer to the subinterpreter and, inside it, wrap the
# same memory without copying:

import ctypes   # this is the import that fails inside a subinterpreter
big_array = (ctypes.c_int32 * (16*10000000)).from_address(pointer)

didn’t work because

Traceback (most recent call last):
  File "/home/jpivarski/tmp/subinterpreters/subinterp-multithread.py", line 24, in in_subinterp
    import ctypes
  File "/home/jpivarski/tmp/subinterpreters/Python-3.13.0b1/Lib/ctypes/__init__.py", line 8, in <module>
    from _ctypes import Union, Structure, Array
ImportError: module _ctypes does not support loading in subinterpreters
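For what it’s worth, here is roughly what a pure-Python version of the scaling test would look like, using only the APIs already shown working above (same imports as the other scripts). It doesn’t touch a shared array, so it wouldn’t probe the memory-bus hypothesis, only how CPU-bound Python code scales across subinterpreters; the worker count and problem size are arbitrary:

def in_subinterp():
    from test.support.interpreters import queues

    results = queues.Queue(result_id)

    # CPU-bound, pure-Python work on this interpreter's slice of the range
    total = 0
    for x in range(start, stop):
        total += x

    results.put(total, syncobj=True)


N_WORKERS = 16   # one per core
N = N_WORKERS * 10000000

result_queue = queues.create()

starttime = time.perf_counter()

interps = []
for i in range(N_WORKERS):
    subinterp = interpreters.create()
    subinterp.prepare_main({
        "result_id": result_queue.id,
        "start": i * (N // N_WORKERS),
        "stop": (i + 1) * (N // N_WORKERS),
    })
    subinterp.call_in_thread(in_subinterp)
    interps.append(subinterp)   # keep references alive until we're done

total = sum(result_queue.get() for _ in range(N_WORKERS))

print(total, time.perf_counter() - starttime)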

If all of the performance results above are as expected, then they suggest a usage strategy for subinterpreters: pay the creation cost rarely (e.g., keep a long-lived pool of subinterpreters, rather than creating one per task) and communicate with them freely, since queue traffic is cheap compared to inter-process queues.

Is that right? This is a question for the subinterpreter developers: are these rough performance numbers and interpretations in line with what you expect?
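To make that strategy concrete, here is roughly the shape I have in mind. It is only a sketch reusing the API calls demonstrated above; pool_worker, the worker count, and the toy workload (squaring integers) are all placeholders:

def pool_worker():
    from test.support.interpreters import queues

    tasks = queues.Queue(task_id)
    results = queues.Queue(result_id)

    while True:
        item = tasks.get()
        if item is None:   # shutdown sentinel
            break
        results.put(item * item, syncobj=True)   # placeholder "work"


N_WORKERS = 16

task_queue = queues.create()
result_queue = queues.create()

# pay the (relatively slow) interpreter creation cost once, up front
pool = []
for _ in range(N_WORKERS):
    subinterp = interpreters.create()
    subinterp.prepare_main({"task_id": task_queue.id, "result_id": result_queue.id})
    subinterp.call_in_thread(pool_worker)
    pool.append(subinterp)

# then feed it work over the (relatively fast) queues for the life of the program
for x in range(1000):
    task_queue.put(x, syncobj=True)
results = [result_queue.get() for _ in range(1000)]

# shut the pool down
for _ in range(N_WORKERS):
    task_queue.put(None, syncobj=True)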