Add asyncio monitoring hooks
As an APM vendor (Datadog), I’d love to be able to provide deeper insights/metrics about the asyncio event loop for our customers.
Some of the metrics I have been thinking about so far are:
- Over the lifetime of the process:
  - total number of tasks created/finished
  - total number of context switches
  - min/avg/p95/p99/max/etc. blocking time per task run
- Per task:
  - total number of context switches
  - total blocking vs. idle time
I’d also like to be able to associate the per-task metrics on any active APM tracing spans (“span” = duration + metadata measuring a function or arbitrary code block). This way I can show when a traced function is responsible for blocking the event loop for a certain period of time (similar concept to the existing debug logging in asyncio).
I wasn’t able to figure out a great way to get these metrics from the existing implementation/API (if I missed something, please let me know!).
I made an attempt at a reference implementation to validate one potential approach.
This approach adds `sys.set_async_monitoring_hooks`/`sys.get_async_monitoring_hooks`, which allow users to add hooks for the following events:
- `task_enter(task)`: called when `Modules/_asynciomodule.c:enter_task` is called; a task is being context switched to.
- `task_leave(task)`: called when `Modules/_asynciomodule.c:leave_task` is called; a task is being context switched away from.
- `task_register(task)`: called when `Modules/_asynciomodule.c:register_task` is called.
- `task_unregister(task)`: called when `Modules/_asynciomodule.c:unregister_task` is called.
`task_register` and `task_unregister` don't appear to be called as part of "normal" standard-library-only asyncio usage, so I am not convinced of their value here, but I added them to play with.
Example usage
This is a very rudimentary example to show the high level API usage.
```python
import asyncio
import sys

def task_enter(task: asyncio.Task):
    print("Entering task:", task.get_name())

def task_leave(task: asyncio.Task):
    if task.done():
        print("Task completed:", task.get_name())
    else:
        print("Leaving task:", task.get_name())

sys.set_async_monitoring_hooks(
    task_register=None,
    task_enter=task_enter,
    task_leave=task_leave,
    task_unregister=None,
)

async def task():
    pass

async def main():
    await asyncio.gather(*[task() for _ in range(100)])

if __name__ == "__main__":
    loop = asyncio.new_event_loop()
    loop.run_until_complete(main())
```
Why event hooks vs built-in metrics?
An obvious alternative here is to have asyncio maintain a set of metrics and expose them via an API that can be polled when the data is needed.
I think this is a good idea in the long run, but I thought adding the hooks could be a good intermediate step: it allows the community to try generating their own metrics, which we can then use to validate exactly which ones are valuable to have built in.
As well, for some of my own specific use cases, built-in metrics would not be enough. For an APM tracing product, I'd like to be able to compute/access these metrics on a per-task basis while the task is still active, to associate them with any currently active spans (I basically need to access contextvars from the currently active task).
Prior Art
Greenlet offers a settrace hook that allows you to register a function that gets called every time a context switch occurs.
Example:
```python
import greenlet

def callback(event, args):
    if event in ('switch', 'throw'):
        origin, target = args
        print("Transfer from %s to %s with %s"
              % (origin, target, event))

greenlet.settrace(callback)
```
I may have missed something, but I had trouble consistently getting the current and next tasks in order to offer a similar API with a single hook (e.g. `callback(from_task, to_task)`).
brettlangdon (Brett Langdon) May 4, 2023, 1:27pm 2
Hey, @orf, @dimaqq, @steve.dower, I hope you don't mind me pinging you on this. Since you all proposed/discussed a similar topic before, I'd be curious to hear your feedback!
dimaqq (Dima Tisnek) May 9, 2023, 10:28pm 3
Yes, please, I'd like to see async/await instrumentation made possible.
The approach you highlighted makes sense to me; it's reasonably lightweight, yet potentially provides just enough information.
Debuggers would need another form of instrumentation, but that's OK.
I think, perhaps, some restrictions may be necessary on the callbacks. For example, is it OK for a callback to cancel some task synchronously? What if it's the task that's being switched to? I'd rather ban task modifications in the callbacks entirely.
brettlangdon (Brett Langdon) May 11, 2023, 11:59pm 4
Hey @dimaqq what type of instrumentation do you have in mind?
Do you have any examples of how you would utilize these hooks?
That's a good callout. I am trying to think of use cases where you would need to modify the tasks (outside of reading/writing contextvars data), but I don't have anything concrete.
dimaqq (Dima Tisnek) May 12, 2023, 3:25am 5
My little attempt at instrumentation is now dead, due to changes in how the program counter is updated in the stack frame between Python 3.9 and 3.10.
It's archived here: GitHub - dimaqq/awaitwhat: Await, What? (shameless plug).
The idea was to take a snapshot of all tasks and coroutines and show what is waiting for what.
It was originally inspired by trying to debug a large async/await program that would be stuck or semi-stuck in production only (much traffic, long time to lockup).
A debugger may be interested in the very same – suspend the user program, and show the equivalent of all stack traces in a multithreaded program, except for asyncio, for all tasks.
W.r.t. your proposal, all I can think of right now is, in fact, telemetry.
achimnol (Joongi Kim) May 23, 2024, 3:17pm 6
There is another use case for fine-grained asyncio hooks to implement “task scope” to structure asyncio server applications.
Also, please have a look at aiomonitor.
brettlangdon (Brett Langdon) May 23, 2024, 4:13pm 7
@achimnol thanks for sharing the talk and aiomonitor. I had not dug into aiomonitor much before. I see it relies heavily on the task factory, which, as you obviously know, only allows limited hook points for monitoring task execution. It's hard to get any finer-grained than started/finished.
guido (Guido van Rossum) May 23, 2024, 6:59pm 8
Just found this thread.
Have you considered replacing the entire event loop? This is possible (e.g. IIUC uvloop does this), and gives you complete control. Of course you inherit from the standard event loop, but you can replace anything you need.
This gives you control without the need for API bike shedding and waiting for users to adopt the latest Python version.
If this doesn’t work for you let’s discuss what’s missing.
brettlangdon (Brett Langdon) May 23, 2024, 7:51pm 9
Hey @guido, thanks for replying.
My main goal is to try and monitor tasks regardless of which event loop is being used.
Since we are building observability products, we’d like our users to still have full control over which event loop they want to use. Ideally we always want to be able to offer our instrumentation as transparently as possible (e.g. “patch_all”/single entry bootstrap/setup functions are better than manually asking people to set an event loop or policy).
We've messed around with monkey patching and wrapping `asyncio.set_event_loop`, or setting custom event loop policies, but we've run into issues with users setting their own loop or policy and removing ours. Trying to balance an automatic OOTB experience, with little knowledge needed by the user, against compatibility with the user's existing codebase can be difficult.
Like you said, we have some things we do today, and options to get something more, but I don’t mind the delay from API bike shedding and waiting for users to upgrade. If we can help build more observability capabilities in Python, then that is a win in my book (and… building cool features with them gives us a way to nudge users to upgrade to newer Python versions).
I’m open to any/all suggestions, especially if my original idea/direction doesn’t make sense or isn’t in the best interest of Python for the long term.
achimnol (Joongi Kim) May 25, 2024, 12:01pm 10
The same argument applies here. Task factories can do many things, but we cannot compose them (i.e., it gets complicated if we want to have multiple extensions at the same time). Event loops are the same. Moreover, the task/taskgroup implementation is not very extensible/open to adding a multitude of 3rd-party extensions.
alicederyn (Alice) May 25, 2024, 1:16pm 11
Task factories can compose provided they are written correctly.
achimnol (Joongi Kim) May 25, 2024, 2:05pm 12
Could you give some examples?
alicederyn (Alice) May 25, 2024, 2:16pm 13
Take monitoring. You want to time task methods, so you provide a task wrapper that delegates to another task but times how long calls take. Now, in order to compose with other factories, you just need to store the current factory when you install your own, and delegate to it to create the tasks that your wrapper then delegates to.
Or suppose you need to set up a thread-local whenever a task is running (don't ask). Same deal: make a task wrapper, and delegate to the task created by the factory that was there when you installed yourself.
These two know nothing about each other, but they will interoperate transparently.
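A minimal sketch of this delegation pattern (the helper name is made up; the fallback branch constructs `asyncio.tasks.Task` directly when no factory was previously installed, mirroring what `BaseEventLoop.create_task` does by default, which is exactly the part of this pattern the asyncio docs discourage):

```python
import asyncio

def install_tagging_factory(loop, tag, log):
    """Install a task factory that records `tag` for every task created,
    then delegates to whatever factory was installed before it."""
    prev = loop.get_task_factory()

    def factory(loop, coro, **kwargs):
        log.append(tag)
        if prev is not None:
            return prev(loop, coro, **kwargs)
        # No previous factory: mirror the default create_task branch.
        return asyncio.tasks.Task(coro, loop=loop, **kwargs)

    loop.set_task_factory(factory)

# Two such factories know nothing about each other but stack cleanly:
log = []
loop = asyncio.new_event_loop()
install_tagging_factory(loop, "inner", log)
install_tagging_factory(loop, "outer", log)

async def main():
    await loop.create_task(asyncio.sleep(0))

loop.run_until_complete(main())  # creates the main task plus one more
loop.close()
```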
alicederyn (Alice) May 25, 2024, 2:20pm 14
To be clear, I don’t consider this an argument against adding monitoring hooks. I think that’s a great goal.
achimnol (Joongi Kim) May 25, 2024, 7:22pm 15
Referring to the existing task factory from within a new task factory might be an interim solution, but I am still not sure that composing task factories as you describe is an intended feature or use case.
```python
>>> async def main():
...     loop = asyncio.get_running_loop()
...     print(loop.get_task_factory())
...
>>> asyncio.run(main())
None
```
By default, `loop.get_task_factory()` is None. If a task factory is set, `BaseEventLoop.create_task()` will call our task factory, so it may cause infinite recursion if we simply call `loop.create_task()` again to compose task factories. To make it fully transparent, we need to do what `BaseEventLoop.create_task()` does, and all potential task factory implementations must have a branch to handle this case.
This would couple the internal implementation of asyncio and the task factory provider. Also, the asyncio manual says that instantiating `asyncio.tasks.Task` objects ourselves is discouraged:
Use the high-level asyncio.create_task() function to create Tasks, or the low-level loop.create_task() or ensure_future() functions. Manual instantiation of Tasks is discouraged.
To write deep-monitoring libraries, we often need to touch some private implementations and APIs. But our goal and request is to improve asyncio to avoid it as much as possible.
achimnol (Joongi Kim) May 25, 2024, 7:28pm 16
Yes, task factory composition done correctly is not enough to implement fine-grained monitoring like tracking context switches between tasks/coroutines.
alicederyn (Alice) May 25, 2024, 8:57pm 17
You mean, create a task? Yes, I agree, that edge case is needed. It would be better if `loop.get_task_factory()` had an option to return a function that does this, instead of None.
However, the assertion was that task factories cannot be composed. This is not true.
Show me any evidence.
achimnol (Joongi Kim) May 25, 2024, 10:47pm 18
I meant "cannot" to include unintended/unsupported usage which may break at any time. What if another library author forgets something when adding task factories? As you describe, yes, we can compose task factories with caution. I just don't like that we need this extra care to extend tasks.
I have experienced so many cases where a 3rd party shot me in the foot… Although given in Korean, I have done a PyCon talk about this: [PyCon KR 2019] Real-world asyncio - Speaker Deck. Also, my PyCon talk about aiomonitor has several troubleshooting cases: [PyCon APAC 2023] Improving Debuggability of Complex Asyncio Applications - Speaker Deck. I don't want to go through the same pattern with task factories, which would happen with sloppy API designs relying on library authors' extra caution.
For the original design purpose and expected usage of task factories in asyncio, I’d like to invite @yselivanov to this thread.
achimnol (Joongi Kim) May 28, 2024, 11:28am 19
Tinche (Tin Tvrtković) May 28, 2024, 12:27pm 20
Looking at the initial proposal and the discussion in the thread, my opinion is that some of the metrics the OP is looking for sound quite complex; in this case I'd probably reach for a custom task factory and/or a custom event loop. Also, the OP's proposal (`sys.set_async_monitoring_hooks`) feels very global.
I think in practice folks use either the standard library event loop or uvloop. Have you considered exposing functions that wrap these two loops?
However, detecting CPU stalls in task steps seems like an extremely basic thing that we should look to support in a generic way. We keep educating our users about the dangers of event loop stalls if they hog the event loop, but we only have very rudimentary ways of detecting this. But it still feels like it should be a hook on the event loop itself, not necessarily asyncio.
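One rudimentary way to detect such stalls today, without any new hooks, is a watchdog task that measures scheduling latency: if `asyncio.sleep(interval)` wakes up much later than requested, something blocked the loop in between. A sketch (names and thresholds here are made up for the example):

```python
import asyncio
import time

async def stall_watchdog(interval=0.05, threshold=0.1, report=print):
    """Report when sleep(interval) wakes up more than `threshold`
    seconds late, meaning something hogged the event loop."""
    while True:
        start = time.monotonic()
        await asyncio.sleep(interval)
        lag = time.monotonic() - start - interval
        if lag > threshold:
            report(lag)

async def main():
    stalls = []
    watchdog = asyncio.create_task(
        stall_watchdog(interval=0.01, threshold=0.05, report=stalls.append))
    await asyncio.sleep(0.05)   # loop is healthy: no reports expected
    time.sleep(0.2)             # synchronous call hogs the event loop
    await asyncio.sleep(0.05)   # give the watchdog a chance to wake up
    watchdog.cancel()
    return stalls

stalls = asyncio.run(main())
```

This only tells you *that* the loop stalled, not *which* task was responsible; attributing the stall to a task is what the proposed enter/leave hooks would add.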