Extensible Scheduler Class — The Linux Kernel documentation

sched_ext is a scheduler class whose behavior can be defined by a set of BPF programs - the BPF scheduler.

Switching to and from sched_ext

CONFIG_SCHED_CLASS_EXT is the config option to enable sched_ext and tools/sched_ext contains the example schedulers. The following config options should be enabled to use sched_ext:

CONFIG_BPF=y
CONFIG_SCHED_CLASS_EXT=y
CONFIG_BPF_SYSCALL=y
CONFIG_BPF_JIT=y
CONFIG_DEBUG_INFO_BTF=y
CONFIG_BPF_JIT_ALWAYS_ON=y
CONFIG_BPF_JIT_DEFAULT_ON=y

sched_ext is used only when the BPF scheduler is loaded and running.

If a task explicitly sets its scheduling policy to SCHED_EXT, it will be treated as SCHED_NORMAL and scheduled by the fair-class scheduler until the BPF scheduler is loaded.

When the BPF scheduler is loaded and SCX_OPS_SWITCH_PARTIAL is not set in ops->flags, all SCHED_NORMAL, SCHED_BATCH, SCHED_IDLE, and SCHED_EXT tasks are scheduled by sched_ext.

However, when the BPF scheduler is loaded and SCX_OPS_SWITCH_PARTIAL is set in ops->flags, only tasks with the SCHED_EXT policy are scheduled by sched_ext, while tasks with SCHED_NORMAL, SCHED_BATCH and SCHED_IDLE policies are scheduled by the fair-class scheduler, which has higher sched_class precedence than SCHED_EXT.

Terminating the sched_ext scheduler program, triggering SysRq-S, or detection of any internal error including stalled runnable tasks aborts the BPF scheduler and reverts all tasks back to the fair-class scheduler.

The example schedulers can be built and run as follows:

make -j16 -C tools/sched_ext

tools/sched_ext/build/bin/scx_simple

local=0 global=3
local=5 global=24
local=9 global=44
local=13 global=56
local=17 global=72
^CEXIT: BPF scheduler unregistered

The current status of the BPF scheduler can be determined as follows:

cat /sys/kernel/sched_ext/state

enabled

cat /sys/kernel/sched_ext/root/ops

simple

You can check if any BPF scheduler has ever been loaded since boot by examining this monotonically incrementing counter (a value of zero indicates that no BPF scheduler has been loaded):

cat /sys/kernel/sched_ext/enable_seq

1

Each running scheduler also exposes a per-scheduler events file under /sys/kernel/sched_ext/<scheduler-name>/events that tracks diagnostic counters. Each counter occupies one line in name value format:

cat /sys/kernel/sched_ext/simple/events

SCX_EV_SELECT_CPU_FALLBACK 0
SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE 0
SCX_EV_DISPATCH_KEEP_LAST 123
SCX_EV_ENQ_SKIP_EXITING 0
SCX_EV_ENQ_SKIP_MIGRATION_DISABLED 0
SCX_EV_REENQ_IMMED 0
SCX_EV_REENQ_LOCAL_REPEAT 0
SCX_EV_REFILL_SLICE_DFL 456789
SCX_EV_BYPASS_DURATION 0
SCX_EV_BYPASS_DISPATCH 0
SCX_EV_BYPASS_ACTIVATE 0
SCX_EV_INSERT_NOT_OWNED 0
SCX_EV_SUB_BYPASS_DISPATCH 0

The counters are described in kernel/sched/ext_internal.h.

tools/sched_ext/scx_show_state.py is a drgn script which shows more detailed information:

tools/sched_ext/scx_show_state.py

ops           : simple
enabled       : 1
switching_all : 1
switched_all  : 1
enable_state  : enabled (2)
bypass_depth  : 0
nr_rejected   : 0
enable_seq    : 1

Whether a given task is on sched_ext can be determined as follows:

grep ext /proc/self/sched

ext.enabled : 1

The Basics

Userspace can implement an arbitrary BPF scheduler by loading a set of BPF programs that implement struct sched_ext_ops. The only mandatory field is ops.name which must be a valid BPF object name. All operations are optional. The following modified excerpt is from tools/sched_ext/scx_simple.bpf.c showing a minimal global FIFO scheduler.

/*
 * Decide which CPU a task should be migrated to before being
 * enqueued. If an idle core is found by the default ops.select_cpu()
 * implementation, then insert the task directly into SCX_DSQ_LOCAL
 * and skip the ops.enqueue() callback.
 */
s32 BPF_STRUCT_OPS(simple_select_cpu, struct task_struct *p,
		   s32 prev_cpu, u64 wake_flags)
{
	s32 cpu;
	/* Need to initialize or the BPF verifier will reject the program */
	bool direct = false;

	cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &direct);

	if (direct)
		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);

	return cpu;
}

/*
 * Do a direct insertion of the task into the global DSQ. This
 * ops.enqueue() callback is only invoked if ops.select_cpu() above did
 * not insert the task directly.
 */
void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags)
{
	scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
}

s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init)
{
	/*
	 * By default, all SCHED_EXT, SCHED_OTHER, SCHED_IDLE, and
	 * SCHED_BATCH tasks should use sched_ext.
	 */
	return 0;
}

void BPF_STRUCT_OPS(simple_exit, struct scx_exit_info *ei)
{
	exit_type = ei->type;
}

SEC(".struct_ops")
struct sched_ext_ops simple_ops = {
	.select_cpu	= (void *)simple_select_cpu,
	.enqueue	= (void *)simple_enqueue,
	.init		= (void *)simple_init,
	.exit		= (void *)simple_exit,
	.name		= "simple",
};

Dispatch Queues

To match the impedance between the scheduler core and the BPF scheduler, sched_ext uses DSQs (dispatch queues) which can operate as both a FIFO and a priority queue. By default, there is one global FIFO (SCX_DSQ_GLOBAL), and one local DSQ per CPU (SCX_DSQ_LOCAL). The BPF scheduler can manage an arbitrary number of DSQs using scx_bpf_create_dsq() and scx_bpf_destroy_dsq().

A CPU always executes a task from its local DSQ. A task is “inserted” into a DSQ. A task in a non-local DSQ is “move”d into the target CPU’s local DSQ.

When a CPU is looking for the next task to run, if the local DSQ is not empty, the first task is picked. Otherwise, the CPU tries to move a task from the global DSQ. If that doesn’t yield a runnable task either, ops.dispatch() is invoked.

Scheduling Cycle

The following briefly shows how a waking task is scheduled and executed.

  1. When a task is waking up, ops.select_cpu() is the first operation invoked. This serves two purposes: providing a CPU selection optimization hint, and waking up the selected CPU if it is idle.
    The CPU selected by ops.select_cpu() is an optimization hint and not binding. The actual decision is made at the last step of scheduling. However, there is a small performance gain if the CPU ops.select_cpu() returns matches the CPU the task eventually runs on.
    A side-effect of selecting a CPU is waking it up from idle. While a BPF scheduler can wake up any CPU using the scx_bpf_kick_cpu() helper, using ops.select_cpu() judiciously can be simpler and more efficient.
    Note that the scheduler core will ignore an invalid CPU selection, for example, if it’s outside the allowed cpumask of the task.
    A task can be immediately inserted into a DSQ from ops.select_cpu() by calling scx_bpf_dsq_insert() or scx_bpf_dsq_insert_vtime().
    If the task is inserted into SCX_DSQ_LOCAL from ops.select_cpu(), it will be added to the local DSQ of whichever CPU is returned from ops.select_cpu(). Additionally, inserting directly from ops.select_cpu() will cause the ops.enqueue() callback to be skipped.
    Any other attempt to store a task in BPF-internal data structures from ops.select_cpu() does not prevent ops.enqueue() from being invoked. This is discouraged, as it can introduce racy behavior or inconsistent state.
  2. Once the target CPU is selected, ops.enqueue() is invoked (unless the task was inserted directly from ops.select_cpu()). ops.enqueue() can make one of the following decisions:
    • Immediately insert the task into either the global or a local DSQ by calling scx_bpf_dsq_insert() with one of the following options: SCX_DSQ_GLOBAL, SCX_DSQ_LOCAL, or SCX_DSQ_LOCAL_ON | cpu.
    • Immediately insert the task into a custom DSQ by calling scx_bpf_dsq_insert() with a DSQ ID which is smaller than 2^63.
    • Queue the task on the BPF side.
      Task State Tracking and ops.dequeue() Semantics
      A task is in the “BPF scheduler’s custody” when the BPF scheduler is responsible for managing its lifecycle. A task enters custody when it is dispatched to a user DSQ or stored in the BPF scheduler’s internal data structures. Custody is entered only from ops.enqueue() for those operations. The only exception is dispatching to a user DSQ from ops.select_cpu(): although the task is not yet technically in BPF scheduler custody at that point, the dispatch has the same semantic effect as dispatching from ops.enqueue() for custody-related purposes.
      Once ops.enqueue() is called, the task may or may not enter custody depending on what the scheduler does:
    • Directly dispatched to terminal DSQs (SCX_DSQ_LOCAL, SCX_DSQ_LOCAL_ON | cpu, or SCX_DSQ_GLOBAL): the BPF scheduler is done with the task - it either goes straight to a CPU’s local run queue or to the global DSQ as a fallback. The task never enters (or exits) BPF custody, and ops.dequeue() will not be called.
    • Dispatch to user-created DSQs (custom DSQs): the task enters the BPF scheduler’s custody. When the task later leaves BPF custody (dispatched to a terminal DSQ, picked by core-sched, or dequeued for sleep/property changes), ops.dequeue() will be called exactly once.
    • Stored in BPF data structures (e.g., internal BPF queues): the task is in BPF custody. ops.dequeue() will be called when it leaves (e.g., when ops.dispatch() moves it to a terminal DSQ, or on property change / sleep).
      When a task leaves BPF scheduler custody, ops.dequeue() is invoked. The dequeue can happen for different reasons, distinguished by flags:
    1. Regular dispatch: when a task in BPF custody is dispatched to a terminal DSQ from ops.dispatch() (leaving BPF custody for execution), ops.dequeue() is triggered without any special flags.
    2. Core scheduling pick: when CONFIG_SCHED_CORE is enabled and core scheduling picks a task for execution while it’s still in BPF custody, ops.dequeue() is called with the SCX_DEQ_CORE_SCHED_EXEC flag.
    3. Scheduling property change: when a task property changes (via operations like sched_setaffinity(), sched_setscheduler(), priority changes, CPU migrations, etc.) while the task is still in BPF custody, ops.dequeue() is called with the SCX_DEQ_SCHED_CHANGE flag set in deq_flags.
      Important: Once a task has left BPF custody (e.g., after being dispatched to a terminal DSQ), property changes will not trigger ops.dequeue(), since the task is no longer managed by the BPF scheduler.
  3. When a CPU is ready to schedule, it first looks at its local DSQ. If empty, it then looks at the global DSQ. If there still isn’t a task to run, ops.dispatch() is invoked which can use the following two functions to populate the local DSQ.
    • scx_bpf_dsq_insert() inserts a task to a DSQ. Any target DSQ can be used - SCX_DSQ_LOCAL, SCX_DSQ_LOCAL_ON | cpu, SCX_DSQ_GLOBAL or a custom DSQ. While scx_bpf_dsq_insert() currently can’t be called with BPF locks held, this is being worked on and will be supported. scx_bpf_dsq_insert() schedules insertions rather than performing them immediately. There can be up to ops.dispatch_max_batch pending tasks.
    • scx_bpf_dsq_move_to_local() moves a task from the specified non-local DSQ to the dispatching DSQ. This function cannot be called with any BPF locks held. scx_bpf_dsq_move_to_local() flushes the pending insertions before trying to move from the specified DSQ.
  4. After ops.dispatch() returns, if there are tasks in the local DSQ, the CPU runs the first one. If empty, the following steps are taken:
    • Try to move from the global DSQ. If successful, run the task.
    • If ops.dispatch() has dispatched any tasks, retry #3.
    • If the previous task is an SCX task and still runnable, keep executing it (see SCX_OPS_ENQ_LAST).
    • Go idle.

Note that the BPF scheduler can always choose to dispatch tasks immediately in ops.enqueue() as illustrated in the above simple example. If only the built-in DSQs are used, there is no need to implement ops.dispatch() as a task is never queued on the BPF scheduler and both the local and global DSQs are executed automatically.

scx_bpf_dsq_insert() inserts the task on the FIFO of the target DSQ. Use scx_bpf_dsq_insert_vtime() for the priority queue. Internal DSQs such as SCX_DSQ_LOCAL and SCX_DSQ_GLOBAL do not support priority-queue dispatching, and must be dispatched to with scx_bpf_dsq_insert(). See the function documentation and usage in tools/sched_ext/scx_simple.bpf.c for more information.

Task Lifecycle

The following pseudo-code presents a rough overview of the entire lifecycle of a task managed by a sched_ext scheduler:

ops.init_task();        /* A new task is created */
ops.enable();           /* Enable BPF scheduling for the task */

while (task in SCHED_EXT) {
    if (task can migrate)
        ops.select_cpu();   /* Called on wakeup (optimization) */

    ops.runnable();         /* Task becomes ready to run */

    while (task_is_runnable(task)) {
        if (task is not in a DSQ || task->scx.slice == 0) {
            ops.enqueue();  /* Task can be added to a DSQ */

            /* Task property change (i.e., affinity, nice, etc.)? */
            if (sched_change(task)) {
                ops.dequeue(); /* Exiting BPF scheduler custody */
                ops.quiescent();

                /* Property change callback, e.g. ops.set_weight() */

                ops.runnable();
                continue;
            }

            /* Any usable CPU becomes available */

            ops.dispatch();     /* Task is moved to a local DSQ */
            ops.dequeue();      /* Exiting BPF scheduler custody */
        }

        ops.running();      /* Task starts running on its assigned CPU */

        while (task_is_runnable(task) && task->scx.slice > 0) {
            ops.tick();     /* Called every 1/HZ seconds */

            if (task->scx.slice == 0)
                ops.dispatch(); /* task->scx.slice can be refilled */
        }

        ops.stopping();     /* Task stops running (time slice expires or wait) */
    }

    ops.quiescent();        /* Task releases its assigned CPU (wait) */
}

ops.disable();          /* Disable BPF scheduling for the task */
ops.exit_task();        /* Task is destroyed */

Note that the above pseudo-code does not cover all possible state transitions and edge cases.

See the “Scheduling Cycle” section for a more detailed description of how a freshly woken up task gets on a CPU.

Where to Look

  • include/linux/sched/ext.h defines the core data structures, ops table and constants.
  • kernel/sched/ext.c contains the sched_ext core implementation. The functions prefixed with scx_bpf_ can be called from the BPF scheduler.
  • tools/sched_ext hosts the example BPF scheduler implementations.

Module Parameters

sched_ext exposes two module parameters under the sched_ext. prefix that control bypass-mode behavior. These knobs are primarily for debugging; there is usually no reason to change them during normal operation. They can be read and written at runtime (mode 0600) via /sys/module/sched_ext/parameters/.

sched_ext.slice_bypass_us (default: 5000 µs)

The time slice assigned to all tasks when the scheduler is in bypass mode, i.e. during BPF scheduler load, unload, and error recovery. Valid range is 100 µs to 100 ms.

sched_ext.bypass_lb_intv_us (default: 500000 µs)

The interval at which the bypass-mode load balancer redistributes tasks across CPUs. Set to 0 to disable load balancing during bypass mode. Valid range is 0 to 10 s.

ABI Instability

The APIs provided by sched_ext to BPF scheduler programs have no stability guarantees. This includes the ops table callbacks and constants defined in include/linux/sched/ext.h, as well as the scx_bpf_ kfuncs defined in kernel/sched/ext.c and kernel/sched/ext_idle.c.

While we will attempt to provide a relatively stable API surface when possible, these interfaces are subject to change without warning between kernel versions.