A Minimal Scheduler with eBPF, sched_ext and C

This tutorial provides a hands-on introduction to writing a Linux scheduler directly in C using eBPF and the sched_ext framework. We’ll build a minimal scheduler with a single global scheduling queue: every CPU takes its next task from this queue and runs it for one time slice, and a task that is still runnable afterwards rejoins the end of the queue. The result is a First-In-First-Out (FIFO) round-robin scheduler.

Round-Robin Scheduling Overview

graph TB
    subgraph "Round-Robin Scheduler"
        Queue["Global Task Queue<br/>(FIFO)"]
        CPU1["CPU 1"]
        CPU2["CPU 2"]
        CPU3["CPU 3"]
        CPU4["CPU 4"]

        Queue --> CPU1
        Queue --> CPU2
        Queue --> CPU3
        Queue --> CPU4

        CPU1 --> Queue
        CPU2 --> Queue
        CPU3 --> Queue
        CPU4 --> Queue
    end

    style Queue fill:#e1f5fe
    style CPU1 fill:#f3e5f5
    style CPU2 fill:#f3e5f5
    style CPU3 fill:#f3e5f5
    style CPU4 fill:#f3e5f5
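
The global queue in the diagram can be modeled in ordinary user-space C. The sketch below is not kernel code and does not use sched_ext; the names `task_queue`, `queue_push`, `queue_pop`, and `run_one_slice` are made up for illustration. It shows why taking the oldest task and putting it back at the tail produces round-robin order.

```c
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAPACITY 8  /* illustrative fixed capacity */

/* A fixed-size ring buffer holding task IDs in arrival order. */
struct task_queue {
    int    tasks[QUEUE_CAPACITY];
    size_t head;   /* index of the oldest task */
    size_t count;  /* number of queued tasks */
};

/* Append a task at the tail; returns false if the queue is full. */
bool queue_push(struct task_queue *q, int task_id) {
    if (q->count == QUEUE_CAPACITY)
        return false;
    q->tasks[(q->head + q->count) % QUEUE_CAPACITY] = task_id;
    q->count++;
    return true;
}

/* Remove the oldest task; returns false if the queue is empty. */
bool queue_pop(struct task_queue *q, int *task_id) {
    if (q->count == 0)
        return false;
    *task_id = q->tasks[q->head];
    q->head = (q->head + 1) % QUEUE_CAPACITY;
    q->count--;
    return true;
}

/* One scheduling step: a CPU takes the oldest task, "runs" it for a
 * slice, and puts it back at the tail. Returns the task that ran,
 * or -1 if the queue was empty. */
int run_one_slice(struct task_queue *q) {
    int task_id;
    if (!queue_pop(q, &task_id))
        return -1;
    queue_push(q, task_id);  /* still runnable: back of the line */
    return task_id;
}
```

With tasks 10, 20, and 30 queued in that order, four calls to `run_one_slice` return 10, 20, 30, and then 10 again.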

Requirements

To build and run a custom scheduler, you’ll need:

System Requirements

Linux kernel 6.12 or newer, or a patched 6.11 kernel with sched_ext support
Recent clang compiler for eBPF compilation
bpftool for attaching the scheduler

Installation on Ubuntu

sudo apt install clang linux-tools-common linux-tools-$(uname -r)

Getting the Code

git clone https://github.com/parttimenerd/minimal-scheduler
cd minimal-scheduler

Quick Start

Building, starting, and stopping the scheduler takes three steps:

# Build the scheduler (compiles it to a BPF object file)
./build.sh

# Start the scheduler (requires root)
sudo ./start.sh

# Stop the scheduler when done
sudo ./stop.sh

The Scheduler Implementation

Let’s dive into the core scheduler code in sched_ext.bpf.c:

// Auto-generated header containing kernel structures
#include <vmlinux.h>
// Linux BPF helper functions
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

// Define a shared Dispatch Queue (DSQ) ID
// This serves as our global scheduling queue
#define SHARED_DSQ_ID 0

// Macros for cleaner code and proper binary section placement
#define BPF_STRUCT_OPS(name, args...)  \
    SEC("struct_ops/"#name)  BPF_PROG(name, ##args)

#define BPF_STRUCT_OPS_SLEEPABLE(name, args...)  \
    SEC("struct_ops.s/"#name)                    \
    BPF_PROG(name, ##args)

// Initialize the scheduler by creating a shared dispatch queue
s32 BPF_STRUCT_OPS_SLEEPABLE(sched_init) {
    // Create a shared DSQ that all CPUs can access
    // (-1 means no NUMA node preference)
    return scx_bpf_create_dsq(SHARED_DSQ_ID, -1);
}

// Enqueue a task that wants to run
int BPF_STRUCT_OPS(sched_enqueue, struct task_struct *p, u64 enq_flags) {
    // Adaptive time slice: 5 ms divided by the number of queued tasks,
    // so a longer queue means shorter slices and better responsiveness.
    // An empty queue (or an error return) counts as 1 to avoid dividing by zero.
    s32 queued = scx_bpf_dsq_nr_queued(SHARED_DSQ_ID);
    u64 slice = 5000000u / (queued > 0 ? queued : 1);
    scx_bpf_dispatch(p, SHARED_DSQ_ID, slice, enq_flags);
    return 0;
}

// Dispatch a task from shared DSQ to CPU when CPU becomes available
int BPF_STRUCT_OPS(sched_dispatch, s32 cpu, struct task_struct *prev) {
    scx_bpf_consume(SHARED_DSQ_ID);
    return 0;
}

// Main scheduler operations structure
SEC(".struct_ops.link")
struct sched_ext_ops sched_ops = {
    .enqueue   = (void *)sched_enqueue,
    .dispatch  = (void *)sched_dispatch,
    .init      = (void *)sched_init,
    .flags     = SCX_OPS_ENQ_LAST | SCX_OPS_KEEP_BUILTIN_IDLE,
    .name      = "minimal_scheduler"
};

// GPL license required for all schedulers
char _license[] SEC("license") = "GPL";
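
The slice calculation in sched_enqueue divides a 5 ms base by the queue length. The following user-space sketch isolates that arithmetic so it can be tried without loading a scheduler. The function name `adaptive_slice_ns` is hypothetical, and treating an empty queue as length 1 is an assumption of this sketch that keeps the division defined.

```c
#include <stdint.h>

#define BASE_SLICE_NS 5000000ULL  /* 5 ms, the base value used in sched_enqueue */

/* Returns the slice for a task given the current queue length.
 * Non-positive lengths (an empty queue or an error code) are treated
 * as 1 so the division is always defined. */
uint64_t adaptive_slice_ns(int32_t nr_queued) {
    uint64_t divisor = nr_queued > 0 ? (uint64_t)nr_queued : 1;
    return BASE_SLICE_NS / divisor;
}
```

With four tasks queued, each new task receives 1.25 ms; with an empty queue it receives the full 5 ms.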

Scheduler Flow Visualization

sequenceDiagram
    participant T as Task
    participant S as Scheduler
    participant Q as Global Queue
    participant C as CPU

    Note over S: sched_init()
    S->>Q: Create shared DSQ

    Note over T: Task wants to run
    T->>S: Request scheduling
    S->>S: sched_enqueue()
    S->>S: Calculate time slice
    S->>Q: Add task to queue

    Note over C: CPU becomes available
    C->>S: Need task to run
    S->>S: sched_dispatch()
    S->>Q: Get next task
    Q->>C: Assign task

    Note over C: Time slice expires
    C->>S: Task time up
    S->>Q: Return to queue (if needed)
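
The enqueue, dispatch, and slice-expiry steps in the diagram can be traced with a small user-space simulation. This is a sketch under simplifying assumptions (one CPU, a fixed slice, all tasks runnable from the start), and `simulate_round_robin` is a made-up name rather than anything from sched_ext.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_TASKS 16  /* illustrative upper bound */

/* Simulates one CPU serving a FIFO queue with a fixed slice.
 * work_ns[i] is the CPU time task i still needs (modified in place).
 * finish_order receives task indices in completion order.
 * Returns the number of dispatches, or 0 for invalid arguments. */
size_t simulate_round_robin(uint64_t *work_ns, size_t n, uint64_t slice_ns,
                            size_t *finish_order) {
    size_t queue[MAX_TASKS];
    size_t head = 0, count = 0, dispatches = 0, finished = 0;

    if (n > MAX_TASKS || slice_ns == 0)
        return 0;

    /* Enqueue step: every task with work to do enters the queue. */
    for (size_t i = 0; i < n; i++) {
        if (work_ns[i] > 0)
            queue[(head + count++) % MAX_TASKS] = i;
    }

    while (count > 0) {
        /* Dispatch step: take the oldest task. */
        size_t task = queue[head];
        head = (head + 1) % MAX_TASKS;
        count--;
        dispatches++;

        /* Run until the slice expires or the task finishes. */
        uint64_t ran = work_ns[task] < slice_ns ? work_ns[task] : slice_ns;
        work_ns[task] -= ran;

        if (work_ns[task] > 0)
            queue[(head + count++) % MAX_TASKS] = task;  /* back to the queue */
        else
            finish_order[finished++] = task;
    }
    return dispatches;
}
```

For three tasks needing 30, 10, and 20 time units with a slice of 10, the simulation performs six dispatches and the tasks finish in the order 1, 2, 0.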

Building and Running

Build Process (`build.sh`)

# Generate kernel headers
bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h

# Compile scheduler to BPF bytecode
clang -target bpf -g -O2 -c sched_ext.bpf.c -o sched_ext.bpf.o -I.

Starting the Scheduler (`start.sh`)

# Register the scheduler
bpftool struct_ops register sched_ext.bpf.o /sys/fs/bpf/sched_ext

Verification

# Check active scheduler
cat /sys/kernel/sched_ext/root/ops
# Output: minimal_scheduler

# Check kernel messages
sudo dmesg | tail
# Should show: sched_ext: BPF scheduler "minimal_scheduler" enabled

Stopping the Scheduler (`stop.sh`)

# Remove the scheduler registration
rm /sys/fs/bpf/sched_ext/sched_ops

Scheduler Architecture

graph TD
    subgraph "eBPF Scheduler Architecture"
        Init["sched_init()<br/>Initialize DSQ"]
        Enqueue["sched_enqueue()<br/>Add tasks to queue"]
        Dispatch["sched_dispatch()<br/>Assign tasks to CPUs"]

        subgraph "Global State"
            DSQ["Shared Dispatch Queue<br/>(FIFO)"]
            TS["Time Slice Calculator"]
        end

        subgraph "System Interface"
            Tasks["Incoming Tasks"]
            CPUs["Available CPUs"]
        end

        Init --> DSQ
        Tasks --> Enqueue
        Enqueue --> TS
        TS --> DSQ
        CPUs --> Dispatch
        Dispatch --> DSQ
    end

    style Init fill:#e8f5e8
    style Enqueue fill:#fff3e0
    style Dispatch fill:#f3e5f5
    style DSQ fill:#e1f5fe

Practical Experiments

1. Vary the Time Slice

Experiment with different time slice values to understand system behavior. Use one of the following definitions of slice in sched_enqueue at a time:

// Large time slice (1 second)
u64 slice = 1000000000u; // 1s

// Small time slice (100 microseconds)
u64 slice = 100000u; // 100us

// Adaptive time slice: 5 ms divided by the queue length (at least 1)
s32 queued = scx_bpf_dsq_nr_queued(SHARED_DSQ_ID);
u64 slice = 5000000u / (queued > 0 ? queued : 1);

Observations:

Large time slices: Less responsive UI, but potentially better throughput
Small time slices: More responsive, but higher context switch overhead
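
The trade-off in these observations can be put into numbers. The sketch below estimates the share of CPU time lost to context switches for a given slice length. It assumes every slice is fully used and that each switch has a fixed cost; the 5 µs figure used in the note afterwards is an illustrative assumption, not a measurement, and `switch_overhead_ppm` is a made-up name.

```c
#include <stdint.h>

/* Share of CPU time spent switching, in parts per million, if every
 * slice is fully used and each switch costs switch_cost_ns.
 * Returns 0 when both inputs are 0. */
uint64_t switch_overhead_ppm(uint64_t slice_ns, uint64_t switch_cost_ns) {
    uint64_t total = slice_ns + switch_cost_ns;
    if (total == 0)
        return 0;
    return (1000000ULL * switch_cost_ns) / total;
}
```

With an assumed switch cost of 5 µs, a 100 µs slice loses about 4.8% of CPU time to switching (47619 ppm), a 5 ms slice about 0.1% (999 ppm), and a 1 s slice a negligible amount (4 ppm).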

2. Fixed vs. Adaptive Time Slice

Compare fixed time slice with adaptive scheduling:

// Fixed time slice implementation
int BPF_STRUCT_OPS(sched_enqueue, struct task_struct *p, u64 enq_flags) {
    u64 slice = 5000000u; // Fixed 5ms
    scx_bpf_dispatch(p, SHARED_DSQ_ID, slice, enq_flags);
    return 0;
}
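
One way to reason about the difference before measuring it is to compare worst-case waiting times. The following user-space sketch is idealized: it assumes a single CPU, that every task ahead in the queue uses its full slice, and that the adaptive slice was computed from a constant queue length. The function names are hypothetical.

```c
#include <stdint.h>

#define BASE_SLICE_NS 5000000ULL  /* 5 ms, as in the tutorial's code */

/* Worst-case wait for a task that joins behind nr_ahead tasks,
 * each of which runs for a full fixed slice. */
uint64_t fixed_wait_ns(uint64_t nr_ahead) {
    return nr_ahead * BASE_SLICE_NS;
}

/* Same wait if each of those tasks instead received BASE_SLICE_NS
 * divided by the queue length (idealized as nr_ahead). */
uint64_t adaptive_wait_ns(uint64_t nr_ahead) {
    if (nr_ahead == 0)
        return 0;
    return nr_ahead * (BASE_SLICE_NS / nr_ahead);
}
```

Under these assumptions, ten tasks ahead mean a wait of up to 50 ms with the fixed slice, while the adaptive slice keeps the wait at no more than 5 ms regardless of queue length.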

3. CPU Affinity Control

Limit scheduling to specific CPUs:

int BPF_STRUCT_OPS(sched_dispatch, s32 cpu, struct task_struct *prev) {
    // Only schedule on CPU 0 (single-core simulation).
    // Caution: tasks pinned to other CPUs cannot run under this rule,
    // and the sched_ext watchdog may then unload the scheduler.
    if (cpu == 0) {
        scx_bpf_consume(SHARED_DSQ_ID);
    }
    return 0;
}
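
Restricting dispatch to one CPU interacts with per-task CPU affinity. The sketch below models affinity as a plain 64-bit mask in user space to show which tasks could still make progress. It is an illustration only: the kernel uses its own cpumask type, and `cpu_allowed` and `count_starved` are made-up names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if the bitmask allows the task to run on the given CPU
 * (bit i set means CPU i is allowed; supports CPUs 0..63). */
bool cpu_allowed(uint64_t allowed_mask, unsigned cpu) {
    return cpu < 64 && ((allowed_mask >> cpu) & 1u);
}

/* Counts tasks that can never run when only consuming_cpu pulls
 * tasks from the shared queue. */
size_t count_starved(const uint64_t *masks, size_t n, unsigned consuming_cpu) {
    size_t starved = 0;
    for (size_t i = 0; i < n; i++) {
        if (!cpu_allowed(masks[i], consuming_cpu))
            starved++;
    }
    return starved;
}
```

For tasks with masks 0xF, 0x2, 0x1, and 0xC, two of them (0x2 and 0xC) exclude CPU 0 and would never be picked when only CPU 0 consumes from the queue.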

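4. Multi-Queue Architecture

Implement multiple scheduling queues:

#define HIGH_PRIORITY_DSQ 0
#define LOW_PRIORITY_DSQ  1

// Note: sched_init must create both DSQs, and sched_dispatch must
// consume from both (for example the high-priority one first);
// otherwise tasks in the second queue never run.
int BPF_STRUCT_OPS(sched_enqueue, struct task_struct *p, u64 enq_flags) {
    // Odd thread-group IDs go to the high-priority queue
    // (an arbitrary split for demonstration)
    u32 dsq_id = (p->tgid % 2) ? HIGH_PRIORITY_DSQ : LOW_PRIORITY_DSQ;
    u64 slice = 5000000u;
    scx_bpf_dispatch(p, dsq_id, slice, enq_flags);
    return 0;
}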
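
With two queues, the dispatch side needs a rule for which queue to serve. The user-space sketch below pairs the placement rule from the experiment with one possible choice, strict priority; the experiment itself does not prescribe a dispatch order, and `dsq_for_tgid`, `pick_dsq`, and `NO_DSQ` are hypothetical names.

```c
#include <stddef.h>

enum { HIGH_PRIORITY_DSQ = 0, LOW_PRIORITY_DSQ = 1, NO_DSQ = -1 };

/* The experiment's placement rule: odd thread-group IDs go to the
 * high-priority queue, even ones to the low-priority queue. */
int dsq_for_tgid(unsigned tgid) {
    return (tgid % 2) ? HIGH_PRIORITY_DSQ : LOW_PRIORITY_DSQ;
}

/* Strict-priority dispatch: serve the high-priority queue whenever it
 * has tasks, and fall back to the low-priority queue otherwise. */
int pick_dsq(size_t nr_high, size_t nr_low) {
    if (nr_high > 0)
        return HIGH_PRIORITY_DSQ;
    if (nr_low > 0)
        return LOW_PRIORITY_DSQ;
    return NO_DSQ;
}
```

Strict priority is simple but can starve the low-priority queue while the high-priority queue stays busy, which is worth observing in the experiment.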

Advanced Features

Task Lifecycle Hooks

Monitor task execution with additional hooks:

// Called when task starts running
int BPF_STRUCT_OPS(sched_running, struct task_struct *p) {
    // bpf_get_smp_processor_id() is the BPF helper for the current CPU
    u32 cpu = bpf_get_smp_processor_id();
    // bpf_printk() output appears in /sys/kernel/debug/tracing/trace_pipe
    bpf_printk("Task %d started on CPU %d\n", p->pid, cpu);
    return 0;
}

// Called when task stops running
int BPF_STRUCT_OPS(sched_stopping, struct task_struct *p, bool runnable) {
    bpf_printk("Task %d stopped, runnable: %d\n", p->pid, runnable);
    return 0;
}

// Register both hooks in sched_ops:
//     .running   = (void *)sched_running,
//     .stopping  = (void *)sched_stopping,

Performance Monitoring

Add performance counters and statistics:

// BPF map for collecting scheduler statistics
// (one slot per statistic, each with a separate counter per CPU)
struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
    __uint(max_entries, 4);
    __type(key, u32);
    __type(value, u64);
} stats_map SEC(".maps");

// Increment scheduling events
// No atomics needed: each CPU updates only its own copy of the counter
static __always_inline void increment_stat(u32 stat_type) {
    u64 *count = bpf_map_lookup_elem(&stats_map, &stat_type);
    if (count) {
        (*count)++;
    }
}
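
Because the map is a per-CPU array, a user-space reader receives one value per possible CPU for each key and has to add them up itself. The sketch below shows only that aggregation step on a plain array; reading the map is left out, and the `stat_type` names are hypothetical labels for the four slots.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical meaning of the four map slots. */
enum stat_type { STAT_ENQUEUE, STAT_DISPATCH, STAT_RUNNING, STAT_STOPPING, NR_STATS };

/* Sums one statistic across CPUs. percpu_values holds one counter per
 * possible CPU, which is how a per-CPU array lookup presents a single
 * key to user space. */
uint64_t sum_percpu(const uint64_t *percpu_values, size_t nr_cpus) {
    uint64_t total = 0;
    for (size_t cpu = 0; cpu < nr_cpus; cpu++)
        total += percpu_values[cpu];
    return total;
}
```

For example, per-CPU enqueue counts of 3, 0, 7, and 2 on a four-CPU machine add up to 12 enqueue events.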

Debugging and Troubleshooting

Common Issues

Scheduler Not Loading

# Check kernel version
uname -r
# Verify sched_ext support
ls /sys/kernel/sched_ext/

Permission Errors

# Ensure running as root
sudo ./start.sh
# Check file permissions
ls -la sched_ext.bpf.o

Compilation Errors

# Verify clang version
clang --version
# Check for missing headers
find /usr -name "bpf_helpers.h" 2>/dev/null

Monitoring Scheduler Behavior

# Watch scheduler statistics
watch -n 1 'cat /proc/schedstat'

# Monitor context switches
vmstat 1

# Trace scheduler events (quote the pattern so the shell does not expand it)
sudo perf trace -e 'sched:*'

Performance Considerations

Time Slice Optimization

graph LR
    subgraph "Time Slice Impact"
        A["Large Time Slice<br/>(>50ms)"] --> B["High Throughput<br/>Low Responsiveness"]
        C["Small Time Slice<br/>(<1ms)"] --> D["High Responsiveness<br/>High Overhead"]
        E["Adaptive Time Slice<br/>(1-10ms)"] --> F["Balanced Performance"]
    end

    style A fill:#ffcdd2
    style C fill:#ffcdd2
    style E fill:#c8e6c9

Queue Management

FIFO Queue: Simple, but ignores task priorities, so a latency-sensitive task waits behind every task queued before it
Priority Queues: Better for real-time systems
Multi-level Queues: Good for mixed workloads

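Extended Exercises

1. Priority Scheduling

Implement a priority-based scheduler using task priority:

// Uses HIGH_PRIORITY_DSQ and LOW_PRIORITY_DSQ from the multi-queue experiment.
// A prio of 120 corresponds to nice 0; lower values mean higher priority.
int BPF_STRUCT_OPS(sched_enqueue, struct task_struct *p, u64 enq_flags) {
    u32 dsq_id = (p->prio < 120) ? HIGH_PRIORITY_DSQ : LOW_PRIORITY_DSQ;
    // High-priority tasks get 10 ms, all others 5 ms
    u64 slice = (p->prio < 120) ? 10000000u : 5000000u;
    scx_bpf_dispatch(p, dsq_id, slice, enq_flags);
    return 0;
}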
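
The threshold 120 in this exercise comes from how the kernel maps nice values to priorities for normal (non-real-time) tasks. The user-space sketch below spells out that mapping; the function names are made up for illustration.

```c
/* For normal (non-real-time) tasks the kernel maps nice values
 * -20..19 onto priorities 100..139, so nice 0 is priority 120. */
#define DEFAULT_PRIO 120

int nice_to_prio(int nice) { return DEFAULT_PRIO + nice; }
int prio_to_nice(int prio) { return prio - DEFAULT_PRIO; }

/* The exercise's rule: anything with higher priority than nice 0
 * (a numerically smaller value) counts as high priority. */
int is_high_priority(int prio) { return prio < DEFAULT_PRIO; }
```

A task started with `nice -n -5` has priority 115 and lands in the high-priority queue, while a task at the default nice 0 does not.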

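2. Load Balancing

Implement basic load balancing across CPUs. The helpers get_cpu_load() and get_average_load() are not part of sched_ext; writing them is part of the exercise:

int BPF_STRUCT_OPS(sched_dispatch, s32 cpu, struct task_struct *prev) {
    // Only consume if this CPU's load does not exceed the average.
    // A strict "<" would stall when all CPUs carry the same load.
    u32 cpu_load = get_cpu_load(cpu);
    if (cpu_load <= get_average_load()) {
        scx_bpf_consume(SHARED_DSQ_ID);
    }
    return 0;
}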
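
The decision logic of this exercise can be prototyped in user space before moving it into BPF. The sketch below assumes load is simply a task count per CPU held in an array; `average_load` and `should_consume` are hypothetical stand-ins for the helpers the exercise asks for, not sched_ext functions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Integer average of per-CPU load values; 0 for an empty array. */
uint32_t average_load(const uint32_t *loads, size_t nr_cpus) {
    uint64_t sum = 0;
    if (nr_cpus == 0)
        return 0;
    for (size_t i = 0; i < nr_cpus; i++)
        sum += loads[i];
    return (uint32_t)(sum / nr_cpus);
}

/* A CPU may pull work if its load does not exceed the average.
 * Using <= rather than < matters: with equal loads everywhere,
 * a strict comparison would let no CPU pull anything. */
bool should_consume(const uint32_t *loads, size_t nr_cpus, size_t cpu) {
    return cpu < nr_cpus && loads[cpu] <= average_load(loads, nr_cpus);
}
```

With loads of 4, 1, 1, and 2 the average is 2, so CPU 0 holds back while CPUs 1, 2, and 3 are allowed to pull new tasks.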

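3. Bandwidth Control

Implement CPU bandwidth limiting. The lookup helper get_task_bandwidth() is left as part of the exercise (for example, a BPF hash map keyed by PID), and LOW_PRIORITY_DSQ is the queue from the multi-queue experiment:

struct task_bandwidth {
    u64 allocated_time; // CPU time budget per period
    u64 used_time;      // CPU time consumed in the current period
    u64 last_update;    // timestamp of the last accounting update
};

int BPF_STRUCT_OPS(sched_enqueue, struct task_struct *p, u64 enq_flags) {
    struct task_bandwidth *bw = get_task_bandwidth(p->pid);
    if (bw && bw->used_time >= bw->allocated_time) {
        // Task exceeded bandwidth: lower priority and a shorter 1 ms slice
        scx_bpf_dispatch(p, LOW_PRIORITY_DSQ, 1000000u, enq_flags);
    } else {
        scx_bpf_dispatch(p, SHARED_DSQ_ID, 5000000u, enq_flags);
    }
    return 0;
}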
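
The exercise leaves open how `used_time` grows and when it resets. One possible accounting scheme, sketched here in user space, charges runtime when a task stops and refills the budget once per period. The period-based refill and the function names are assumptions of this sketch, not something the exercise specifies.

```c
#include <stdbool.h>
#include <stdint.h>

struct task_bandwidth {
    uint64_t allocated_time;  /* budget per period, in ns */
    uint64_t used_time;       /* runtime consumed in the current period */
    uint64_t last_update;     /* start of the current period, in ns */
};

/* Starts a new period (resetting used_time) once period_ns has elapsed. */
void refill_if_due(struct task_bandwidth *bw, uint64_t now_ns, uint64_t period_ns) {
    if (now_ns - bw->last_update >= period_ns) {
        bw->used_time = 0;
        bw->last_update = now_ns;
    }
}

/* Charges runtime to the task, for example when it stops running. */
void charge_runtime(struct task_bandwidth *bw, uint64_t ran_ns) {
    bw->used_time += ran_ns;
}

/* Matches the check in the exercise's sched_enqueue. */
bool over_budget(const struct task_bandwidth *bw) {
    return bw->used_time >= bw->allocated_time;
}
```

In a BPF version, the charging step would naturally live in a hook that runs when a task stops, and the refill check could run at enqueue time just before the budget comparison.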

Resources and Further Reading

Official Documentation

Advanced Topics

Rust Implementation: scx_rust_scheduler
Java Implementation: hello-ebpf
Real-time Scheduling: RT-sched_ext extensions

Books and Tutorials

Conclusion

This tutorial demonstrated how surprisingly simple it is to create a custom Linux scheduler using eBPF and sched_ext. The minimal implementation shows the core concepts while providing a foundation for more sophisticated scheduling algorithms.

Key takeaways:

Simplicity: Basic schedulers require minimal code
Flexibility: Easy to experiment with different algorithms
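Safety: The eBPF verifier rejects unsafe programs before they load, and sched_ext falls back to the default scheduler if a BPF scheduler fails
Performance: Scheduling decisions run inside the kernel, with no round trip to user space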

The combination of eBPF’s safety guarantees and sched_ext’s scheduling framework makes kernel development more accessible than ever before.

Adapted from the original tutorial by Johannes Bechberger on Mostly Nerdless