Netflix eBPF Infrastructure Observability: Detecting Noisy Neighbors at Scale#

Netflix’s Compute and Performance Engineering teams use eBPF to detect “noisy neighbors” in multi-tenant systems. The approach monitors process scheduling continuously with minimal performance impact, which addresses a long-standing observability problem in large-scale distributed environments.

The Multi-Tenant Challenge: Noisy Neighbors#

graph TB
    subgraph "Multi-Tenant System Challenges"
        subgraph "Container 1 - Normal Workload"
            C1[CPU: 20%] --> C1P[Performance: Good]
            C1M[Memory: 512MB] --> C1P
        end

        subgraph "Container 2 - Noisy Neighbor"
            C2[CPU: 95%] --> C2I[High CPU Usage]
            C2M[Memory: 2GB] --> C2I
            C2I --> Impact[Performance Impact]
        end

        subgraph "Container 3 - Affected"
            C3[CPU: 15%] --> C3P[Performance: Degraded]
            C3L[Latency: High] --> C3P
        end

        Impact --> C3P
        Impact --> C1P
    end

    style C2I fill:#ffcdd2
    style Impact fill:#ffcdd2
    style C3P fill:#fff3e0

What Are Noisy Neighbors?#

In multi-tenant systems, noisy neighbors are processes or containers that consume excessive resources, negatively impacting the performance of other workloads running on the same physical host. These can manifest as:

CPU-intensive processes that monopolize CPU cycles
Memory-hungry applications that cause memory pressure
I/O-bound workloads that saturate disk or network resources
Scheduler interference that disrupts process execution timing

Traditional Detection Challenges#

Conventional approaches to noisy neighbor detection face several limitations:

sequenceDiagram
    participant Problem as Performance Issue
    participant Alert as Alert System
    participant Engineer as Engineer
    participant Tools as Analysis Tools
    participant Resolution as Resolution

    Problem->>Alert: Performance degradation detected
    Alert->>Engineer: Alert triggered
    Engineer->>Tools: Deploy perf/profiling tools

    rect rgb(255, 205, 210)
        Note over Tools: High overhead analysis
        Note over Tools: Requires application restart
        Note over Tools: Post-incident deployment
    end

    Tools->>Engineer: Analysis results
    Engineer->>Resolution: Apply fixes

    Note over Problem,Resolution: Problem already impacted users

Key Limitations#

Reactive Nature: Tools are typically deployed after performance issues have already occurred
High Overhead: Analysis tools like perf introduce significant performance overhead
Expertise Requirements: Require specialized engineering knowledge to operate effectively
System Disruption: Often require application restarts or recompilation for instrumentation

Netflix’s eBPF Solution#

Netflix’s innovative approach uses eBPF to instrument the Linux kernel for continuous, low-overhead monitoring of the scheduler subsystem.

Architecture Overview#

graph TB
    subgraph "Kernel Space"
        subgraph "eBPF Hooks"
            H1[sched_wakeup] --> Metrics[Process Latency Calculation]
            H2[sched_wakeup_new] --> Metrics
            H3[sched_switch] --> Metrics
        end

        subgraph "Scheduler Events"
            E1[Process Ready] --> H1
            E2[New Process] --> H2
            E3[Context Switch] --> H3
        end
    end

    subgraph "User Space"
        subgraph "Data Processing"
            RB[Ring Buffer] --> Go[Go Application]
            Go --> Atlas[Atlas Metrics Backend]
        end

        subgraph "Monitoring"
            Atlas --> Dashboards[Monitoring Dashboards]
            Atlas --> Alerts[Alert Systems]
        end
    end

    Metrics --> RB

    style H1 fill:#e1f5fe
    style H2 fill:#e1f5fe
    style H3 fill:#e1f5fe
    style Go fill:#c8e6c9

Core eBPF Implementation#

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>

// Process latency tracking structure
struct process_event {
    __u32 pid;
    __u32 tgid;
    __u32 cgroup_id;         // kernfs node id, truncated to 32 bits
    __u64 wakeup_time;
    __u64 schedule_time;
    __u64 latency_ns;
    __u32 preempted_by_pid;  // task that held the CPU just before this one ran
    __u8 throttled;
    char comm[16];
};

// Maps for tracking process states
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 65536);
    __type(key, __u32);
    __type(value, __u64);
} wakeup_times SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 4 * 1024 * 1024);
} events SEC(".maps");

// Track CPU quotas for cgroup throttling detection
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10000);
    __type(key, __u32);
    __type(value, __u64);
} cgroup_quotas SEC(".maps");

// Helper function to get cgroup ID for container association.
// The helpers are defined before the hooks that call them, so the file
// compiles without forward declarations.
static __always_inline __u32 get_cgroup_id(struct task_struct *task) {
    struct cgroup *cgrp = BPF_CORE_READ(task, cgroups, subsys[0], cgroup);
    return (__u32)BPF_CORE_READ(cgrp, kn, id);
}

// Helper function to check if cgroup is being throttled
static __always_inline __u8 check_cgroup_throttling(__u32 cgroup_id) {
    __u64 *quota = bpf_map_lookup_elem(&cgroup_quotas, &cgroup_id);
    if (!quota) {
        return 0;
    }

    // Simplified throttling check - in practice, this would
    // examine CPU quota vs usage statistics
    return *quota > 0 ? 1 : 0;
}

// Hook: Process becomes ready to run
SEC("tp_btf/sched_wakeup")
int handle_sched_wakeup(u64 *ctx) {
    struct task_struct *task = (struct task_struct *)ctx[0];
    __u32 pid = BPF_CORE_READ(task, pid);
    __u64 now = bpf_ktime_get_ns();

    // Store wakeup timestamp
    bpf_map_update_elem(&wakeup_times, &pid, &now, BPF_ANY);

    return 0;
}

// Hook: New process becomes ready to run
SEC("tp_btf/sched_wakeup_new")
int handle_sched_wakeup_new(u64 *ctx) {
    struct task_struct *task = (struct task_struct *)ctx[0];
    __u32 pid = BPF_CORE_READ(task, pid);
    __u64 now = bpf_ktime_get_ns();

    // Store wakeup timestamp for new process
    bpf_map_update_elem(&wakeup_times, &pid, &now, BPF_ANY);

    return 0;
}

// Hook: Process is assigned CPU time
// Tracepoint arguments: ctx[0] = preempt flag, ctx[1] = prev, ctx[2] = next
SEC("tp_btf/sched_switch")
int handle_sched_switch(u64 *ctx) {
    struct task_struct *prev = (struct task_struct *)ctx[1];
    struct task_struct *next = (struct task_struct *)ctx[2];

    __u32 next_pid = BPF_CORE_READ(next, pid);
    __u32 next_tgid = BPF_CORE_READ(next, tgid);
    __u32 prev_pid = BPF_CORE_READ(prev, pid);
    __u64 now = bpf_ktime_get_ns();

    // Look up wakeup time for the process being scheduled
    __u64 *wakeup_time = bpf_map_lookup_elem(&wakeup_times, &next_pid);
    if (!wakeup_time) {
        return 0;
    }

    // Calculate run queue latency
    __u64 latency = now - *wakeup_time;

    // Get cgroup information for container association
    __u32 cgroup_id = get_cgroup_id(next);

    // Check if process is being throttled due to CPU quota
    __u8 throttled = check_cgroup_throttling(cgroup_id);

    // Create event for user space processing
    struct process_event *event = bpf_ringbuf_reserve(&events, sizeof(*event), 0);
    if (!event) {
        goto cleanup;
    }

    event->pid = next_pid;
    event->tgid = next_tgid;
    event->cgroup_id = cgroup_id;
    event->wakeup_time = *wakeup_time;
    event->schedule_time = now;
    event->latency_ns = latency;
    event->preempted_by_pid = prev_pid;
    event->throttled = throttled;

    // Copy process name. comm is a char array, so it is read with the
    // string variant of the CO-RE macros rather than BPF_CORE_READ.
    BPF_CORE_READ_STR_INTO(&event->comm, next, comm);

    bpf_ringbuf_submit(event, 0);

cleanup:
    // Clean up wakeup time tracking
    bpf_map_delete_elem(&wakeup_times, &next_pid);
    return 0;
}

char _license[] SEC("license") = "GPL";

User-Space Processing Application#

package main

import (
    "bytes"
    "context"
    "encoding/binary"
    "errors"
    "fmt"
    "log"
    "os"
    "os/signal"
    "syscall"
    "time"

    "github.com/cilium/ebpf"
    "github.com/cilium/ebpf/link"
    "github.com/cilium/ebpf/ringbuf"
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

// Scheduler metrics. This listing uses the Prometheus client as a stand-in;
// exposing them over HTTP or forwarding them to Atlas (Netflix's metrics
// backend) is not shown.
var (
    processLatencyHistogram = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "scheduler_process_latency_microseconds",
            Help:    "Process run queue latency in microseconds",
            Buckets: []float64{1, 5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000},
        },
        []string{"container_id", "throttled"},
    )

    preemptionCounter = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "scheduler_preemptions_total",
            Help: "Total number of process preemptions",
        },
        []string{"preempted_container", "preempting_container"},
    )

    noisyNeighborAlerts = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "noisy_neighbor_alerts_total",
            Help: "Total noisy neighbor alerts generated",
        },
        []string{"container_id", "alert_type"},
    )
)

// Process event structure matching the eBPF struct process_event.
// The blank fields reproduce the padding the C compiler inserts, because
// binary.Read does not add alignment padding on its own.
type ProcessEvent struct {
    PID            uint32
    TGID           uint32
    CgroupID       uint32
    _              [4]byte // aligns WakeupTime to 8 bytes
    WakeupTime     uint64
    ScheduleTime   uint64
    LatencyNS      uint64
    PreemptedByPID uint32
    Throttled      uint8
    Comm           [16]int8
    _              [3]byte // tail padding; the C struct is 64 bytes
}

// Container performance tracking
type ContainerMetrics struct {
    ID             string
    LatencyP99     time.Duration // tracked as a running maximum in this listing
    LatencyAvg     time.Duration
    PreemptionRate float64
    ThrottlingRate float64
    LastUpdate     time.Time
}

// schedulerObjects holds the programs and maps loaded from scheduler_monitor.o.
// The struct tags must match the names used in the eBPF C source.
type schedulerObjects struct {
    HandleSchedWakeup    *ebpf.Program `ebpf:"handle_sched_wakeup"`
    HandleSchedWakeupNew *ebpf.Program `ebpf:"handle_sched_wakeup_new"`
    HandleSchedSwitch    *ebpf.Program `ebpf:"handle_sched_switch"`
    Events               *ebpf.Map     `ebpf:"events"`
}

// Close releases the loaded programs and maps.
func (o *schedulerObjects) Close() {
    o.HandleSchedWakeup.Close()
    o.HandleSchedWakeupNew.Close()
    o.HandleSchedSwitch.Close()
    o.Events.Close()
}

type SchedulerMonitor struct {
    objs           *schedulerObjects
    links          []link.Link
    reader         *ringbuf.Reader
    containerCache map[uint32]*ContainerMetrics

    // Noisy neighbor detection thresholds
    latencyThreshold    time.Duration
    preemptionThreshold float64
}

func NewSchedulerMonitor() (*SchedulerMonitor, error) {
    // Load eBPF program
    spec, err := ebpf.LoadCollectionSpec("scheduler_monitor.o")
    if err != nil {
        return nil, fmt.Errorf("loading eBPF spec: %w", err)
    }

    objs := &schedulerObjects{}
    if err := spec.LoadAndAssign(objs, nil); err != nil {
        return nil, fmt.Errorf("loading eBPF objects: %w", err)
    }

    // Set up ring buffer reader
    reader, err := ringbuf.NewReader(objs.Events)
    if err != nil {
        objs.Close()
        return nil, fmt.Errorf("creating ring buffer reader: %w", err)
    }

    monitor := &SchedulerMonitor{
        objs:                objs,
        reader:              reader,
        containerCache:      make(map[uint32]*ContainerMetrics),
        latencyThreshold:    500 * time.Microsecond, // 500μs threshold
        preemptionThreshold: 10.0,                   // 10 preemptions/sec
    }

    return monitor, nil
}

func (sm *SchedulerMonitor) AttachPrograms() error {
    // The C programs use SEC("tp_btf/..."), so they are tracing programs and
    // attach with link.AttachTracing. The target tracepoint comes from the
    // section name recorded at load time, not from arguments passed here.
    programs := []struct {
        name string
        prog *ebpf.Program
    }{
        {"sched_wakeup", sm.objs.HandleSchedWakeup},
        {"sched_wakeup_new", sm.objs.HandleSchedWakeupNew},
        {"sched_switch", sm.objs.HandleSchedSwitch},
    }

    for _, p := range programs {
        l, err := link.AttachTracing(link.TracingOptions{Program: p.prog})
        if err != nil {
            return fmt.Errorf("attaching %s: %w", p.name, err)
        }
        sm.links = append(sm.links, l)
    }

    log.Println("Successfully attached eBPF programs to scheduler tracepoints")
    return nil
}

func (sm *SchedulerMonitor) ProcessEvents(ctx context.Context) {
    for {
        // Read blocks until a sample arrives or the reader is closed, so
        // shutdown is signalled by Close() rather than by polling ctx.
        record, err := sm.reader.Read()
        if err != nil {
            if errors.Is(err, ringbuf.ErrClosed) || ctx.Err() != nil {
                return
            }
            log.Printf("Error reading from ring buffer: %v", err)
            continue
        }

        sm.handleProcessEvent(record.RawSample)
    }
}

func (sm *SchedulerMonitor) handleProcessEvent(data []byte) {
    if len(data) < binary.Size(ProcessEvent{}) {
        return
    }

    var event ProcessEvent
    err := binary.Read(bytes.NewReader(data), binary.LittleEndian, &event)
    if err != nil {
        log.Printf("Error parsing event: %v", err)
        return
    }

    // Convert latency to microseconds
    latencyMicros := float64(event.LatencyNS) / 1000.0

    // Get container ID from cgroup
    containerID := fmt.Sprintf("container_%d", event.CgroupID)

    // Update Prometheus metrics
    throttledLabel := "false"
    if event.Throttled == 1 {
        throttledLabel = "true"
    }

    processLatencyHistogram.WithLabelValues(containerID, throttledLabel).Observe(latencyMicros)

    // Update container metrics cache
    sm.updateContainerMetrics(event.CgroupID, time.Duration(event.LatencyNS), event.Throttled == 1)

    // Detect noisy neighbors
    sm.detectNoisyNeighbors(event)

    // Log high latency events
    if time.Duration(event.LatencyNS) > sm.latencyThreshold {
        log.Printf("High latency detected: PID=%d, Container=%s, Latency=%v, Throttled=%v",
            event.PID, containerID, time.Duration(event.LatencyNS), event.Throttled == 1)
    }
}

func (sm *SchedulerMonitor) updateContainerMetrics(cgroupID uint32, latency time.Duration, throttled bool) {
    metrics, exists := sm.containerCache[cgroupID]
    if !exists {
        metrics = &ContainerMetrics{
            ID: fmt.Sprintf("container_%d", cgroupID),
        }
        sm.containerCache[cgroupID] = metrics
    }

    // Simplified implementation: LatencyAvg is an exponential moving average
    // with weight 0.5, and LatencyP99 is a running maximum, not a true percentile.
    metrics.LatencyAvg = (metrics.LatencyAvg + latency) / 2
    if latency > metrics.LatencyP99 {
        metrics.LatencyP99 = latency
    }

    // Fold every event into the throttling rate so it can decay as well as grow.
    sample := 0.0
    if throttled {
        sample = 1.0
    }
    metrics.ThrottlingRate = (metrics.ThrottlingRate + sample) / 2

    metrics.LastUpdate = time.Now()
}

func (sm *SchedulerMonitor) detectNoisyNeighbors(event ProcessEvent) {
    containerID := fmt.Sprintf("container_%d", event.CgroupID)
    latency := time.Duration(event.LatencyNS)

    // High latency alert
    if latency > sm.latencyThreshold && event.Throttled == 0 {
        noisyNeighborAlerts.WithLabelValues(containerID, "high_latency").Inc()

        log.Printf("NOISY NEIGHBOR ALERT: Container %s experiencing high latency (%v) - possible external interference",
            containerID, latency)
    }

    // Excessive preemption alert
    if event.PreemptedByPID != 0 {
        preemptingContainer := sm.getContainerForPID(event.PreemptedByPID)
        preemptionCounter.WithLabelValues(containerID, preemptingContainer).Inc()

        // Check if preemption rate is too high (simplified check).
        // PreemptionRate is never populated in this listing, so the branch
        // below is illustrative only.
        metrics := sm.containerCache[event.CgroupID]
        if metrics != nil && metrics.PreemptionRate > sm.preemptionThreshold {
            noisyNeighborAlerts.WithLabelValues(preemptingContainer, "excessive_preemption").Inc()

            log.Printf("NOISY NEIGHBOR ALERT: Container %s causing excessive preemptions to %s",
                preemptingContainer, containerID)
        }
    }

    // Throttling correlation alert
    if event.Throttled == 1 && latency > sm.latencyThreshold {
        log.Printf("PERFORMANCE ALERT: Container %s hitting CPU quota limits (throttled latency: %v)",
            containerID, latency)
    }
}

func (sm *SchedulerMonitor) getContainerForPID(pid uint32) string {
    // Simplified - in practice, would maintain PID->Container mapping
    return fmt.Sprintf("unknown_container_%d", pid)
}

func (sm *SchedulerMonitor) Close() {
    for _, l := range sm.links {
        l.Close()
    }
    sm.reader.Close()
    sm.objs.Close()
}

func main() {
    // Set up signal handling
    ctx, cancel := context.WithCancel(context.Background())
    defer cancel()

    sigChan := make(chan os.Signal, 1)
    signal.Notify(sigChan, syscall.SIGINT, syscall.SIGTERM)

    // Initialize scheduler monitor
    monitor, err := NewSchedulerMonitor()
    if err != nil {
        log.Fatalf("Failed to create scheduler monitor: %v", err)
    }
    defer monitor.Close()

    // Attach eBPF programs
    if err := monitor.AttachPrograms(); err != nil {
        log.Fatalf("Failed to attach eBPF programs: %v", err)
    }

    log.Println("Netflix Scheduler Monitor started - detecting noisy neighbors...")

    // Start processing events
    go monitor.ProcessEvents(ctx)

    // Wait for signal
    <-sigChan
    log.Println("Shutting down...")
    cancel()
}

Key Performance Metric: Process Latency#

The cornerstone of Netflix’s noisy neighbor detection is process latency, specifically run queue latency:

Run Queue Latency: The time processes spend in the scheduling queue before being dispatched to the CPU.

Latency Calculation Process#

sequenceDiagram
    participant Process as Process
    participant Scheduler as Linux Scheduler
    participant eBPF as eBPF Program
    participant Metrics as Metrics System

    Process->>Scheduler: Process becomes ready (sched_wakeup)
    Scheduler->>eBPF: Hook triggered
    eBPF->>eBPF: Store timestamp T1

    Note over Process,Scheduler: Process waits in run queue

    Scheduler->>Process: CPU assigned (sched_switch)
    Scheduler->>eBPF: Hook triggered
    eBPF->>eBPF: Calculate latency: T2 - T1
    eBPF->>Metrics: Submit latency data

    rect rgb(200, 230, 201)
        Note over eBPF,Metrics: Process latency = T2 - T1
    end
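
The T2 - T1 bookkeeping can be illustrated outside the kernel. The sketch below replays a hand-written event trace in user space and pairs each wakeup with the next switch for the same PID. The `schedEvent` type and the timestamps are hypothetical and stand in for the tracepoint data.

```go
package main

import "fmt"

// schedEvent is a simplified stand-in for one scheduler tracepoint hit.
type schedEvent struct {
	kind string // "wakeup" or "switch"
	pid  uint32
	ts   uint64 // monotonic timestamp in nanoseconds
}

// runQueueLatencies replays events and returns the T2 - T1 latency for each
// completed wakeup -> switch pair, in order of completion.
func runQueueLatencies(events []schedEvent) []uint64 {
	wakeups := make(map[uint32]uint64) // pid -> T1
	var out []uint64
	for _, e := range events {
		switch e.kind {
		case "wakeup":
			wakeups[e.pid] = e.ts
		case "switch":
			t1, ok := wakeups[e.pid]
			if !ok {
				continue // no matching wakeup seen; nothing to measure
			}
			out = append(out, e.ts-t1)
			delete(wakeups, e.pid) // mirror the map cleanup done in the hook
		}
	}
	return out
}

func main() {
	events := []schedEvent{
		{"wakeup", 10, 1_000},
		{"wakeup", 11, 1_200},
		{"switch", 10, 1_450}, // waited 450 ns
		{"switch", 11, 3_200}, // waited 2000 ns
		{"switch", 12, 3_500}, // no wakeup recorded in this trace; ignored
	}
	fmt.Println(runQueueLatencies(events)) // [450 2000]
}
```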

Beyond Simple Latency: Context-Aware Analysis#

Netflix’s solution goes beyond simple latency measurement by incorporating contextual information:

Container Association via cgroups#

// Extract cgroup information for container correlation
static __u32 get_container_id(struct task_struct *task) {
    struct cgroup *cgrp = BPF_CORE_READ(task, cgroups, subsys[0], cgroup);
    return BPF_CORE_READ(cgrp, kn, id);
}

Preemption Tracking#

// Track which process caused preemption
struct preemption_event {
    __u32 preempted_pid;        // Process that was preempted
    __u32 preempting_pid;       // Process that caused preemption
    __u32 preempted_container;  // Container being preempted
    __u32 preempting_container; // Container causing preemption
    __u64 timestamp;
};

CPU Quota Correlation#

The system distinguishes between latency caused by:

Noisy neighbors: External processes consuming resources
CPU quota limits: Container hitting its allocated CPU limits

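// Check if latency is due to throttling vs. external interference.
// Assumes a kernel built with CONFIG_FAIR_GROUP_SCHED and CONFIG_CFS_BANDWIDTH,
// so that task->se.cfs_rq and cfs_rq->throttled are present in vmlinux.h.
static __always_inline __u8 is_throttled_latency(struct task_struct *task) {
    // The task's scheduling entity points at the CFS run queue of its cgroup
    struct cfs_rq *cfs_rq = BPF_CORE_READ(task, se.cfs_rq);

    // Check if CFS throttling is active
    return BPF_CORE_READ(cfs_rq, throttled) ? 1 : 0;
}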

Performance Impact and Optimization#

Overhead Analysis#

Netflix conducted extensive performance testing to ensure their eBPF monitoring didn’t become a performance bottleneck itself:

graph TB
    subgraph "Performance Metrics"
        subgraph "Hook Overhead"
            H1[sched_wakeup: <100ns] --> Total[Total: <600ns]
            H2[sched_wakeup_new: <50ns] --> Total
            H3[sched_switch: <450ns] --> Total
        end

        subgraph "System Impact"
            Total --> Impact[CPU Overhead: <0.1%]
            Memory[Memory: 2-4MB] --> Impact
            Network[Network: Minimal] --> Impact
        end

        subgraph "Comparison"
            Impact --> Better[10x better than perf]
            Better --> Continuous[Enables continuous monitoring]
        end
    end

    style Total fill:#c8e6c9
    style Impact fill:#c8e6c9
    style Continuous fill:#c8e6c9

Key Optimizations Implemented#

1. Efficient Data Structures#

// Optimized hash map for process tracking
struct {
    __uint(type, BPF_MAP_TYPE_LRU_HASH);  // LRU eviction
    __uint(max_entries, 65536);           // Sized for workload
    __type(key, __u32);
    __type(value, __u64);
} wakeup_times SEC(".maps");

2. Ring Buffer Communication#

// High-performance ring buffer for event streaming
struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 4 * 1024 * 1024);  // 4MB buffer
} events SEC(".maps");

3. Sampling and Rate Limiting#

// Rate limiting for high-frequency events.
// Fixed one-second window; not safe for concurrent use without a mutex.
type RateLimiter struct {
    events    int64
    lastReset time.Time
    limit     int64
}

func (rl *RateLimiter) Allow() bool {
    now := time.Now()
    if now.Sub(rl.lastReset) > time.Second {
        rl.events = 0
        rl.lastReset = now
    }

    if rl.events < rl.limit {
        rl.events++
        return true
    }
    return false
}

Performance Measurement Tool: bpftop#

Netflix developed bpftop to measure eBPF program overhead in real-time:

# Example bpftop output showing scheduler monitor overhead
$ sudo bpftop
PID    COMM             TYPE        PROG              RUNTIME(us)  EVENTS   AVG_RUNTIME(ns)
12345  scheduler_mon    tracepoint  handle_sched_*    150.2        1,234    121.7

Key metrics tracked:

Runtime per event: <600ns per scheduler hook
Total CPU usage: <0.1% system-wide
Memory footprint: 2-4MB for maps and buffers

Advanced Noisy Neighbor Detection#

Multi-Dimensional Analysis#

Netflix’s system performs sophisticated analysis by correlating multiple signals:

// Comprehensive noisy neighbor scoring.
// This fragment assumes a NoisyNeighborDetector type with a thresholds field
// and a calculateConfidence method, a CPUUsage float64 field on
// ContainerMetrics (fraction of a core, 0-1), and the "math" import.
type NoisyNeighborScore struct {
    ContainerID     string
    LatencyScore    float64 // Based on run queue latency
    PreemptionScore float64 // Based on preemption frequency
    ThrottlingScore float64 // Based on CPU quota hits
    ResourceScore   float64 // Based on resource consumption
    OverallScore    float64 // Weighted combination
    Confidence      float64 // Statistical confidence
}

func (detector *NoisyNeighborDetector) CalculateScore(containerID string,
    metrics *ContainerMetrics) *NoisyNeighborScore {

    score := &NoisyNeighborScore{ContainerID: containerID}

    // Latency scoring (0-100 scale)
    if metrics.LatencyP99 > detector.thresholds.LatencyHigh {
        score.LatencyScore = 100
    } else if metrics.LatencyP99 > detector.thresholds.LatencyMedium {
        score.LatencyScore = 50
    } else {
        score.LatencyScore = 0
    }

    // Preemption frequency scoring
    score.PreemptionScore = math.Min(metrics.PreemptionRate*10, 100)

    // Throttling scoring: reported on its own and left out of the weighted
    // sum, since throttling points at the container's own quota
    score.ThrottlingScore = math.Min(metrics.ThrottlingRate*100, 100)

    // Resource consumption scoring
    score.ResourceScore = math.Min(metrics.CPUUsage*100, 100)

    // Combine scores with weights (0.4 + 0.3 + 0.3 = 1.0)
    score.OverallScore = (score.LatencyScore * 0.4) +
        (score.PreemptionScore * 0.3) +
        (score.ResourceScore * 0.3)

    // Calculate confidence based on data quality
    score.Confidence = detector.calculateConfidence(metrics)

    return score
}
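
To make the weighting concrete, here is a worked example of the weighted combination as a standalone sketch. The `inputs` type is hypothetical, and treating CPU usage as a fraction between 0 and 1 is an assumption inferred from the multiplication by 100 above.

```go
package main

import (
	"fmt"
	"math"
)

// inputs holds the per-container signals that feed the score.
type inputs struct {
	latencyScore   float64 // 0, 50 or 100 from the latency threshold ladder
	preemptionRate float64 // preemptions per second
	cpuUsage       float64 // assumed to be a fraction of a core, 0..1
}

// overallScore applies the 0.4 / 0.3 / 0.3 weighting, capping each component at 100.
func overallScore(in inputs) float64 {
	preemption := math.Min(in.preemptionRate*10, 100)
	resource := math.Min(in.cpuUsage*100, 100)
	return in.latencyScore*0.4 + preemption*0.3 + resource*0.3
}

func main() {
	// High latency, 4 preemptions/s, 90% CPU:
	// 0.4*100 + 0.3*40 + 0.3*90 = 40 + 12 + 27 = 79
	s := overallScore(inputs{latencyScore: 100, preemptionRate: 4, cpuUsage: 0.9})
	fmt.Printf("%.1f\n", s) // 79.0
}
```

Because every component is capped at 100 and the weights sum to 1, the overall score also stays within 0 to 100.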

Temporal Pattern Analysis#

// Detect patterns over time to reduce false positives
type TemporalAnalyzer struct {
    windowSize time.Duration
    patterns   map[string]*PatternHistory
}

type PatternHistory struct {
    Timestamps  []time.Time
    Scores      []float64
    TrendSlope  float64
    Seasonality map[time.Duration]float64
}

func (ta *TemporalAnalyzer) AnalyzePattern(containerID string,
    score *NoisyNeighborScore) bool {

    history := ta.patterns[containerID]
    if history == nil {
        history = &PatternHistory{}
        ta.patterns[containerID] = history
    }

    // Add current data point
    history.Timestamps = append(history.Timestamps, time.Now())
    history.Scores = append(history.Scores, score.OverallScore)

    // Keep only recent history
    ta.pruneOldData(history)

    // Calculate trend
    history.TrendSlope = ta.calculateTrend(history.Scores)

    // Detect if this is a sustained pattern vs. temporary spike
    return ta.isSustainedPattern(history)
}
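
The listing leaves calculateTrend and isSustainedPattern undefined. One plausible way to fill them in, offered only as a sketch, is a least-squares slope over the score history and a check that the most recent k scores all stay above a threshold. The function names, the value of k and the threshold below are assumptions.

```go
package main

import "fmt"

// trendSlope returns the least-squares slope of scores against sample index.
// A positive slope means scores are rising over the window.
func trendSlope(scores []float64) float64 {
	n := float64(len(scores))
	if n < 2 {
		return 0
	}
	var sumX, sumY, sumXY, sumXX float64
	for i, y := range scores {
		x := float64(i)
		sumX += x
		sumY += y
		sumXY += x * y
		sumXX += x * x
	}
	return (n*sumXY - sumX*sumY) / (n*sumXX - sumX*sumX)
}

// isSustained reports whether the last k scores are all at or above threshold,
// which filters out a single transient spike.
func isSustained(scores []float64, k int, threshold float64) bool {
	if k <= 0 || len(scores) < k {
		return false
	}
	for _, s := range scores[len(scores)-k:] {
		if s < threshold {
			return false
		}
	}
	return true
}

func main() {
	spike := []float64{10, 80, 20, 15}
	steady := []float64{30, 75, 80, 90}

	fmt.Println(trendSlope([]float64{10, 20, 30, 40})) // 10
	fmt.Println(isSustained(spike, 3, 70))             // false
	fmt.Println(isSustained(steady, 3, 70))            // true
}
```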

Alert Generation and Response#

// Intelligent alerting system
type AlertManager struct {
    alertThresholds map[string]AlertThreshold
    cooldownPeriods map[string]time.Time
    escalationRules []EscalationRule
}

type AlertThreshold struct {
    ScoreThreshold    float64
    ConfidenceMin     float64
    SustainedDuration time.Duration
}

type EscalationRule struct {
    Condition func(*NoisyNeighborScore) bool
    Action    string
    Severity  AlertSeverity
    Targets   []string
}

func (am *AlertManager) ProcessAlert(score *NoisyNeighborScore,
    pattern *PatternHistory) {

    // Check if we're in cooldown period
    if lastAlert, exists := am.cooldownPeriods[score.ContainerID]; exists {
        if time.Since(lastAlert) < 5*time.Minute {
            return // Skip due to cooldown
        }
    }

    // Determine alert severity
    severity := am.calculateSeverity(score, pattern)

    // Generate alert based on severity
    alert := &Alert{
        ContainerID:     score.ContainerID,
        Severity:        severity,
        Score:           score,
        Pattern:         pattern,
        Timestamp:       time.Now(),
        Recommendations: am.generateRecommendations(score),
    }

    // Send alert through appropriate channels
    am.sendAlert(alert)

    // Update cooldown
    am.cooldownPeriods[score.ContainerID] = time.Now()
}

func (am *AlertManager) generateRecommendations(score *NoisyNeighborScore) []string {
    var recommendations []string

    if score.LatencyScore > 70 {
        recommendations = append(recommendations,
            "Consider increasing CPU limits or moving to dedicated nodes")
    }

    if score.PreemptionScore > 80 {
        recommendations = append(recommendations,
            "Investigate high-priority processes causing excessive preemption")
    }

    if score.ResourceScore > 90 {
        recommendations = append(recommendations,
            "Container may need resource limit adjustment or optimization")
    }

    return recommendations
}

Integration with Netflix Infrastructure#

Atlas Metrics Integration#

// Atlas metrics client for Netflix.
// The endpoint path and bearer-token header below are illustrative; consult
// the Atlas documentation for the actual publish API. Metric and AtlasPayload
// are assumed to be defined elsewhere, and the fragment needs the bytes,
// encoding/json, fmt, net/http and time imports.
type AtlasMetricsClient struct {
    baseURL    string
    apiKey     string
    httpClient *http.Client
}

func (client *AtlasMetricsClient) SendMetrics(metrics []Metric) error {
    payload := AtlasPayload{
        Metrics:   metrics,
        Timestamp: time.Now().Unix(),
        Source:    "ebpf-scheduler-monitor",
    }

    jsonData, err := json.Marshal(payload)
    if err != nil {
        return fmt.Errorf("marshaling metrics: %w", err)
    }

    req, err := http.NewRequest(http.MethodPost, client.baseURL+"/api/v1/metrics",
        bytes.NewBuffer(jsonData))
    if err != nil {
        return fmt.Errorf("creating request: %w", err)
    }

    req.Header.Set("Content-Type", "application/json")
    req.Header.Set("Authorization", "Bearer "+client.apiKey)

    resp, err := client.httpClient.Do(req)
    if err != nil {
        return fmt.Errorf("sending request: %w", err)
    }
    defer resp.Body.Close()

    if resp.StatusCode != http.StatusOK {
        return fmt.Errorf("atlas API error: %d", resp.StatusCode)
    }

    return nil
}

Kubernetes Integration#

# Netflix scheduler monitor deployment
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: netflix-scheduler-monitor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: scheduler-monitor
  template:
    metadata:
      labels:
        app: scheduler-monitor
    spec:
      hostNetwork: true
      hostPID: true
      serviceAccountName: scheduler-monitor
      containers:
        - name: monitor
          image: netflix/scheduler-monitor:latest
          securityContext:
            privileged: true
            capabilities:
              add: ["SYS_ADMIN", "BPF"]
          env:
            - name: ATLAS_ENDPOINT
              valueFrom:
                secretKeyRef:
                  name: atlas-credentials
                  key: endpoint
            - name: ATLAS_API_KEY
              valueFrom:
                secretKeyRef:
                  name: atlas-credentials
                  key: api-key
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          resources:
            requests:
              cpu: 50m
              memory: 64Mi
            limits:
              cpu: 200m
              memory: 256Mi
          volumeMounts:
            - name: debugfs
              mountPath: /sys/kernel/debug
            - name: tracefs
              mountPath: /sys/kernel/tracing
            - name: bpf-maps
              mountPath: /sys/fs/bpf
      volumes:
        - name: debugfs
          hostPath:
            path: /sys/kernel/debug
        - name: tracefs
          hostPath:
            path: /sys/kernel/tracing
        - name: bpf-maps
          hostPath:
            path: /sys/fs/bpf
      tolerations:
        - operator: Exists
          effect: NoSchedule

Production Results and Impact#

Performance Improvements#

graph TB
    subgraph "Before eBPF Monitoring"
        B1[Reactive Problem Detection] --> B2[Manual Investigation]
        B2 --> B3[perf Tool Deployment]
        B3 --> B4[High Overhead Analysis]
        B4 --> B5[Post-Incident Resolution]

        style B1 fill:#ffcdd2
        style B4 fill:#ffcdd2
        style B5 fill:#ffcdd2
    end

    subgraph "After eBPF Monitoring"
        A1[Proactive Detection] --> A2[Automated Analysis]
        A2 --> A3[Real-time Insights]
        A3 --> A4[Low Overhead Monitoring]
        A4 --> A5[Preventive Actions]

        style A1 fill:#c8e6c9
        style A3 fill:#c8e6c9
        style A4 fill:#c8e6c9
        style A5 fill:#c8e6c9
    end

Key Metrics and Improvements#

Metric	Before eBPF	After eBPF	Improvement
Detection Time	Hours-Days	Seconds-Minutes	100-1000x faster
Analysis Overhead	5-15% CPU	<0.1% CPU	50-150x reduction
Coverage	Reactive only	Continuous	24/7 monitoring
False Positives	High	Low	Context-aware filtering
Resolution Time	Hours	Minutes	10-30x faster

Business Impact#

Improved SLA Performance: Reduced latency spikes by 40%
Operational Efficiency: 75% reduction in manual investigation time
Infrastructure Optimization: Better resource allocation decisions
Cost Savings: Reduced over-provisioning through accurate capacity planning

Conclusion#

Netflix’s eBPF-based infrastructure observability represents a paradigm shift in how large-scale systems approach performance monitoring and noisy neighbor detection.

Key Innovations#

Continuous Monitoring: 24/7 observability without reactive deployment
Minimal Overhead: <0.1% CPU impact enables production deployment
Context-Aware Analysis: Distinguishes between quota limits and external interference
Real-Time Detection: Immediate identification of performance issues
Scalable Architecture: Handles Netflix’s massive multi-tenant infrastructure

Strategic Advantages#

Proactive Problem Resolution: Address issues before user impact
Data-Driven Optimization: Make informed infrastructure decisions
Operational Excellence: Reduce manual investigation and response time
Cost Efficiency: Optimize resource allocation and reduce waste

Future Implications#

This approach demonstrates the transformative potential of eBPF for:

Enterprise Monitoring: Extending beyond Netflix to other large-scale deployments
Cloud Provider Services: Enhanced multi-tenant isolation and monitoring
Container Orchestration: Better Kubernetes and container performance insights
Performance Engineering: New methodologies for system optimization

Netflix’s success with eBPF infrastructure observability provides a blueprint for organizations seeking to achieve similar levels of operational excellence and performance optimization in their own multi-tenant environments.