The scheduler's job
Every running Linux system has more threads that want CPU time than it has CPU cores to run them. The scheduler decides which threads get to run, for how long, and on which cores. It needs to balance fairness (every thread gets its share), throughput (maximize useful work), and latency (interactive tasks respond quickly).
Completely Fair Scheduler
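CFS has been the default scheduler for normal (SCHED_OTHER) processes since Linux 2.6.23 (kernel 6.6 swapped its task-selection logic for EEVDF but kept the vruntime accounting described here). It doesn't use fixed time slices like older schedulers. Instead, it tracks how much CPU time each runnable task has consumed, weighted by priority, and always picks the task with the least accumulated runtime.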
The core data structure is a red-black tree ordered by vruntime (virtual runtime). The leftmost node has the smallest vruntime, meaning it's been the most "underserved" and should run next.

```c
struct sched_entity {
    u64 vruntime;
    struct rb_node run_node;
    // ...
};

// Simplified pick_next_entity
struct sched_entity *pick_next(struct cfs_rq *rq) {
    // The cached leftmost node is the entity with the smallest vruntime
    struct rb_node *left = rb_first_cached(&rq->tasks_timeline);
    if (!left)
        return NULL; // no runnable entities on this queue
    return rb_entry(left, struct sched_entity, run_node);
}
```

When a task runs, its vruntime increases in proportion to the time it spends on the CPU. When CFS needs to pick the next task, it takes the one with the lowest vruntime. Over time, the vruntimes of all runnable tasks stay close together, which for tasks of equal priority means they all receive roughly equal CPU time. That's the "fair" in Completely Fair Scheduler.
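To make the selection rule concrete, here is a small user-space sketch, not kernel code. It assumes equal-priority tasks and a fixed 1 ms run per decision (a simplification; real CFS computes run lengths dynamically), and it uses a min-heap in place of the red-black tree. The names `simulate_cfs` and `slice_ns` are made up for this illustration.

```python
import heapq


def simulate_cfs(task_names, ticks, slice_ns=1_000_000):
    """Make `ticks` scheduling decisions; return CPU time (ns) per task."""
    # A min-heap keyed by vruntime stands in for the red-black tree:
    # popping the smallest key corresponds to taking the leftmost node.
    queue = [(0, name) for name in task_names]
    heapq.heapify(queue)
    cpu_time = {name: 0 for name in task_names}

    for _ in range(ticks):
        vruntime, name = heapq.heappop(queue)   # most underserved task
        cpu_time[name] += slice_ns              # it runs for one slice
        # Equal weights assumed: vruntime advances by the real time used.
        heapq.heappush(queue, (vruntime + slice_ns, name))
    return cpu_time


if __name__ == "__main__":
    print(simulate_cfs(["a", "b", "c"], ticks=9))
```

With three tasks and nine decisions, each task ends up with three slices. With ten decisions, no task is ever more than one slice ahead of another, which is the convergence described above.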
Nice values
Nice values range from -20 (highest priority) to 19 (lowest priority), with 0 as the default. They keep the process on the same scheduler and change the rate at which vruntime accumulates.
A process with nice 0 accumulates vruntime at 1x speed. A process with nice -10 accumulates at roughly 0.1x speed, meaning it can run about 10x longer before its vruntime catches up with others. A process with nice 10 accumulates at roughly 10x speed, so its vruntime quickly overtakes that of other tasks and it gives up the CPU sooner.

```bash
# Run a CPU-intensive task at lower priority
nice -n 10 cargo build --release
# Increase priority of an existing process (requires root)
renice -n -5 -p $(pidof my_server)
```

Each nice value maps to a weight, and competing tasks divide CPU time in proportion to their weights:

| Nice value | Weight | CPU share relative to nice 0 |
|---|---|---|
| -20 | 88761 | ~87x |
| -10 | 9548 | ~9.3x |
| 0 | 1024 | 1x (baseline) |
| 10 | 110 | ~0.11x |
| 19 | 15 | ~0.015x (minimum share) |
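
The weights in the table can be turned into numbers directly. The sketch below assumes the usual CFS accounting, where real runtime is scaled by the nice 0 weight divided by the task's weight, and assumes that tasks competing on one core split it in proportion to weight. The function names are hypothetical, and only the five nice values from the table are included.

```python
# Weights copied from the table above; other nice values are omitted.
NICE_TO_WEIGHT = {-20: 88761, -10: 9548, 0: 1024, 10: 110, 19: 15}
NICE_0_WEIGHT = NICE_TO_WEIGHT[0]


def vruntime_delta(runtime_ns, nice):
    """Virtual runtime charged for `runtime_ns` of real CPU time."""
    return runtime_ns * NICE_0_WEIGHT // NICE_TO_WEIGHT[nice]


def cpu_shares(nice_values):
    """Expected long-run CPU fraction for tasks competing on one core."""
    weights = [NICE_TO_WEIGHT[n] for n in nice_values]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    for nice in (-10, 0, 10):
        print(nice, vruntime_delta(1_000_000, nice))
    print(cpu_shares([0, 10]))
```

Under these assumptions, a nice 0 task competing with a single nice 10 task gets about 90% of the core, and one millisecond of real runtime costs the nice 10 task a little over nine milliseconds of vruntime.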
Real-time scheduling
For processes that genuinely can't tolerate scheduling delays, Linux provides real-time schedulers:
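SCHED_FIFO: Run until the task blocks, voluntarily yields, or is preempted by a higher-priority real-time task. No time slicing. A SCHED_FIFO task at priority 99 that never blocks will starve everything else on that core, apart from the small share (5% by default) that the kernel's real-time throttling reserves for other tasks.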
SCHED_RR: Round-robin among tasks at the same priority level. Each task gets a time slice (default 100ms), then yields to the next same-priority task.
Both real-time schedulers always preempt CFS tasks. A SCHED_FIFO task at priority 1 (lowest RT priority) still runs before any normal CFS task.
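
```c
#include <sched.h>

struct sched_param param;
param.sched_priority = 50;  // real-time priorities range from 1 to 99
// pid 0 means the calling thread; requires root or CAP_SYS_NICE
sched_setscheduler(0, SCHED_FIFO, &param);
```

Use real-time scheduling for: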
- Audio processing (JACK, PulseAudio RT threads)
- Industrial control systems
- High-frequency trading execution threads
- Custom kernel bypass networking
Avoid real-time scheduling for general application code. A buggy SCHED_FIFO task that spins can lock up an entire CPU core and make the system unresponsive.
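
The precedence rules in this section can be condensed into a toy model. This is an illustrative sketch only: the `Task` class and `pick_next` function are hypothetical, policies are plain strings rather than kernel constants, and SCHED_RR time slicing is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    policy: str           # "FIFO", "RR", or "OTHER" (normal, CFS-managed)
    rt_priority: int = 0  # 1..99 for FIFO/RR, unused for OTHER
    vruntime: int = 0     # only meaningful for OTHER


def pick_next(runnable):
    """Return the task that would run next, or None if nothing is runnable."""
    if not runnable:
        return None
    rt = [t for t in runnable if t.policy in ("FIFO", "RR")]
    if rt:
        # Any runnable real-time task beats every normal task;
        # among real-time tasks the highest priority wins.
        return max(rt, key=lambda t: t.rt_priority)
    # No real-time work: fall back to the lowest-vruntime normal task.
    return min(runnable, key=lambda t: t.vruntime)


if __name__ == "__main__":
    tasks = [
        Task("shell", "OTHER", vruntime=5),
        Task("audio", "FIFO", rt_priority=50),
        Task("logger", "RR", rt_priority=1),
    ]
    print(pick_next(tasks).name)
```

In this model the normal task only runs once both real-time tasks have left the runnable list, which is why a real-time task that never blocks is so dangerous.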
CPU affinity
By default, the scheduler can migrate threads between any available cores. You can pin threads to specific cores using CPU affinity:

```bash
# Pin process to cores 0 and 1
taskset -c 0,1 ./my_program
# Check current affinity
taskset -p $(pidof my_program)
```

Pinning is useful when you want to eliminate migration overhead and keep cache contents warm. If a thread always runs on core 2, its working set stays in core 2's L1/L2 caches. Migration to core 5 means starting with cold caches.
Pinning reduces the scheduler's flexibility. If core 2 is overloaded and core 5 is idle, the pinned thread cannot move to the idle core. Use pinning deliberately.
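
A program can also set its own affinity instead of relying on taskset. The following is a minimal sketch using Python's `os.sched_getaffinity` and `os.sched_setaffinity`, which are available on Linux only; the helper name `pin_to_cpus` is made up for this example.

```python
import os


def pin_to_cpus(cpus, pid=0):
    """Restrict `pid` (0 = calling thread) to `cpus`; return the new affinity."""
    os.sched_setaffinity(pid, cpus)
    return os.sched_getaffinity(pid)


if __name__ == "__main__":
    allowed = os.sched_getaffinity(0)   # CPUs we may currently run on
    first = min(allowed)
    print("before:", sorted(allowed))
    print("after: ", sorted(pin_to_cpus({first})))
    pin_to_cpus(allowed)                # restore the original mask
```

Restoring the original mask at the end is deliberate: a pin that outlives the code that needed it keeps limiting the scheduler for no benefit.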
Practical debugging
htop shows per-process CPU usage and nice values, and with the PROCESSOR column enabled it also shows the core each thread last ran on. Scheduler-level debugging usually needs more detail:

```bash
# See scheduling policy and priority
chrt -p $(pidof my_program)
# Record scheduling events for 5 seconds, then summarize per-task latency
perf sched record -- sleep 5
perf sched latency
# See voluntary and nonvoluntary context switch counts
grep ctxt_switches /proc/$(pidof my_program)/status
```

A high nonvoluntary_ctxt_switches count means the scheduler is preempting your task frequently, typically because other runnable tasks are competing for the same core and your task keeps using up its share of CPU time. A high voluntary_ctxt_switches count usually means the task blocks often, for example on I/O, which is normal for I/O-bound work.
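
The same counters can be read programmatically, which helps when you want to sample them over time. This is a small sketch that assumes the `/proc/<pid>/status` format with `voluntary_ctxt_switches` and `nonvoluntary_ctxt_switches` lines; the function names are hypothetical.

```python
def parse_ctxt_switches(status_text):
    """Extract context switch counters from /proc/<pid>/status content."""
    wanted = ("voluntary_ctxt_switches", "nonvoluntary_ctxt_switches")
    counts = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in wanted:
            counts[key] = int(value)  # int() ignores the surrounding whitespace
    return counts


def read_ctxt_switches(pid="self"):
    """Read the counters for `pid` (default: the current process)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_ctxt_switches(f.read())


if __name__ == "__main__":
    print(read_ctxt_switches())
```

Taking two readings a few seconds apart and subtracting them gives a rate, which is usually more informative than the lifetime totals.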
