Search Algorithms Through the CPU Cache

The textbook version

You learn binary search pretty early. Sorted array, O(log n) lookups, faster than O(n) linear scan. That is a useful model for large enough arrays. It leaves out one important detail: your CPU reads memory in chunks.

When you write array[i], the code names one element. The CPU loads an entire cache line, usually 64 bytes on x86 processors. With 4-byte integers, that is 16 integers pulled toward L1 cache at once.

This changes the performance story completely.

How the cache hierarchy works

Your CPU has multiple levels of cache between it and main memory. On a modern processor, the typical layout looks something like this:

L1 Cache

32-64 KB

L1 Latency

~1 ns

L2 Cache

256 KB-1 MB

L2 Latency

~3-5 ns

L3 Cache

8-64 MB

L3 Latency

~10-20 ns

Main Memory

16-128 GB

RAM Latency

~60-100 ns

L1 is tiny and extremely fast. L3 is much larger and usually 10-20x slower. Main memory adds another 5-10x of latency. The cache hierarchy exists to exploit temporal locality (recently used data tends to be used again) and spatial locality (nearby data tends to be used soon).

Sequential access patterns play perfectly into spatial locality. You touch element 0, the CPU loads elements 0 through 15 into a cache line. When you touch element 1, it's already in L1. Free. Elements 2 through 15 are also free. You only pay the cost of a cache miss every 16 elements.

Linear search: the sequential access pattern

Here is the instruction-level shape of linear search. The access pattern is simple and steady.

fn linear_search(arr: &[i32], target: i32) -> Option<usize> {
    for (i, &val) in arr.iter().enumerate() {
        if val == target {
            return Some(i);
        }
    }
    None
}

Hit play on this. Watch how the cache fills up predictably as linear search walks through the array left to right.

Linear Search -- looking for 76

Index

Comparisons

Cache Loads

Status

Searching

Array (sequential scan)

L1 Cache (4 lines x 8 elements)

0 / 33

Notice how the cache loads happen in clean, predictable bursts. Every 8 elements, a new cache line gets pulled in. The hardware prefetcher detects the pattern and starts loading the next cache line ahead of time.

For a 32-element array, linear search does 32 comparisons and needs only 4 cache line loads. With prefetching, most of those loads overlap with computation.

Binary search: the random access pattern

Binary search looks great on paper. Each iteration cuts the search space in half.

fn binary_search(arr: &[i32], target: i32) -> Option<usize> {
    let mut lo = 0usize;
    let mut hi = arr.len();

    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        match arr[mid].cmp(&target) {
            std::cmp::Ordering::Equal => return Some(mid),
            std::cmp::Ordering::Less => lo = mid + 1,
            std::cmp::Ordering::Greater => hi = mid,
        }
    }
    None
}

Now watch binary search. Same array, same target. Pay attention to how it jumps around the cache.

Binary Search -- looking for 72

Range

[0..31]

Mid

Comparisons

Cache Loads

Status

Searching

Array (binary search)

L1 Cache (4 lines x 8 elements)

0 / 7

Binary search does far fewer comparisons. Its memory behavior is less friendly. Each comparison accesses a different region of the array: middle, quarter point, eighth point. The hardware prefetcher has little to work with because there is no sequential pattern.

Every single comparison is almost guaranteed to be a cache miss if the array is larger than L1. That means every comparison pays the full latency cost of whichever cache level currently holds the data.

The cost breakdown

Metric	Linear Search	Binary Search
Time complexity	O(n)	O(log n)
Comparisons (32 elements)	32 worst case	5 worst case
Cache line loads	4 (sequential)	5 (random)
Prefetch friendly	Yes	No
Branch prediction	~100% accurate	~50% accurate
Pipeline stalls	Minimal	Frequent

For small arrays, the cache and branch prediction advantages of linear search can outweigh the algorithmic advantage of binary search. The crossover varies by hardware.

The cache line simulator

This is what sequential memory access looks like from the cache's perspective. Each row in the cache represents one cache line. Watch how sequential access achieves near-perfect spatial locality.

Cache Line Simulator -- Sequential Access

Address

Accesses

Hits

Misses

Hit Rate

Main Memory (64 bytes)

L1 Cache (4 lines)

0 / 64

The hit rate climbs quickly because once a cache line is loaded, all 8 elements in it are served from L1 for free. This is why sequential iteration is so fast. You're paying for one cache miss and getting 7 cache hits bundled with it.

Where this matters

For small arrays (say, under a few hundred elements), linear search can genuinely beat binary search in wall-clock time because it's so cache-friendly. The hardware prefetcher keeps L1 warm, the branch predictor is happy because you're doing the same comparison in a tight loop, and you never stall waiting for memory.

Binary search wins once the array is large enough for the O(log n) advantage to dominate the cache penalty. For a million 4-byte integers, binary search does about 20 comparisons. Even if every comparison misses cache, 20 misses at 100ns each is around 2 microseconds. Linear search would touch far more memory.

The crossover point depends on your hardware, element size, and cache hierarchy. It is worth knowing that it exists. O(n) with excellent cache behavior can beat O(log n) with poor cache behavior.

The branch prediction angle

There's another dimension here that people rarely talk about: branch prediction.

Linear search has one branch: if array[i] == target. This branch is "not taken" for almost every iteration and "taken" once when you find the element. The branch predictor learns this pattern immediately and predicts "not taken" every time. It's wrong exactly once at the end. That's about as good as it gets.

Binary search has a different branch pattern: if array[mid] < target, go right, else go left. From the branch predictor's perspective, the direction often looks close to random because it depends on the data. Every misprediction flushes the instruction pipeline, which on a modern out-of-order CPU means throwing away 15-20 cycles of speculative work.

You can see this in C++ with a branchless binary search. The cmov instruction avoids the branch entirely:

size_t branchless_lower_bound(const int* arr, size_t n, int target) {
    size_t lo = 0;
    while (n > 1) {
        size_t half = n / 2;
        // cmov: no branch, no misprediction
        lo += (arr[lo + half] < target) * half;
        n -= half;
    }
    return lo;
}

This version avoids branch misprediction entirely by using conditional moves. The CPU doesn't need to predict anything because there's no branch to predict. On large arrays with random targets, this can be 30-40% faster than the branching version purely from eliminating misprediction penalties.

What to do with this

Binary search is still the right choice for large sorted arrays. When you are performance engineering, Big-O is the starting point of the analysis.

Things that matter in practice:

Array size relative to cache size. If it fits in L1, cache behavior dominates. If it's way bigger than L3, algorithmic complexity dominates.
Access pattern predictability. Sequential patterns get prefetched. Random patterns usually stall.
Branch predictability. Tight loops with predictable branches run faster than code with data-dependent branches.
Element size. Bigger elements mean fewer elements per cache line, which means more cache misses for the same array length.

If search latency matters at the microsecond level, as it can in trading systems or game engine spatial queries, these details are worth measuring. For a list of 50 items in a web app, readability matters more.

The hardware is doing an incredible amount of work to make your code fast. Understanding that work helps you write code that fits the machine.

A source code themed illustration with bright light and floating code — Algorithms are clearer when you can see the shape of the work.