Latency Engineering for Trading Systems

Why microseconds matter

In high-frequency trading, the difference between filling an order and missing it can be a single microsecond. If two firms send identical orders, the one that arrives first gets the fill. Over millions of trades, consistent latency advantages compound into significant returns.

The typical latency budget for a competitive HFT system:

Component	Target latency
Market data parsing	< 1 us
Signal computation	< 2 us
Order generation	< 0.5 us
Network to exchange	5-50 us (depends on distance)
Total tick-to-trade	< 10 us (same colo)

Everything from the operating system to the hardware is optimized to hit these numbers.
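The arithmetic behind the table can be checked in a few lines. This is a minimal sketch, assuming the total target is meant to include the shortest network hop (5 us); the constant and function names are made up for illustration.

```cpp
#include <cassert>

// Targets copied from the table above, in nanoseconds.
constexpr long kParseNs = 1000;         // market data parsing, < 1 us
constexpr long kSignalNs = 2000;        // signal computation, < 2 us
constexpr long kOrderGenNs = 500;       // order generation, < 0.5 us
constexpr long kNetworkMinNs = 5000;    // network to exchange, low end of 5-50 us
constexpr long kTickToTradeNs = 10000;  // total target, same colo

// Time spent inside the trading host across the three software stages.
constexpr long on_host_ns() { return kParseNs + kSignalNs + kOrderGenNs; }

// Slack left once the shortest network hop is added.
// Assumption: the total target includes that hop.
constexpr long slack_ns() {
    return kTickToTradeNs - (on_host_ns() + kNetworkMinNs);
}

// Check that a host time and a network time together stay under the target.
inline bool fits_budget(long host_ns, long network_ns) {
    assert(host_ns >= 0 && network_ns >= 0);
    return host_ns + network_ns < kTickToTradeNs;
}
```

Under these assumptions the three software stages use 3.5 us, and a 5 us hop leaves about 1.5 us of slack. With a 50 us hop, fits_budget returns false no matter how fast the software is, which is why colocation matters.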

Kernel bypass networking

The Linux kernel's network stack adds 5-20 microseconds of latency per packet. Normal applications rarely notice this. Trading systems often do.

Kernel bypass frameworks (DPDK, Solarflare OpenOnload, Mellanox VMA) let your application talk directly to the network card without going through the kernel. The NIC's receive ring buffer is mapped into your process's address space. When a packet arrives, your code reads it directly from the ring buffer without a system call, context switch, or interrupt.

// Simplified DPDK-style receive loop
while (true) {
    uint16_t nb_rx = rte_eth_rx_burst(port, queue, pkts, BURST_SIZE);
    for (uint16_t i = 0; i < nb_rx; i++) {
        process_packet(pkts[i]);
        rte_pktmbuf_free(pkts[i]);
    }
}

This loop polls continuously. There are no interrupts, sleeps, or context switches. The dedicated CPU core runs at 100% utilization even when no packets are arriving, and the latency from packet arrival to processing start can stay under 1 microsecond.
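The same busy-polling pattern can be tried without a bypass NIC. The sketch below is not DPDK code: it polls a hypothetical one-slot Mailbox in shared memory, and the names Mailbox, post, and poll_until_message are invented for this example.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical one-slot mailbox shared by one producer and one consumer.
// 0 means "empty"; any other value is a pending message.
struct Mailbox {
    alignas(64) std::atomic<std::uint64_t> slot{0};
};

// Producer side: publish a message. Release ordering makes the producer's
// earlier writes visible to the consumer that observes the message.
inline void post(Mailbox& box, std::uint64_t msg) {
    assert(msg != 0 && "0 is reserved for the empty state");
    box.slot.store(msg, std::memory_order_release);
}

// Consumer side: spin until a message appears, then take it and mark the
// slot empty. There is no sleep or blocking call, so the calling core stays
// fully busy while it waits.
inline std::uint64_t poll_until_message(Mailbox& box) {
    for (;;) {
        const std::uint64_t msg = box.slot.load(std::memory_order_acquire);
        if (msg != 0) {
            box.slot.store(0, std::memory_order_release);
            return msg;
        }
    }
}
```

In a real pipeline the consumer would run poll_until_message on its own pinned core while another thread calls post. A single slot is overwritten if the producer posts twice before the consumer reads, so this only illustrates the polling idea, not a usable queue.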

Lock-free data structures

When multiple threads need to communicate, locks add too much latency. Acquiring a mutex involves an atomic operation that can take 25+ nanoseconds when uncontended and microseconds when contended.

Lock-free SPSC (single-producer, single-consumer) queues use atomic operations and memory ordering to pass data between exactly two threads without any locks:

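#include <atomic>
#include <cstddef>

// Ring buffer for exactly one producer thread and one consumer thread.
// One slot always stays empty to tell "full" from "empty", so it holds N - 1 items.
template<typename T, std::size_t N>
struct SPSCQueue {
    alignas(64) std::atomic<std::size_t> head{0};  // next slot to write; stored only by the producer
    alignas(64) std::atomic<std::size_t> tail{0};  // next slot to read; stored only by the consumer
    alignas(64) T buffer[N];

    // Producer only. Returns false when the queue is full.
    bool push(const T& item) {
        std::size_t h = head.load(std::memory_order_relaxed);
        std::size_t next = (h + 1) % N;
        if (next == tail.load(std::memory_order_acquire))
            return false;
        buffer[h] = item;
        // Release: the item write above becomes visible before the new head.
        head.store(next, std::memory_order_release);
        return true;
    }

    // Consumer only. Returns false when the queue is empty.
    bool pop(T& item) {
        std::size_t t = tail.load(std::memory_order_relaxed);
        if (t == head.load(std::memory_order_acquire))
            return false;
        item = buffer[t];
        // Release: the slot is handed back only after the item has been copied out.
        tail.store((t + 1) % N, std::memory_order_release);
        return true;
    }
};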

The alignas(64) prevents false sharing by ensuring head and tail live on different cache lines. If they shared a cache line, writing to head on one core would invalidate the cache line containing tail on the other core, causing unnecessary cache coherence traffic.
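Two properties of this queue are easy to miss: the indices really do sit on separate cache lines, and a queue declared with N slots holds only N - 1 items. The sketch below repeats the queue in compact form so that it compiles on its own. SpscRing, kCacheLine, and capacity_demo are names chosen for illustration, and the 64-byte line size is an assumption that matches common x86-64 parts.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

constexpr std::size_t kCacheLine = 64;  // assumption: 64-byte cache lines

template <typename T, std::size_t N>
struct SpscRing {
    alignas(kCacheLine) std::atomic<std::size_t> head{0};  // stored by producer
    alignas(kCacheLine) std::atomic<std::size_t> tail{0};  // stored by consumer
    alignas(kCacheLine) T buffer[N];

    bool push(const T& item) {
        const std::size_t h = head.load(std::memory_order_relaxed);
        const std::size_t next = (h + 1) % N;
        if (next == tail.load(std::memory_order_acquire)) return false;  // full
        buffer[h] = item;
        head.store(next, std::memory_order_release);
        return true;
    }

    bool pop(T& item) {
        const std::size_t t = tail.load(std::memory_order_relaxed);
        if (t == head.load(std::memory_order_acquire)) return false;  // empty
        item = buffer[t];
        tail.store((t + 1) % N, std::memory_order_release);
        return true;
    }
};

using Ring4 = SpscRing<int, 4>;

// Layout check: head and tail start at least one cache line apart.
static_assert(offsetof(Ring4, tail) - offsetof(Ring4, head) >= kCacheLine,
              "head and tail must not share a cache line");

// Single-threaded walk-through of the capacity rule: 4 slots, 3 usable.
inline void capacity_demo() {
    Ring4 q;
    assert(q.push(1) && q.push(2) && q.push(3));
    assert(!q.push(4));  // the fourth slot stays empty to mark "full"
    int v = 0;
    assert(q.pop(v) && v == 1);  // FIFO order
    assert(q.push(4));           // space is available again after a pop
}
```

If the usable capacity has to be an exact number of messages, declare the queue one slot larger than that number.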

Memory pre-allocation

Dynamic memory allocation (malloc, new, Vec::push) can trigger system calls and page faults, and in managed languages it can trigger garbage collection. Hot paths avoid these costs.

Trading systems pre-allocate all memory at startup:

Object pools for orders, fills, and market data messages
Ring buffers for inter-thread communication
Memory-mapped huge pages to avoid TLB misses

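// Assumes Order implements Default; its fields are omitted here.
struct OrderPool {
    storage: Vec<Order>,
    free_list: Vec<usize>,
}

impl OrderPool {
    fn new(capacity: usize) -> Self {
        Self {
            // Construct every slot up front. Vec::with_capacity alone reserves
            // memory but leaves the Vec empty, so indexing into it would panic.
            storage: (0..capacity).map(|_| Order::default()).collect(),
            free_list: (0..capacity).rev().collect(),
        }
    }

    // Returns the slot index along with the order so the caller can release it later.
    fn allocate(&mut self) -> (usize, &mut Order) {
        let idx = self.free_list.pop().expect("pool exhausted");
        (idx, &mut self.storage[idx])
    }

    // Marks the slot as reusable; the Order stays in place and is overwritten on reuse.
    fn deallocate(&mut self, idx: usize) {
        self.free_list.push(idx);
    }
}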
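The huge-page item in the list above can be sketched in the same spirit: reserve the memory once at startup and touch every page so that no first-touch page fault lands on the hot path. This is a Linux-only sketch under stated assumptions. reserve_prefaulted and release_region are hypothetical helper names, and the fallback to normal pages is a choice made for this example, not something the text above prescribes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <sys/mman.h>

// Reserve `bytes` of anonymous memory and pre-fault it. Returns nullptr on
// failure. `bytes` should be a multiple of the huge page size (2 MiB is the
// common x86-64 default; that figure is an assumption here).
inline void* reserve_prefaulted(std::size_t bytes) {
    assert(bytes > 0);
    const int prot = PROT_READ | PROT_WRITE;
    const int flags = MAP_PRIVATE | MAP_ANONYMOUS;

    // First try explicit huge pages. This fails unless huge pages have been
    // reserved on the machine (for example through vm.nr_hugepages).
    void* p = mmap(nullptr, bytes, prot, flags | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        // Fall back to normal pages and request transparent huge pages.
        p = mmap(nullptr, bytes, prot, flags, -1, 0);
        if (p == MAP_FAILED) return nullptr;
        madvise(p, bytes, MADV_HUGEPAGE);  // best effort; result ignored
    }

    // Write to every page now so the kernel backs it with physical memory
    // before the hot path ever touches it.
    std::memset(p, 0, bytes);
    return p;
}

// Return the region to the kernel at shutdown.
inline void release_region(void* p, std::size_t bytes) {
    munmap(p, bytes);
}
```

An object pool or ring buffer can then be placed inside the returned region instead of on the normal heap. Whether the explicit huge-page path succeeds depends on how the machine is configured.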

CPU isolation

On a server with 32 cores, you want your hot-path threads to have exclusive access to specific cores. No kernel threads, no interrupts, no other processes sharing the same core.

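# Boot parameter: isolate cores 16-23 from the scheduler
# (set in /etc/default/grub, then regenerate the GRUB config and reboot)
GRUB_CMDLINE_LINUX="isolcpus=16-23 nohz_full=16-23 rcu_nocbs=16-23"

# Move all interrupts to core 0 (mask 1 = CPU 0). Requires root; some IRQs
# cannot be moved, and a running irqbalance daemon may move them back.
for irq in /proc/irq/*/smp_affinity; do
    echo 1 > "$irq"
done

# Pin the trading process to an isolated core
taskset -c 16 ./trading_engine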

nohz_full stops the periodic kernel timer tick on an isolated core while that core has only one runnable task. rcu_nocbs moves RCU callback processing off those cores. Combined with isolcpus, the isolated core runs your code with virtually zero kernel interference.
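taskset pins the whole process. When an engine runs several hot-path threads, each one usually pins itself to its own isolated core instead. The following is a minimal Linux-only sketch using sched_setaffinity; pin_current_thread and first_allowed_cpu are hypothetical helper names.

```cpp
#include <cassert>
#include <sched.h>

// Pin the calling thread to a single CPU. Returns true on success.
inline bool pin_current_thread(int cpu) {
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    // A pid of 0 means "the calling thread".
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}

// Lowest-numbered CPU the calling thread may currently run on, or -1 on error.
inline int first_allowed_cpu() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &set)) return cpu;
    }
    return -1;
}
```

With the boot parameters above, a hot-path thread would call pin_current_thread(16) as its first action, which has the same effect for that thread as the taskset line has for the whole process. The call fails if the requested core is outside the set the process is allowed to use, so the return value is worth checking at startup.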

The full picture

A low-latency trading system stack often includes:

Kernel bypass NIC with hardware timestamping
Dedicated, isolated CPU cores for hot path threads
Lock-free SPSC queues between pipeline stages
Pre-allocated object pools, zero malloc on hot path
Huge pages for TLB efficiency
FPGA or custom hardware for the most latency-critical operations

Most of these techniques apply beyond trading. Any system that needs deterministic, low-latency processing benefits from reducing kernel crossings, locks, allocation, and shared cache-line traffic.
