The single biggest performance lesson I ever learned is that modern CPUs are not slow at computing, they are slow at waiting for memory. A cache miss to main memory can cost a couple hundred cycles. In that time the CPU could have done hundreds of additions. Once you internalize that gap, you start designing programs around how data moves, not just what operations run on it. That is the heart of data-oriented design.
The memory hierarchy is the real machine
Between the CPU and main memory sit several layers of cache: L1 is tiny and nearly as fast as registers, L2 is bigger and slower, L3 bigger and slower still. When the CPU needs a byte, it does not fetch one byte. It fetches a whole cache line, typically 64 bytes, and stores it in cache. If your next access is within that line, it is nearly free. If it is somewhere far away, you pay the full miss penalty.
This means the layout of your data in memory directly controls your performance. Two programs doing the identical arithmetic can differ by an order of magnitude based purely on access patterns.
Arrays of structs versus structs of arrays
The classic example is how you store a collection of records. The intuitive object-oriented layout is an array of structs:
struct Particle {
float x, y, z; // position
float vx, vy, vz; // velocity
float mass;
char name[32];
};
struct Particle particles[100000];
// update positions
for (int i = 0; i < 100000; i++) {
particles[i].x += particles[i].vx;
}
This loop only touches x and vx, but every cache line it loads is full of mass, name, and the other fields you do not need. You are dragging cold data through cache for nothing. The data-oriented layout splits the fields into parallel arrays, a struct of arrays:
struct Particles {
float x[100000], y[100000], z[100000];
float vx[100000], vy[100000], vz[100000];
float mass[100000];
};
for (int i = 0; i < 100000; i++) {
p.x[i] += p.vx[i];
}
Now the x array and vx array are each densely packed. Every byte you pull into cache is a byte you use. On real hardware this kind of change routinely gives 3x to 10x speedups on hot loops, with no algorithmic change at all.
Why pointer chasing hurts
Linked lists and node-based trees are the opposite of cache-friendly. Each node is a separate heap allocation that can live anywhere, so traversing the list jumps all over memory and misses the cache on nearly every step. If you understand stack vs heap memory and how heap allocations scatter, this follows naturally. A flat array you scan linearly is dramatically faster than a linked list with the same number of elements, even though both are O(n), because the hardware prefetcher can predict and preload sequential access.
- Prefer contiguous arrays over node-based structures when you iterate often.
- Group fields you access together, and separate the ones you do not.
- Process data in the order it sits in memory whenever you can.
- Keep hot data small so more of it fits in cache at once.
It is the same idea as a good allocator
This is why allocation strategy matters so much. If you allocate ten thousand objects one at a time, they end up scattered. If you allocate them in one block, they sit together and iterate fast. The custom allocator in writing a simple memory allocator gives you exactly this control: you decide the layout instead of leaving it to chance.
False sharing, the multicore trap
There is a nastier cache effect that shows up with threads. If two cores each write to different variables that happen to live in the same 64 byte cache line, the hardware has to keep bouncing that line between their caches even though the threads never touch the same bytes. This is called false sharing, and it can quietly destroy the scaling of an otherwise parallel program. The fix is to pad or align the per-thread data so each thread's hot variables sit on their own cache line. I have seen a loop go from no speedup on eight cores to near linear scaling with nothing more than a padding field added to a struct.
When to reach for this
I do not restructure everything around the cache. For code that runs once or rarely, clarity wins. But for the hot loops, the inner kernels that run millions of times, data layout is usually the first thing I tune and often the highest return. Profile first, find where the cache misses are, then lay the data out the way the hardware wants to read it. The machine is happy to be fast if you stop making it wait.