System Design Notes
CACHING & LOAD DISTRIBUTION · BEGINNER
Last updated June 19, 2026 · 8 min read

Caching Fundamentals

Caching keeps copies of hot data in a small, fast layer so reads skip the slow source. The hit-ratio math, the cache hierarchy, and what is safe to cache.

caching · cache-hit-ratio · cdn · redis · performance

A cache with a 95 percent hit rate sounds excellent. Do the math and it is not quite there yet. If a cache hit takes 1 millisecond and a miss takes 100, a 95 percent hit rate still leaves your average read at 5.95 milliseconds, because the rare misses dominate the total. Push that rate to 99 percent and the average drops to 1.99 milliseconds, roughly three times faster. Caching lives and dies by those last few percentage points.

This note covers what a cache is, the hit-rate math that decides whether it helps, the memory hierarchy caches live in, and which data is safe to put in one.

What problem does caching solve?

A cache is a small, fast store that holds copies of data close to where it is needed, so repeat reads skip the slow or expensive source behind it. The source might be a disk, a database, a remote API, or a function that takes real CPU to run. The cache trades a little memory and a little staleness for a large drop in latency and load.

The naive approach is to send every read to the source. That works until the source is slow, costly, or capped on throughput, and the same data gets read over and over. A product page that runs the identical database query for every visitor does the same work thousands of times a second, for an answer that barely changes.

Caching works because real workloads are lopsided. A small set of hot items takes most of the requests, and any given item is likely to be read again soon. Web traffic in particular follows a Zipf-like popularity curve. Breslau and colleagues measured this across six proxy traces in 1999 and found that the hit ratio a cache can reach grows only logarithmically with its size. The first slice of capacity captures the hottest items, and each further increase in size adds less. The practical reading is that a cache far smaller than the full dataset can still absorb the majority of the reads.
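
To see how that plays out, here is a minimal sketch that computes the best-case hit ratio for an idealized Zipf workload. It assumes a popularity exponent of exactly 1, one million distinct items, and a cache that always holds the most popular items. All three are illustration choices, not figures from the Breslau paper, and real traces and real eviction policies will do worse.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def harmonic(n: int) -> float:
    """Harmonic number H(n) = 1 + 1/2 + ... + 1/n.

    Under a Zipf law with exponent 1 (an assumption), item i is requested
    with probability (1/i) / H(total_items).
    """
    return sum(1.0 / i for i in range(1, n + 1))


def ideal_hit_ratio(cache_items: int, total_items: int) -> float:
    """Share of requests that target the `cache_items` most popular items."""
    return harmonic(cache_items) / harmonic(total_items)


TOTAL_ITEMS = 1_000_000  # hypothetical dataset size

for divisor in (1000, 100, 10):
    cached = TOTAL_ITEMS // divisor
    ratio = ideal_hit_ratio(cached, TOTAL_ITEMS)
    print(f"cache holds 1/{divisor} of the items -> ideal hit ratio {ratio:.0%}")
```

Under these assumptions the three cache sizes come out at roughly 52, 68, and 84 percent. Each tenfold increase in size adds about the same 16 points, which is the logarithmic growth described above.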

How does a cache actually work?

A read checks the cache first: a hit returns the copy immediately, and a miss falls through to the source, returns that result, and usually stores it for next time.

The read path has four steps:

  1. A request arrives for a key.
  2. The cache is checked. If the key is present and still fresh, that is a hit, and the value returns right away.
  3. If the key is absent or expired, that is a miss. The request fetches from the source.
  4. The result is stored in the cache with a freshness limit, then returned. The next read of that key is a hit.
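
The four steps above can be sketched as a small read-through cache. This is a minimal, single-threaded illustration: the class name, the `load` callback, and the injectable `clock` are all hypothetical, and there is no size limit or eviction.

```python
import time
from typing import Any, Callable, Dict, Tuple


class ReadThroughCache:
    """Check the cache first, fall through to the source on a miss."""

    def __init__(
        self,
        load: Callable[[str], Any],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._load = load          # the slow source (database, API, ...)
        self._ttl = ttl_seconds    # the freshness limit from step 4
        self._clock = clock
        self._entries: Dict[str, Tuple[Any, float]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> Any:
        # Step 2: present and still fresh is a hit.
        entry = self._entries.get(key)
        if entry is not None:
            value, expires_at = entry
            if self._clock() < expires_at:
                self.hits += 1
                return value

        # Step 3: absent or expired is a miss, so fetch from the source.
        self.misses += 1
        value = self._load(key)

        # Step 4: store with a freshness limit, then return.
        self._entries[key] = (value, self._clock() + self._ttl)
        return value


def slow_lookup(key: str) -> str:
    """Stand-in for an expensive source."""
    return key.upper()


cache = ReadThroughCache(slow_lookup, ttl_seconds=60)
cache.get("product:42")   # miss, goes to the source
cache.get("product:42")   # hit, served from the cache
print(cache.hits, cache.misses)
```

The `hits` and `misses` counters are there so the hit ratio can be computed from the same object.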

The number that matters is the hit ratio, the share of reads served from the cache:

hit ratio    = hits / (hits + misses)
avg latency  = hit_ratio x cache_latency + (1 - hit_ratio) x source_latency

That second line is why the gap between a good and a great hit rate is so large. With a 1 ms cache in front of a 100 ms source, 90 percent gives 10.9 ms, 95 percent gives 5.95 ms, and 99 percent gives 1.99 ms. Every step closer to 100 percent strips out misses, and the misses are where most of the time goes: even at 99 percent, they still account for about half of the average.
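
The same arithmetic as a short sketch. The function names are made up for illustration. The second function rearranges the average-latency formula to answer the reverse question: what hit ratio a latency target requires.

```python
def average_latency_ms(hit_ratio: float, cache_ms: float, source_ms: float) -> float:
    """avg latency = hit_ratio x cache_latency + (1 - hit_ratio) x source_latency"""
    return hit_ratio * cache_ms + (1 - hit_ratio) * source_ms


def required_hit_ratio(target_ms: float, cache_ms: float, source_ms: float) -> float:
    """Solve the formula above for hit_ratio, given a target average latency."""
    return (source_ms - target_ms) / (source_ms - cache_ms)


for ratio in (0.90, 0.95, 0.99):
    print(f"{ratio:.0%} hits -> {average_latency_ms(ratio, 1, 100):.2f} ms")

# Hit ratio needed to average 2 ms with a 1 ms cache and a 100 ms source.
print(f"{required_hit_ratio(2, 1, 100):.2%}")
```

With the 1 ms and 100 ms figures used in this note, an average of 2 ms needs a hit ratio just under 99 percent.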

Where do caches live?

Caches sit at every layer of a system, each one larger and slower than the one above it. Your CPU keeps L1, L2, and L3 caches in front of main memory, working on 64-byte lines. A database keeps a buffer pool of hot pages in RAM so most reads never touch disk. Applications add an in-memory cache such as Redis or Memcached. A CDN caches responses at edge locations near users. The browser keeps its own private cache.

The latency gaps explain the layering. An L1 hit takes roughly a nanosecond, main memory is around 100 nanoseconds, a same-datacenter round trip is hundreds of microseconds, and a cross-continent round trip is over 100 milliseconds. These ladder figures, popularized by Jeff Dean from earlier numbers by Peter Norvig, are approximate orders of magnitude, but the lesson holds: every layer you avoid saves at least one order of magnitude of latency, and often two or three.

| Cache layer | Typical hit latency | Shared by | Example |
| --- | --- | --- | --- |
| CPU caches (L1 to L3) | ~1 to 20 ns | One core or socket | 64-byte cache lines |
| Database buffer pool | ~100 ns in RAM vs ~10 ms disk | One database node | Postgres shared_buffers, InnoDB buffer pool |
| In-memory data cache | well under 1 ms in-datacenter | Many services | Redis, Memcached |
| CDN edge cache | tens of ms, near the user | Users in a region | Cloudflare, Fastly |
| Browser private cache | local, near-zero network | One user | HTTP Cache-Control |
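
Stacked layers multiply their effect, because each layer only sees the misses of the one in front of it. The sketch below estimates the expected latency of a request that walks down a stack. The three layers, their latencies, and their hit ratios are hypothetical round numbers chosen for illustration, not measurements.

```python
from typing import List, Tuple

# (name, latency in ms to check this layer, hit ratio for requests reaching it)
Layer = Tuple[str, float, float]

# Hypothetical stack; the last layer is the source of truth and always answers.
LAYERS: List[Layer] = [
    ("in-memory data cache", 0.5, 0.90),
    ("database, page in buffer pool", 1.0, 0.95),
    ("database, page on disk", 10.0, 1.00),
]


def expected_latency_ms(layers: List[Layer]) -> float:
    """Sum each layer's latency weighted by the chance a request reaches it."""
    expected = 0.0
    reach_probability = 1.0
    for _name, latency_ms, hit_ratio in layers:
        expected += reach_probability * latency_ms
        reach_probability *= 1.0 - hit_ratio  # only misses continue downward
    return expected


print(f"with the cache layer:    {expected_latency_ms(LAYERS):.2f} ms")
print(f"without the cache layer: {expected_latency_ms(LAYERS[1:]):.2f} ms")
```

With these made-up numbers, only 1 in 200 requests ever reaches the disk, so the 10 ms layer contributes just 0.05 ms to the average.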

When should you not cache something?

Skip the cache when data must always be current, when it changes about as often as it is read, or when it is unique to one request, because the cache will either lie or never get reused.

  • Reads that must be correct. Account balances, inventory at checkout, and authorization decisions cannot serve a stale value. Read from the source of truth, or pair the cache with explicit invalidation and accept the extra complexity.
  • Write-heavy or rarely reused data. If each item is read once or twice before it changes, the cache pays storage and invalidation cost for almost no hits. Drop the cache and size the source for the load instead.
  • Private data in a shared cache. Putting a personalized response in a CDN or proxy can hand one user's data to another. RFC 9111 reserves that case for a private cache, so mark shared responses public and personalized ones private.
  • A single hot key, not a broad key space. One viral item still lands on one cache entry on one node. Replicate the hot key across nodes or coalesce duplicate requests rather than caching harder.
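
For the HTTP cases in the list above, the decision can be sketched as a small helper that picks a Cache-Control value. The function and its parameters are hypothetical; the directives it emits (`no-store`, `private`, `public`, `max-age`) are the ones defined in RFC 9111. A real policy would likely need more inputs than these three.

```python
def cache_control(
    personalized: bool,
    max_age_seconds: int,
    must_be_current: bool = False,
) -> str:
    """Pick a Cache-Control header value for a response (illustrative policy)."""
    if must_be_current:
        # Balances, checkout inventory, authorization decisions: store nothing.
        return "no-store"
    # Personalized responses may live only in the user's own private cache;
    # everything else may also be stored by shared caches such as a CDN.
    scope = "private" if personalized else "public"
    return f"{scope}, max-age={max_age_seconds}"


print(cache_control(personalized=False, max_age_seconds=300))  # product page
print(cache_control(personalized=True, max_age_seconds=60))    # account dashboard
print(cache_control(personalized=True, max_age_seconds=0, must_be_current=True))
```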

Two failure modes bite even when caching is the right call: stale reads after the source changes, which is the cache invalidation problem, and the cache stampede, where a popular key expires and a flood of requests hits the source at once. Both have standard fixes, and each gets its own note.
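
As a preview of the stampede fix, here is a minimal sketch of request coalescing: when many threads miss on the same key at once, one of them loads from the source and the rest wait for its result. The class and method names are hypothetical, and expiry is left out to keep the sketch short.

```python
import threading
from typing import Any, Callable, Dict


class CoalescingCache:
    """Let one caller load a missing key while concurrent callers wait."""

    def __init__(self, load: Callable[[str], Any]) -> None:
        self._load = load
        self._lock = threading.Lock()
        self._values: Dict[str, Any] = {}
        self._in_flight: Dict[str, threading.Event] = {}

    def get(self, key: str) -> Any:
        while True:
            with self._lock:
                if key in self._values:
                    return self._values[key]
                event = self._in_flight.get(key)
                leader = event is None
                if leader:
                    # First caller to miss becomes the leader for this key.
                    event = threading.Event()
                    self._in_flight[key] = event

            if not leader:
                # Someone else is loading: wait, then re-check the cache.
                event.wait()
                continue

            try:
                value = self._load(key)
                with self._lock:
                    self._values[key] = value
                return value
            finally:
                # Wake the waiters whether the load succeeded or raised.
                with self._lock:
                    del self._in_flight[key]
                event.set()
```

If the leader's load raises, the waiters wake up, find nothing cached, and one of them becomes the new leader, so a single failure does not strand the others.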

How do real systems use caching?

Every layer of production infrastructure runs a cache, usually several.

  • CPU caches. Every modern processor fronts main memory with L1, L2, and L3 caches on 64-byte lines, each level larger and slower than the last. This is the textbook memory hierarchy from Hennessy and Patterson, and it is why data layout changes how fast code runs.
  • Redis and Memcached. The two default in-memory caches. Redis says it can be used "as a database, cache, streaming engine, message broker, and more," and exposes keyspace_hits and keyspace_misses counters so you can watch the ratio. Memcached calls itself a "high-performance, distributed memory object caching system."
  • Meta's memcache. The "Scaling Memcache at Facebook" paper (Nishtala and colleagues, USENIX NSDI 2013) describes a memcached-based key-value layer that "handles billions of requests per second and holds trillions of items." It is the canonical write-up of caching at scale.
  • CDNs. Cloudflare and Fastly cache responses at edge locations near users and label each one HIT or MISS. Fastly's own docs tell operators to "aim for a 90%+ cache hit ratio," all driven by the HTTP Cache-Control rules in RFC 9111.
  • Database buffer pools. Databases cache hot pages in RAM so most reads never reach disk. PostgreSQL's shared_buffers defaults to 128 MB, with a common starting point of 25 percent of system RAM, and MySQL's InnoDB buffer pool also defaults to 128 MB and reports a "Buffer pool hit rate" you can monitor.
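
The Redis counters mentioned above turn into a hit ratio with one division. The sketch below assumes you have already fetched the stats into a dictionary keyed by field name; the sample numbers are invented. Because `keyspace_hits` and `keyspace_misses` are cumulative counters, the second function compares two samples to get the ratio for a recent window.

```python
from typing import Mapping


def hit_ratio(stats: Mapping[str, int]) -> float:
    """Cumulative hit ratio from Redis keyspace_hits and keyspace_misses."""
    hits = int(stats["keyspace_hits"])
    misses = int(stats["keyspace_misses"])
    total = hits + misses
    return hits / total if total else 0.0


def interval_hit_ratio(before: Mapping[str, int], after: Mapping[str, int]) -> float:
    """Hit ratio for the window between two samples of the counters."""
    delta = {
        "keyspace_hits": int(after["keyspace_hits"]) - int(before["keyspace_hits"]),
        "keyspace_misses": int(after["keyspace_misses"]) - int(before["keyspace_misses"]),
    }
    return hit_ratio(delta)


# Invented sample values for illustration.
earlier = {"keyspace_hits": 9_500, "keyspace_misses": 500}
later = {"keyspace_hits": 10_400, "keyspace_misses": 600}

print(f"since start: {hit_ratio(later):.1%}")
print(f"last window: {interval_hit_ratio(earlier, later):.1%}")
```

The windowed figure is usually the more useful one to alert on, since a long-running server's cumulative ratio barely moves when recent traffic changes.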

For the primary sources, see Web Caching and Zipf-like Distributions, Scaling Memcache at Facebook, RFC 9111 HTTP Caching, and the PostgreSQL shared_buffers docs.

Frequently Asked Questions

What is a cache hit ratio?

A cache hit ratio is the share of read requests served from the cache instead of the source behind it, calculated as hits divided by hits plus misses. It is the main health metric for any cache, because average latency is dominated by the misses. With a 1 ms cache in front of a 100 ms source, raising the ratio from 95 to 99 percent cuts average read latency to about a third.

Why does caching work if the cache only holds a fraction of the data?

Because real workloads are skewed. A small set of hot items receives most of the requests, and web popularity in particular follows a Zipf-like distribution, so a cache far smaller than the full dataset can still serve the majority of reads. Measured studies show the achievable hit ratio grows logarithmically with cache size.

What kinds of data should you not cache?

Avoid caching data that must always be current, such as account balances or inventory at checkout, data written far more often than it is read, and per-user private data placed inside a shared cache like a CDN. In those cases a stale or leaked answer costs more than the latency you save.

What is cache invalidation and why is it hard?

Cache invalidation is removing or refreshing a cached copy once the underlying data changes, so readers do not see stale values. It is hard because the cache and the source change independently and the right moment to invalidate is rarely obvious. It is famously called one of the two hard problems in computer science.

First published: June 19, 2026 · Last updated: June 19, 2026

Rabinarayan Patra - Software Development Engineer

SDE II at Amazon. Previously at ThoughtClan Technologies building systems that processed 700M+ daily transactions. I write about Java, Spring Boot, microservices, and the things I figure out along the way.