A cache with a 95 percent hit rate sounds excellent. Do the math and it looks less impressive. If a cache hit takes 1 millisecond and a miss takes 100, a 95 percent hit rate still leaves your average read at 5.95 milliseconds, because the rare misses dominate the total. Push that rate to 99 percent and the average drops to 1.99 milliseconds, roughly three times faster. Caching lives and dies by those last few percent.
This note covers what a cache is, the hit-rate math that decides whether it helps, the memory hierarchy caches live in, and which data is safe to put in one.
What problem does caching solve?
A cache is a small, fast store that holds copies of data close to where it is needed, so repeat reads skip the slow or expensive source behind it. The source might be a disk, a database, a remote API, or a function that takes real CPU to run. The cache trades a little memory and a little staleness for a large drop in latency and load.
The naive approach is to send every read to the source. That works until the source is slow, costly, or capped on throughput, and the same data gets read over and over. A product page that runs the identical database query for every visitor does the same work thousands of times a second, for an answer that barely changes.
Caching works because real workloads are lopsided. A small set of hot items takes most of the requests, and any given item is likely to be read again soon. Web traffic in particular follows a Zipf-like popularity curve. Breslau and colleagues measured this across six proxy traces in 1999 and found that the hit ratio a cache can reach grows only logarithmically with its size. The practical reading is that a cache far smaller than the full dataset can still absorb the majority of the reads.
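To make the lopsidedness concrete, here is a minimal sketch that computes how much traffic the most popular items would receive under an idealized Zipf-like curve. The function name `top_k_share`, the exponent `alpha = 1.0`, and the catalog size of 100,000 items are assumptions chosen for illustration; real traces have their own exponents, and a real cache does not hold exactly the most popular items.

```python
def top_k_share(total_items: int, cached_items: int, alpha: float = 1.0) -> float:
    """Share of requests that go to the `cached_items` most popular items.

    Assumes the item of popularity rank i is requested with probability
    proportional to 1 / i**alpha (a Zipf-like curve), and that the cache
    holds exactly the most popular items. Both are idealizations.
    """
    weights = [1.0 / (rank ** alpha) for rank in range(1, total_items + 1)]
    return sum(weights[:cached_items]) / sum(weights)


if __name__ == "__main__":
    total = 100_000  # hypothetical catalog size
    for cached in (100, 1_000, 10_000):
        share = top_k_share(total, cached)
        print(f"cache holds {cached / total:.1%} of items -> {share:.0%} of reads")
```

Under these assumptions, a cache holding 1 percent of the items sees about 62 percent of the reads, and one holding 10 percent sees about 81 percent. Lower exponents flatten the curve and shrink those shares, so treat the numbers as an upper-end illustration, not a prediction.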
How does a cache actually work?
A read checks the cache first: a hit returns the copy immediately, and a miss falls through to the source, returns that result, and usually stores it for next time.
The read path has four steps:
- A request arrives for a key.
- The cache is checked. If the key is present and still fresh, that is a hit, and the value returns right away.
- If the key is absent or expired, that is a miss. The request fetches from the source.
- The result is stored in the cache with a freshness limit, then returned. The next read of that key is a hit.
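The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production cache: the class name `TTLCache`, the `load` callback standing in for the source, and the injectable `clock` are all assumptions made for the example, and there is no size limit or eviction.

```python
import time
from typing import Callable, Dict, Hashable, Tuple


class TTLCache:
    """Minimal read-through cache with a per-entry freshness limit."""

    def __init__(self, load: Callable[[Hashable], object], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._load = load          # hypothetical function that reads the source
        self._ttl = ttl_seconds    # how long an entry counts as fresh
        self._clock = clock        # injectable so the expiry logic can be tested
        self._entries: Dict[Hashable, Tuple[float, object]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, key: Hashable) -> object:
        now = self._clock()
        entry = self._entries.get(key)
        # Hit: the key is present and its expiry time has not passed.
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        # Miss: absent or expired, so fetch from the source.
        self.misses += 1
        value = self._load(key)
        # Store with a freshness limit so the next read is a hit.
        self._entries[key] = (now + self._ttl, value)
        return value


if __name__ == "__main__":
    def slow_lookup(key):  # stand-in for a database query or API call
        return f"value-for-{key}"

    cache = TTLCache(slow_lookup, ttl_seconds=30.0)
    cache.get("product:42")  # miss, goes to the source
    cache.get("product:42")  # hit, served from the cache
    print(cache.hits, cache.misses)  # 1 1
```

The counters are there so the hit ratio can be read off directly as `hits / (hits + misses)`.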
The number that matters is the hit ratio, the share of reads served from the cache:
hit ratio = hits / (hits + misses)
avg latency = hit_ratio × cache_latency + (1 - hit_ratio) × source_latency
That second line is why the gap between a good and a great hit rate is so large. With a 1 ms cache in front of a 100 ms source, 90 percent gives 10.9 ms, 95 percent gives 5.95 ms, and 99 percent gives 1.99 ms. Every step closer to 100 percent strips out misses, and the misses are where all the time goes.
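The arithmetic is easy to check in code. The sketch below evaluates the average-latency formula for the three hit ratios quoted above, then inverts it to ask what hit ratio a given latency target requires. The function names and the 2 ms target are assumptions made for illustration.

```python
def average_latency_ms(hit_ratio: float, cache_ms: float, source_ms: float) -> float:
    """Average read latency: hits pay cache_ms, misses pay source_ms."""
    return hit_ratio * cache_ms + (1.0 - hit_ratio) * source_ms


def required_hit_ratio(target_ms: float, cache_ms: float, source_ms: float) -> float:
    """Solve the average-latency formula for the hit ratio."""
    return (source_ms - target_ms) / (source_ms - cache_ms)


if __name__ == "__main__":
    for ratio in (0.90, 0.95, 0.99):
        avg = average_latency_ms(ratio, cache_ms=1.0, source_ms=100.0)
        print(f"{ratio:.0%} hit ratio -> {avg:.2f} ms")
    # 90% hit ratio -> 10.90 ms
    # 95% hit ratio -> 5.95 ms
    # 99% hit ratio -> 1.99 ms

    needed = required_hit_ratio(target_ms=2.0, cache_ms=1.0, source_ms=100.0)
    print(f"a 2 ms average needs a {needed:.2%} hit ratio")  # 98.99%
```

Read the other way around, going from 95 to 99 percent cuts the miss rate from 5 percent to 1 percent, which is a fivefold drop in trips to the source.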
Where do caches live?
Caches sit at every layer of a system, each one larger and slower than the one above it. Your CPU keeps L1, L2, and L3 caches in front of main memory, typically working on 64-byte lines. A database keeps a buffer pool of hot pages in RAM so most reads never touch disk. Applications add an in-memory cache such as Redis or Memcached. A CDN caches responses at edge locations near users. The browser keeps its own private cache.
The latency gaps explain the layering. An L1 hit takes about a nanosecond, main memory is around 100 nanoseconds, a same-datacenter round trip is hundreds of microseconds, and a cross-continent round trip is over 100 milliseconds. These ladder figures, popularized by Jeff Dean from earlier numbers by Peter Norvig, are approximate orders of magnitude, but the lesson holds: every layer you avoid saves roughly an order of magnitude of latency, and often several.
| Cache layer | Typical hit latency | Shared by | Example |
|---|---|---|---|
| CPU caches (L1 to L3) | ~1 to 20 ns | One core or socket | 64-byte cache lines |
| Database buffer pool | ~100 ns in RAM vs ~10 ms disk | One database node | Postgres shared_buffers, InnoDB buffer pool |
| In-memory data cache | well under 1 ms in-datacenter | Many services | Redis, Memcached |
| CDN edge cache | tens of ms, near the user | Users in a region | Cloudflare, Fastly |
| Browser private cache | local, near-zero network | One user | HTTP Cache-Control |
When should you not cache something?
Skip the cache when data must always be current, when it changes about as often as it is read, or when it is unique to one request, because the cache will either lie or never get reused.
- Reads that must be correct. Account balances, inventory at checkout, and authorization decisions cannot serve a stale value. Read from the source of truth, or pair the cache with explicit invalidation and accept the extra complexity.
- Write-heavy or rarely reused data. If each item is read once or twice before it changes, the cache pays storage and invalidation cost for almost no hits. Drop the cache and size the source for the load instead.
- Private data in a shared cache. Putting a personalized response in a CDN or proxy can hand one user's data to another. RFC 9111 reserves that case for a private cache, so mark shared responses public and personalized ones private.
- A single hot key, not a broad key space. One viral item still lands on one cache entry on one node. Replicate the hot key across nodes or coalesce duplicate requests rather than caching harder.
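For the private-data case, the choice comes down to one directive in the Cache-Control header. The helper below is a small sketch of that decision. The function name and the max-age values are assumptions for illustration, while `public`, `private`, and `max-age` are directives defined in RFC 9111.

```python
def cache_control_header(personalized: bool, max_age_seconds: int) -> str:
    """Build a Cache-Control value for a response.

    `private` limits storage to a single user's cache (such as the browser);
    `public` allows shared caches such as a CDN or proxy to store it.
    """
    scope = "private" if personalized else "public"
    return f"{scope}, max-age={max_age_seconds}"


if __name__ == "__main__":
    # Same page for every visitor: safe for a shared cache.
    print("Cache-Control:", cache_control_header(False, 300))  # public, max-age=300
    # Contains one user's data: keep it out of shared caches.
    print("Cache-Control:", cache_control_header(True, 60))    # private, max-age=60
```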
Two failure modes bite even when caching is the right call. The first is stale reads after the source changes, which is the cache invalidation problem. The second is the cache stampede, where a popular key expires and a flood of requests hits the source at once. Both have standard fixes, and each gets its own note.
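One common defense against a stampede is request coalescing: when many callers miss on the same key at once, only the first goes to the source and the rest wait for its result. The sketch below shows the idea with threads in a single process. The class name `SingleFlight` and its `do` method are assumptions for illustration, and a distributed cache would need a cross-node equivalent such as a lock or lease.

```python
import threading
import time
from typing import Callable, Dict, Hashable


class _Call:
    """Bookkeeping for one in-flight load of a key."""

    def __init__(self) -> None:
        self.done = threading.Event()
        self.value = None
        self.error = None


class SingleFlight:
    """Coalesce concurrent loads of the same key into one source call."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._in_flight: Dict[Hashable, _Call] = {}

    def do(self, key: Hashable, load: Callable[[], object]) -> object:
        with self._lock:
            call = self._in_flight.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._in_flight[key] = call
        if leader:
            # Only the first caller for this key reaches the source.
            try:
                call.value = load()
            except Exception as exc:
                call.error = exc
            finally:
                with self._lock:
                    del self._in_flight[key]
                call.done.set()  # wake every waiting follower
        else:
            call.done.wait()     # followers block until the leader finishes
        if call.error is not None:
            raise call.error
        return call.value


if __name__ == "__main__":
    flights = SingleFlight()
    source_calls = []

    def slow_source():  # stand-in for the expensive query behind the cache
        source_calls.append(1)
        time.sleep(0.2)
        return "fresh value"

    threads = [threading.Thread(target=flights.do, args=("hot-key", slow_source))
               for _ in range(20)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("callers:", len(threads), "source calls:", len(source_calls))
```

The entry is removed as soon as the load finishes, so this only deduplicates overlapping requests. Storing the result for later reads is still the cache's job.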
How do real systems use caching?
Every layer of production infrastructure runs a cache, usually several.
- CPU caches. Modern processors front main memory with L1, L2, and L3 caches, typically on 64-byte lines, each level larger and slower than the last. This is the textbook memory hierarchy from Hennessy and Patterson, and it is why data layout changes how fast code runs.
- Redis and Memcached. The two default in-memory caches. Redis says it can be used "as a database, cache, streaming engine, message broker, and more," and exposes keyspace_hits and keyspace_misses counters so you can watch the ratio. Memcached calls itself a "high-performance, distributed memory object caching system."
- Meta's memcache. The "Scaling Memcache at Facebook" paper (Nishtala and colleagues, USENIX NSDI 2013) describes a memcached-based key-value layer that "handles billions of requests per second and holds trillions of items." It is the canonical write-up of caching at scale.
- CDNs. Cloudflare and Fastly cache responses at edge locations near users and label each one HIT or MISS. Fastly's own docs tell operators to "aim for a 90%+ cache hit ratio," all driven by the HTTP Cache-Control rules in RFC 9111.
- Database buffer pools. Databases cache hot pages in RAM so most reads never reach disk. PostgreSQL's shared_buffers defaults to 128 MB, with a common starting point of 25 percent of system RAM, and MySQL's InnoDB buffer pool also defaults to 128 MB and reports a "Buffer pool hit rate" you can monitor.
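As an example of watching the ratio, the sketch below computes a hit ratio from the keyspace_hits and keyspace_misses counters in the name:value text that the Redis INFO command returns. The function name and the sample numbers are invented for illustration, and the text is parsed directly so the example runs without a Redis server or client library.

```python
def hit_ratio_from_info(info_text: str) -> float:
    """Compute hits / (hits + misses) from Redis INFO-style `name:value` lines."""
    stats = {}
    for line in info_text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Stats".
        if not line or line.startswith("#") or ":" not in line:
            continue
        name, _, value = line.partition(":")
        stats[name] = value
    hits = int(stats["keyspace_hits"])
    misses = int(stats["keyspace_misses"])
    total = hits + misses
    # With no reads yet there is no ratio to report; return 0.0 by convention.
    return hits / total if total else 0.0


if __name__ == "__main__":
    # Invented sample in the INFO text format, not real server output.
    sample = "# Stats\r\nkeyspace_hits:9900\r\nkeyspace_misses:100\r\n"
    print(f"hit ratio: {hit_ratio_from_info(sample):.2%}")  # hit ratio: 99.00%
```

Both counters are cumulative, so a ratio computed this way covers the whole period the counters have been accumulating. To see the current ratio, sample twice and divide the differences.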
For the primary sources, see Web Caching and Zipf-like Distributions, Scaling Memcache at Facebook, RFC 9111 HTTP Caching, and the PostgreSQL shared_buffers docs.
Keep Reading
- Consistent Hashing. How a distributed cache spreads keys across many nodes so adding one server does not flush everything.
- The One System Design Question That Failed 80% of Candidates in 2025. Caching decisions are exactly where that interview answer is won or lost.
- Why 1 == 1 is True but 128 == 128 is False in Java. The JVM's Integer cache is the same hit and miss idea shrunk down to the language runtime.