hi i'm prasad firame.

i'm just a guy who builds stuff.

been here for 22.348946165 years.


i love backend engineering, systems, and architectures.

love tinkering around with llms and whatever ai stuff is coming next.

currently learning about distributed systems and infra.


in my free time, i read and write blog posts about architecture, running ai workloads in the cloud, and mlops.


Lock-Free LRU Cache

november 2024

caching is one of those fundamental patterns in software engineering that everyone implements at some point. the concept is simple: keep frequently accessed data close by so you don't have to fetch it from slower storage every time. but as with most things in distributed systems and concurrent programming, the devil is in the details.

the least recently used (lru) cache is a popular eviction policy. when the cache is full and you need to make room for new data, you evict the item that hasn't been accessed in the longest time. it's intuitive and works well in practice for many workloads.

the traditional approach

most lru cache implementations use a combination of a hash map for o(1) lookups and a doubly linked list to track access order. when you access an item, you move it to the front of the list. when you need to evict, you remove from the back.

this works great in single-threaded environments. but what happens when multiple threads are trying to read and write to the cache simultaneously? the naive approach is to wrap everything in a mutex. every operation—reads, writes, updates—requires acquiring a lock.
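
to make that concrete, here's a minimal go sketch of the classic design, a map plus a doubly linked list, with the naive single mutex around everything. the names (lruCache, entry, and so on) are mine, not from any library:

```go
package cache

import (
	"container/list"
	"sync"
)

// lruCache is the textbook layout: a map for o(1) lookups plus a doubly
// linked list ordered by recency, all guarded by one mutex.
type lruCache struct {
	mu       sync.Mutex
	capacity int
	items    map[string]*list.Element
	order    *list.List // front = most recently used, back = eviction candidate
}

type entry struct {
	key   string
	value any
}

func newLRUCache(capacity int) *lruCache {
	return &lruCache{
		capacity: capacity,
		items:    make(map[string]*list.Element),
		order:    list.New(),
	}
}

func (c *lruCache) Get(key string) (any, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	el, ok := c.items[key]
	if !ok {
		return nil, false
	}
	c.order.MoveToFront(el) // touch: mark as most recently used
	return el.Value.(*entry).value, true
}

func (c *lruCache) Put(key string, value any) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).value = value
		c.order.MoveToFront(el)
		return
	}
	if c.order.Len() >= c.capacity {
		oldest := c.order.Back() // least recently used lives at the back
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
	c.items[key] = c.order.PushFront(&entry{key: key, value: value})
}
```

note that every Get takes the same lock as every Put, which is exactly where the trouble starts.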

the problem with locks is contention. when you have high concurrency, threads spend more time waiting for the lock than doing useful work. a cache hit ends up costing as much as a miss once you add the time spent queueing for the lock. you've essentially serialized access to what should be a fast data structure.

why go lock-free?

lock-free data structures use atomic operations and compare-and-swap (cas) instructions instead of locks. the key insight is that you can make progress without mutual exclusion. at least one thread will always make progress, even if others are delayed or fail.

for a cache specifically, this means:

• reads don't block writes

• writes don't block reads

• no thread holds a global lock that stops everyone else

the performance characteristics are dramatically different. instead of every additional thread just lengthening the queue behind a single lock, you get much better scaling. it's not perfect, since cas operations can fail and retry under contention, but it's far better than blocking.
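
the basic building block is a cas retry loop: read the current value, compute the new one, try to swap it in, and retry if someone beat you to it. a tiny go sketch (recordMaxLatency is just an illustrative helper, not part of any cache api):

```go
package cache

import "sync/atomic"

// recordMaxLatency keeps a running maximum without a lock. if another
// goroutine updates the value between our Load and our CompareAndSwap,
// the cas fails and we simply retry against the fresher value.
func recordMaxLatency(maxNanos *atomic.Int64, observed int64) {
	for {
		cur := maxNanos.Load()
		if observed <= cur {
			return // current max is already larger, nothing to do
		}
		if maxNanos.CompareAndSwap(cur, observed) {
			return // our update won the race
		}
		// lost the race: loop and re-read the new maximum
	}
}
```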

the challenge

implementing a lock-free lru cache is tricky. the traditional doubly-linked list approach doesn't translate well to lock-free because updating both forward and backward pointers atomically is hard. you'd need to cas several separate memory locations at once, which commodity hardware doesn't support (double-width cas only covers adjacent words).

so you need to rethink the data structure. some approaches use lock-free hash tables with approximate lru tracking. you don't maintain perfect lru order—you maintain good enough order. this is actually fine for most use cases because caching is fundamentally about probabilistic optimization anyway.
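
here's roughly what that can look like in go, with my own names. sync.Map isn't strictly lock-free, but its read path mostly avoids locking, and the eviction below samples a handful of entries and drops the stalest one, in the spirit of redis's approximated lru:

```go
package cache

import (
	"math"
	"sync"
	"sync/atomic"
	"time"
)

// approxEntry carries its own last-access timestamp, so recording a hit
// is one atomic store instead of moving a node in a shared list.
type approxEntry struct {
	value      any
	lastAccess atomic.Int64 // unix nanos of the most recent access
}

func getApprox(m *sync.Map, key string) (any, bool) {
	v, ok := m.Load(key)
	if !ok {
		return nil, false
	}
	e := v.(*approxEntry)
	e.lastAccess.Store(time.Now().UnixNano()) // recency update, no lock
	return e.value, true
}

// evictOneSampled looks at up to sampleSize entries and drops the
// stalest one it saw: not perfect lru order, just good enough order.
func evictOneSampled(m *sync.Map, sampleSize int) {
	var (
		oldestKey any
		oldest    int64 = math.MaxInt64
		seen      int
	)
	m.Range(func(k, v any) bool {
		if ts := v.(*approxEntry).lastAccess.Load(); ts < oldest {
			oldest, oldestKey = ts, k
		}
		seen++
		return seen < sampleSize // stop sampling once we've seen enough
	})
	if oldestKey != nil {
		m.Delete(oldestKey)
	}
}
```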

another approach is to segment the cache. instead of one global lru list, you have multiple segments, each with their own lru tracking. this reduces contention because threads are less likely to compete for the same segment. it's not fully lock-free, but it's lock-reduced, which can be good enough.
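
the segmented version is easy to sketch on top of the locked lruCache from earlier: hash the key to pick a shard, and only that shard's lock is ever touched.

```go
package cache

import "hash/fnv"

// shardedCache splits the keyspace across independent lru segments, each
// with its own mutex, so goroutines hitting different shards never
// contend with each other. reuses the lruCache sketch from above.
type shardedCache struct {
	shards []*lruCache
}

func newShardedCache(numShards, perShardCapacity int) *shardedCache {
	s := &shardedCache{shards: make([]*lruCache, numShards)}
	for i := range s.shards {
		s.shards[i] = newLRUCache(perShardCapacity)
	}
	return s
}

// shardFor hashes the key to pick a segment.
func (s *shardedCache) shardFor(key string) *lruCache {
	h := fnv.New32a()
	h.Write([]byte(key))
	return s.shards[h.Sum32()%uint32(len(s.shards))]
}

func (s *shardedCache) Get(key string) (any, bool) { return s.shardFor(key).Get(key) }
func (s *shardedCache) Put(key string, value any)  { s.shardFor(key).Put(key, value) }
```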

design considerations

when building a lock-free lru cache, you need to think about several things:

memory ordering: lock-free code is sensitive to memory ordering. you need to understand acquire/release semantics and memory barriers. different architectures have different memory models. what works on x86 might not work on arm.
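
go is a bit of a special case here: it doesn't let you choose acquire/release orderings the way c++ or rust do, and its sync/atomic operations behave as sequentially consistent under the go memory model. what you still have to get right is publication, i.e. making sure a reader who sees a new pointer also sees the data behind it. a tiny sketch:

```go
package cache

import "sync/atomic"

type node struct {
	key   string
	value any
}

// head is published through an atomic pointer. the atomic store plays
// the role of a release and the atomic load the role of an acquire: a
// reader that observes the new pointer is guaranteed to observe the
// fully initialized node behind it. a plain pointer write here would be
// a data race in go.
var head atomic.Pointer[node]

func publish(key string, value any) {
	n := &node{key: key, value: value} // initialize first...
	head.Store(n)                      // ...then publish atomically
}

func read() (string, any, bool) {
	n := head.Load()
	if n == nil {
		return "", nil, false
	}
	return n.key, n.value, true
}
```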

aba problem: this is a classic lock-free gotcha. you read a value, another thread changes it and then changes it back, and your cas succeeds even though the state changed. you need tagged pointers or hazard pointers to handle this.
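
in a garbage-collected language like go the classic pointer aba is rarer, because memory can't be recycled while someone still holds a reference, but the tagged-word idea is easy to sketch: pack a version counter next to the value so a change to b and back to a still fails a stale cas. purely illustrative, not library code:

```go
package cache

import "sync/atomic"

// versioned packs a 32-bit value and a 32-bit version into one atomic
// word. every write bumps the version, so even if the value is restored
// to what it was, the word as a whole has changed and a stale cas fails.
type versioned struct {
	word atomic.Uint64
}

func pack(value, version uint32) uint64       { return uint64(version)<<32 | uint64(value) }
func unpack(w uint64) (value, version uint32) { return uint32(w), uint32(w >> 32) }

// Read returns the value plus the snapshot needed for a later cas.
func (v *versioned) Read() (value uint32, snapshot uint64) {
	w := v.word.Load()
	val, _ := unpack(w)
	return val, w
}

// TryUpdate succeeds only if nothing was written since snapshot was
// taken, even if the raw value happens to match again in the meantime.
func (v *versioned) TryUpdate(snapshot uint64, newValue uint32) bool {
	_, ver := unpack(snapshot)
	return v.word.CompareAndSwap(snapshot, pack(newValue, ver+1))
}
```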

reclamation: when you remove a node from the cache, you can't just free it immediately because another thread might still be reading it. you need a safe memory reclamation strategy like epoch-based reclamation or hazard pointers.

fairness: lock-free doesn't mean wait-free. some threads might keep retrying while others make progress. you need to think about starvation and fairness.

practical implementation

in practice, i've found that hybrid approaches work well. you can use lock-free techniques for the hot path (reads) and fall back to locks for less frequent operations (eviction, resizing).

for example, you might have a lock-free hash table for lookups, but when you need to evict items, you acquire a lock to update the lru metadata. most cache workloads are read-heavy, so this gives you lock-free performance where it matters most.
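
a rough go sketch of that split, with my own names: sync.Map for the (mostly) lock-free read path, an atomic timestamp per entry for recency, and a plain mutex that only the eviction path ever takes:

```go
package cache

import (
	"math"
	"sync"
	"sync/atomic"
	"time"
)

// hybridCache: lock-free-ish reads, locked eviction.
type hybridCache struct {
	entries  sync.Map // string -> *hybridEntry
	evictMu  sync.Mutex
	size     atomic.Int64
	capacity int64
}

type hybridEntry struct {
	value      any
	lastAccess atomic.Int64
}

func newHybridCache(capacity int64) *hybridCache {
	return &hybridCache{capacity: capacity}
}

// Get never touches the eviction lock.
func (c *hybridCache) Get(key string) (any, bool) {
	v, ok := c.entries.Load(key)
	if !ok {
		return nil, false
	}
	e := v.(*hybridEntry)
	e.lastAccess.Store(time.Now().UnixNano())
	return e.value, true
}

func (c *hybridCache) Put(key string, value any) {
	e := &hybridEntry{value: value}
	e.lastAccess.Store(time.Now().UnixNano())
	if _, loaded := c.entries.LoadOrStore(key, e); loaded {
		c.entries.Store(key, e) // overwrite existing key, size unchanged
		return
	}
	if c.size.Add(1) > c.capacity {
		c.evict()
	}
}

// evict is the slow path: one goroutine at a time scans for the stalest
// entries under the lock while readers carry on untouched.
func (c *hybridCache) evict() {
	c.evictMu.Lock()
	defer c.evictMu.Unlock()
	for c.size.Load() > c.capacity {
		var oldestKey any
		oldest := int64(math.MaxInt64)
		c.entries.Range(func(k, v any) bool {
			if ts := v.(*hybridEntry).lastAccess.Load(); ts < oldest {
				oldest, oldestKey = ts, k
			}
			return true
		})
		if oldestKey == nil {
			return
		}
		c.entries.Delete(oldestKey)
		c.size.Add(-1)
	}
}
```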

another trick is to use thread-local caches with a shared backing cache. each thread has its own small cache that requires no synchronization. only cache misses go to the shared cache. this reduces contention significantly.
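
go doesn't have thread locals, but the same idea works per worker goroutine: each worker owns a tiny unsynchronized map and only falls back to the shared cache (the sharded sketch from earlier) on a miss. the obvious tradeoff is staleness, since nothing invalidates the private copies:

```go
package cache

// worker owns a small private cache that needs no synchronization,
// because only this goroutine ever touches it.
type worker struct {
	local  map[string]any // single-owner: no atomics, no locks
	shared *shardedCache  // the concurrent cache from the earlier sketch
}

func newWorker(shared *shardedCache) *worker {
	return &worker{local: make(map[string]any), shared: shared}
}

func (w *worker) get(key string) (any, bool) {
	if v, ok := w.local[key]; ok {
		return v, true // fast path: plain map lookup
	}
	v, ok := w.shared.Get(key)
	if ok && len(w.local) < 256 { // keep the private cache tiny
		w.local[key] = v
	}
	return v, ok
}
```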

benchmarking matters

you can't know if your lock-free cache is actually faster without benchmarking. and not just microbenchmarks—you need to test with realistic workloads that match your production patterns.

things to measure:

• throughput at different thread counts

• latency percentiles (p50, p99, p999)

• cache hit rate

• memory overhead

sometimes a simpler locked cache with good cache line awareness and padding will outperform a fancy lock-free implementation. the only way to know is to measure.
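
with go's testing package this is cheap to set up. run something like go test -bench . -cpu 1,4,16 against the benchmark below to see throughput at different levels of parallelism (latency percentiles need a separate harness). the 90/10 read/write mix is a placeholder, so match it to your real traffic:

```go
package cache

import (
	"strconv"
	"sync/atomic"
	"testing"
)

func BenchmarkCacheMixed(b *testing.B) {
	c := newShardedCache(16, 1024) // 16 shards, 1024 entries each
	for i := 0; i < 8192; i++ {
		c.Put("key-"+strconv.Itoa(i), i)
	}
	var hits atomic.Int64
	b.ResetTimer()
	b.RunParallel(func(pb *testing.PB) {
		i := 0
		for pb.Next() {
			key := "key-" + strconv.Itoa(i%8192)
			if i%10 == 9 {
				c.Put(key, i) // 10% writes
			} else if _, ok := c.Get(key); ok {
				hits.Add(1) // 90% reads, count the hits
			}
			i++
		}
	})
	// report hit rate alongside the default ns/op numbers
	b.ReportMetric(float64(hits.Load())/float64(b.N), "hits/op")
}
```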

existing solutions

you don't always need to roll your own. there are good existing implementations:

caffeine (java): a high-performance caching library built around window tinylfu (w-tinylfu), where a tinylfu frequency sketch acts as an admission filter in front of segmented lru eviction. it's not fully lock-free but achieves excellent concurrent performance.

folly concurrent hash map (c++): facebook's folly library has a concurrent hash map that can be combined with lru tracking. it uses fine-grained locking and lock-free techniques.

concurrent lru in rust: the rust ecosystem has several implementations that leverage rust's ownership system to provide safe concurrent access.

when to use lock-free

lock-free isn't always the answer. here's when it makes sense:

high contention: when you have many threads competing for cache access, lock-free can significantly reduce contention.

predictable latency: locks can cause priority inversion and unpredictable delays. lock-free gives you better tail latencies.

read-heavy workloads: if most operations are reads, lock-free reads give you massive parallelism.

when to avoid it:

complexity isn't worth it: if your cache isn't a bottleneck, the added complexity of lock-free code isn't justified.

write-heavy workloads: lots of writes mean lots of cas retries. a well-designed locked implementation might actually be faster.

simple requirements: if you just need a cache for a low-traffic service, use the standard library's map with a mutex. done.

lessons learned

building lock-free data structures taught me several things:

first, correctness is hard. lock-free code has subtle bugs that only show up under high concurrency or specific race conditions. you need extensive testing with thread sanitizers and stress tests.
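
in go that mostly means the race detector (which is built on thread sanitizer) plus stress tests that deliberately overlap keys across many goroutines, run with something like go test -race -count=10:

```go
package cache

import (
	"strconv"
	"sync"
	"testing"
)

// TestCacheRaces hammers the cache from many goroutines with overlapping
// keys so the race detector gets a chance to see bad interleavings.
func TestCacheRaces(t *testing.T) {
	c := newShardedCache(8, 128)
	var wg sync.WaitGroup
	for g := 0; g < 32; g++ {
		wg.Add(1)
		go func(g int) {
			defer wg.Done()
			for i := 0; i < 10000; i++ {
				key := "key-" + strconv.Itoa((g+i)%512)
				if i%3 == 0 {
					c.Put(key, i)
				} else {
					c.Get(key)
				}
			}
		}(g)
	}
	wg.Wait()
}
```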

second, performance is hardware-dependent. what's fast on your laptop might be slow on the production server with different cpu architecture and cache topology.

third, documentation matters. lock-free code is hard to understand. future you (or your teammates) will thank you for explaining the invariants and memory ordering requirements.

fourth, there's no silver bullet. sometimes the best solution is a boring mutex with careful lock granularity. premature optimization and all that.

final thoughts

lock-free lru caches are a fascinating intersection of systems programming, concurrent algorithms, and performance engineering. they're not always necessary, but when you need them, they can unlock significant performance gains.

the key is understanding the tradeoffs. lock-free doesn't mean faster—it means different performance characteristics. you trade algorithmic complexity for reduced contention. whether that's worth it depends on your specific use case.

if you're building a high-performance cache for a latency-sensitive system with high concurrency, investigating lock-free approaches is worth your time. start simple, measure everything, and optimize based on data.

and remember: the best cache is often no cache at all. sometimes the right answer is to make the underlying operation fast enough that you don't need caching. but when you do need a cache, and it needs to be concurrent, lock-free techniques are a powerful tool in your arsenal.

go build something fast.

this is a living library of quotes, anime, manga, songs, movies, and engineers i admire.

i add new things over time.

the pieces in red either hit me hard or are things i really want.

quotes i live by

"Stay hungry, stay foolish" — Steve Jobs
"Move fast and break things" — Mark Zuckerberg
"Make it work, make it right, make it fast" — Kent Beck
"The best way to predict the future is to invent it" — Alan Kay
"Talk is cheap. Show me the code" — Linus Torvalds
"Simplicity is prerequisite for reliability" — Edsger Dijkstra
"First, solve the problem. Then, write the code" — John Johnson

anime i love

One Piece
Vinland Saga

manga i read

Berserk
Vagabond

songs i like

Avicii - Wake Me Up
Illenium - Good Things Fall Apart
The Chainsmokers - Closer

movies i love

The Social Network (120m)
Interstellar (169m)
The Imitation Game (114m)

engineers i admire

Linus Torvalds
DHH (David Heinemeier Hansson)
John Carmack
Jeff Dean