👋 Hey there,
Cache invalidation isn’t “hard,” it’s vague
People like to joke that cache invalidation is one of the two hard things in software. I am not sure it is hard so much as fuzzy. We rarely write down who can tolerate being wrong, by how much, and for how long. So we guess a TTL, someone bumps it during an incident, and a month later, nobody remembers why the numbers feel off. I have done that. More than once.
This is a calmer route. Not perfect. Practical. You will get a small decision card you can paste in a repo, a few defaults that hold up in production, and a couple of places where I tripped before you have to. It also connects to earlier lessons. Backpressure helps when serving stale is the safest answer. SLOs turn staleness into a contract. Idempotency makes background refreshes safe.
🧭 Mindset Shift
From cache as a speed knob with a guessed TTL
To cache as an explicit staleness contract tied to what a user feels
A cache is not magic. It is a promise with a clock on it. The real question is not “how long should the TTL be”. It is “who can tolerate being slightly wrong, by how much, and for how long”. Once you say that out loud, choices get less dramatic.
Helpful inversions to keep nearby
• Hit ratio is not the hero. Staleness you can explain is. A lovely ninety-five percent hit ratio can still feel terrible if the five percent of misses stampede the origin. Track age buckets and how many regenerations you coalesced.
• Global TTLs create quiet lies. Per-flow staleness budgets make sense. A product tile can be thirty to sixty seconds stale, and nobody blinks. Cart totals cannot. Write two numbers. Adjust later.
• “Purge everything on write” feels safe and burns money. Event or webhook-driven invalidation is kinder. Keep a small TTL as a guardrail.
• Blocking everywhere for fresh data slows the whole page. Stale-while-revalidate is the default for many views. For money and trust areas, such as cart totals, stay strict and skip SWR.
• Fuzzy keys cause crosstalk. Keys should match reality. Tenant, locale, currency, device class, and personalization flags. If it changes the view, it changes the key.
• Caching errors forever laminates a shrug. Brief negative caching with jittered retries quiets hot misses without hiding real issues.
You will still be wrong sometimes. That is fine. The win is moving from accidental staleness to intentional and measured staleness that you can show on a dashboard and revise without drama.
🧰 One Tool: Cache Decision Card
Keep this as a single page next to each endpoint or dataset. Fill the blanks. If a field feels hard, that is the conversation you needed.
1) Data or endpoint
Write it plainly. Examples include products by id, search results for a query, user profile header.
2) User impact if stale
Two or three lines. What would a person notice? A price outdated by thirty seconds may be fine on a catalog page. Inventory must be current in the cart.
3) Staleness budget
Pick numbers. For example, ninety-five percent of reads are no older than two seconds, and ninety-nine percent are no older than thirty seconds. Add a hard constraint, such as promotions reflected within five seconds during checkout.
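A budget written as percentiles can be checked against observed response ages. Here is a minimal sketch of that check. The names `percentile`, `meets_budget`, and `BUDGET` are mine, and the nearest-rank percentile is just one reasonable choice, so treat it as an illustration rather than the way to do it.

```python
import math

def percentile(ages, pct):
    """Nearest-rank percentile of a non-empty list of ages in seconds."""
    ordered = sorted(ages)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

def meets_budget(ages, budget):
    """budget maps a percentile to the maximum age allowed at that percentile."""
    return all(percentile(ages, pct) <= limit for pct, limit in budget.items())

# The example numbers from the card: p95 no older than 2 s, p99 no older than 30 s.
BUDGET = {95: 2.0, 99: 30.0}
```

Feed it the ages you already log per response, and the budget stops being a sentence in a doc and becomes a check you can run.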
4) Strategy
Choose one to start
A. TTL only. Serve from cache until TTL expires, then regenerate synchronously.
B. Stale-while-revalidate. If an entry has expired, serve stale immediately and refresh in the background.
C. Event-driven invalidation. Invalidate on write, webhook, or changefeed, with a small TTL as a backstop.
Optional add-on. Soft TTL plus SWR, where a short soft TTL guards recency and a longer hard TTL keeps you safe if events go missing.
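To make the soft TTL plus SWR option concrete, here is a small sketch under a few assumptions: the cache is a plain dict, the caller passes in the current time, and `schedule_refresh` is a hypothetical hook that queues a background refresh however your stack does that. None of these names come from a real library.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Entry:
    value: Any
    stored_at: float  # clock reading in seconds when the value was cached

def classify(entry, now, soft_ttl, hard_ttl):
    """Return 'miss', 'fresh', 'stale' (servable), or 'expired' (too old to serve)."""
    if entry is None:
        return "miss"
    age = now - entry.stored_at
    if age <= soft_ttl:
        return "fresh"
    if age <= hard_ttl:
        return "stale"
    return "expired"

def get(cache, key, now, load, schedule_refresh, soft_ttl=15.0, hard_ttl=300.0):
    entry = cache.get(key)
    state = classify(entry, now, soft_ttl, hard_ttl)
    if state == "fresh":
        return entry.value
    if state == "stale":
        schedule_refresh(key)  # refresh happens elsewhere; the reader is not blocked
        return entry.value
    # Miss or past the hard TTL: regenerate synchronously, like strategy A.
    value = load(key)
    cache[key] = Entry(value, now)
    return value
```

Setting `soft_ttl` equal to `hard_ttl` collapses this into TTL only, which is a handy way to switch SWR off for strict flows.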
5) Keys
Write the exact structure. Include every dimension that changes the view.
Example key format
cache:{tenant_id}:product:{id}:currency:{code}:locale:{lang}
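A key format like this is easier to keep honest when one helper builds it. The sketch below is my own illustration: `cache_key` is a hypothetical name, and normalizing currency to upper case and locale to lower case is an assumption you may not want.

```python
def cache_key(tenant_id, product_id, currency, locale):
    """Build a product cache key that includes every dimension that changes the view."""
    parts = [
        "cache", tenant_id,
        "product", product_id,
        "currency", str(currency).upper(),  # assumption: "eur" and "EUR" are the same view
        "locale", str(locale).lower(),      # assumption: locale case does not matter
    ]
    parts = [str(p) for p in parts]
    for part in parts:
        # An empty part or a stray ":" would make two different views share a key.
        if not part or ":" in part:
            raise ValueError(f"invalid key part: {part!r}")
    return ":".join(parts)
```

For example, `cache_key("t1", 42, "eur", "de-DE")` gives `cache:t1:product:42:currency:EUR:locale:de-de`.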
6) Invalidation triggers
What evicts. Writes. Provider webhook. Price change topic. Who publishes it and where.
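Here is a rough sketch of what a trigger handler could look like. The event shape (`tenant_id`, `product_id`) and the function names are hypothetical, and it assumes keys laid out as cache:tenant:product:id:... in a plain dict. It shows targeted eviction, not any particular broker or cache API.

```python
def keys_for_event(event, known_keys):
    """Return the cached keys that belong to the product named in the event."""
    # The trailing ":" keeps product 4 from also matching product 42.
    prefix = f"cache:{event['tenant_id']}:product:{event['product_id']}:"
    return [key for key in known_keys if key.startswith(prefix)]

def handle_event(cache, event):
    """Evict only the affected keys and report which ones were dropped."""
    evicted = keys_for_event(event, list(cache))
    for key in evicted:
        del cache[key]
    return evicted
```

Scanning every key is fine for a sketch. With a real store you would more likely keep an index of keys per product so eviction does not walk the whole cache.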
7) Negative caching
Will you cache not found or soft errors? Use a short TTL, such as five to thirty seconds, to avoid hot miss stampedes.
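A brief negative cache can be as small as a dict of expiry times. This is a sketch with assumed names (`NegativeCache`, `remember_miss`, `is_known_miss`) and an assumed jitter of plus or minus twenty percent, so that hot misses do not all come due at the same instant.

```python
import random

class NegativeCache:
    """Remembers recent misses for a short, jittered time."""

    def __init__(self, ttl=10.0, jitter=0.2, rng=random.random):
        self.ttl = ttl
        self.jitter = jitter
        self.rng = rng          # injectable so tests can be deterministic
        self._expires = {}      # key -> time after which we retry the origin

    def remember_miss(self, key, now):
        # Scale the TTL by a factor in [1 - jitter, 1 + jitter).
        factor = 1 + self.jitter * (2 * self.rng() - 1)
        self._expires[key] = now + self.ttl * factor

    def is_known_miss(self, key, now):
        expires = self._expires.get(key)
        if expires is None:
            return False
        if now >= expires:
            del self._expires[key]  # window over: let the next read hit the origin
            return False
        return True
```

Keep the TTL short. The point is to quiet a hot 404, not to hide a record that was created a second later.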
8) Refresh behavior and coalescing
Will you use single-flight so only one worker regenerates and others wait or serve stale? What is the timeout for regeneration? Do retries use backoff with jitter?
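If single-flight is new to you, this is roughly the shape of it, using threads. It is a sketch, not a production library: `SingleFlight`, `do`, and `backoff_delay` are names I made up, followers here wait rather than serve stale, and the backoff is the common "full jitter" variant.

```python
import random
import threading

class _Call:
    def __init__(self):
        self.done = threading.Event()
        self.result = None
        self.error = None

class SingleFlight:
    """Lets one caller per key run the regeneration while the others wait for it."""

    def __init__(self):
        self._lock = threading.Lock()
        self._calls = {}  # key -> in-flight _Call

    def do(self, key, fn, wait_timeout=None):
        with self._lock:
            call = self._calls.get(key)
            leader = call is None
            if leader:
                call = self._calls[key] = _Call()
        if leader:
            try:
                call.result = fn()
            except Exception as exc:
                call.error = exc  # followers see the same failure
            finally:
                with self._lock:
                    del self._calls[key]
                call.done.set()
        elif not call.done.wait(wait_timeout):
            raise TimeoutError(f"gave up waiting for regeneration of {key!r}")
        if call.error is not None:
            raise call.error
        return call.result

def backoff_delay(attempt, base=0.1, cap=5.0, rng=random.random):
    """Full-jitter backoff: a random delay between 0 and min(cap, base * 2**attempt)."""
    return rng() * min(cap, base * 2 ** attempt)
```

The `wait_timeout` is where the card's "timeout for regeneration" question shows up for followers. A follower that times out is a good candidate for serving stale instead of failing.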
9) Fallback when the source is down
Maximum stale window you will serve. What to show if no entry exists. Skeleton UI. Cached sidebar. A short message that you will email later. Honest beats wrong.
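The fallback rule fits in a few lines. This sketch assumes you already know the cached value and its age, and that `load` raises when the source is down. The function name and the three source labels are mine.

```python
def read_with_fallback(load, cached, age_s, max_stale_s=600.0):
    """Return (value, source) where source is 'origin', 'stale', or 'placeholder'."""
    try:
        return load(), "origin"
    except Exception:
        # Deliberately broad for the sketch: any origin failure triggers the fallback.
        if cached is not None and age_s <= max_stale_s:
            return cached, "stale"
        # Nothing usable: the caller shows a skeleton or an honest message.
        return None, "placeholder"
```

Returning the source alongside the value makes it easy to count how often you lean on the stale window, which is one of the things worth alerting on.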
10) Placement
Decide which layer holds what. Browser. CDN or edge. Gateway. Service memory or Redis. Database cache or materialized view. One sentence per layer so the future you remembers.
11) Observability
Link the dashboard. Show hit ratio, staleness distribution, coalesced requests, stampede rate, refresh errors, and evictions. Alert if stale-window usage exceeds a chosen threshold for a sustained period.
12) Owner and review date
A name and a date. Caches drift. Make the review explicit.
✅ Defaults that usually help
• SWR for read-heavy and non-critical UI is a good starting place. People see something quickly, and the origin breathes.
• TTL only fits places where correctness should pause the world. Keep TTL small and timeouts tight.
• Event-based invalidation shines for price and policy. If a webhook says price changed, believe it and evict quickly.
Starter numbers to try before you tune
• Catalog or listing tiles. SWR with a soft TTL between fifteen and sixty seconds and a stale window of five to ten minutes.
• Detail pages without price. Soft TTL five to fifteen seconds, and hard TTL two to ten minutes. Event invalidation if available.
• Price and inventory snippets. Event invalidation and a soft TTL of one to five seconds. No SWR during checkout.
• Search autosuggest. TTL only, one to three seconds, with tight single-flight.
• Analytics tiles. SWR measured in minutes, with an “as of” timestamp visible in the UI.
Nothing here is sacred. It is just comfortable to start with.
🧪 A short example
Product detail pages mix very different freshness needs. A description that is thirty to sixty seconds stale is usually fine. Price must be current in the cart.
Budget
• Ninety-five percent of reads are ten seconds old or fresher
• Ninety-nine percent within two minutes
Strategy
• Page shell at CDN uses SWR for thirty seconds for anonymous visitors
• Personalized block in the service cache uses SWR for fifteen seconds with single-flight
• Price and stock badge uses event invalidation and a soft TTL of two seconds during normal conditions and no SWR at checkout
Keys include tenant, currency, and locale.
Invalidation listens to a product update webhook for general data and a price topic for price-only keys.
Fallback when the price source is down shows “fetching price” and disables add to cart. It is conservative on purpose.
Observability shows hit ratio, percent served stale, refresh latency p95, and stampede count. An alert fires if more than twenty percent of responses are served stale for fifteen minutes straight.
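The "sustained" part of that alert is where people usually get tripped up, so here is one way to express it. It is a sketch with assumed inputs: a list of (timestamp in seconds, fraction served stale) samples, oldest first. Most monitoring tools have their own way to say this, so read it as the logic and not as a recommendation to hand-roll alerting.

```python
def sustained_breach(samples, threshold=0.20, window_s=900.0):
    """True when the stale fraction has stayed above threshold for at least
    window_s, measured up to the newest sample."""
    breach_start = None
    for ts, stale_fraction in samples:
        if stale_fraction > threshold:
            if breach_start is None:
                breach_start = ts  # first sample of the current breach
        else:
            breach_start = None    # any healthy sample resets the clock
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= window_s
```

A single healthy sample resets the clock, which keeps a brief spike from paging anyone.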
On paper, this mix looks messy. In production, it feels humane.
🚫 Anti-patterns with kinder alternatives
• One global TTL for everything feels simple and is quietly wrong. Some views can tolerate hours of staleness, and others need seconds. Move to per-flow budgets, and you will argue less.
• Personalization through a CDN without proper keying leaks language or currency across users. Include the real dimensions in the key or keep personalized fragments server-side.
• No single-flight means a thousand regenerations when a hot key expires. Turn on coalescing so one worker populates while others wait or serve stale.
• Caching errors with long TTLs hide problems. Negative cache for a few seconds, then retry with jitter so you do not create a hot miss storm.
• Blind invalidation on every write kills the hit ratio. Target keys by event and keep a small TTL to catch stragglers.
• Pretending stale is always safe leads to subtle harm. Use SWR where it is harmless and avoid it for totals or anything tied to money or compliance.
🗺️ Where to put the cache
You probably need more than one layer
• Browser is fastest and good for small personal bits with tiny TTLs
• CDN or edge suits anonymous pages and images where SWR shines
• Gateway can hold coarse responses keyed by tenant or locale and apply gentle SWR
• Service memory or Redis holds personalized fragments with exact keys and single-flight
• Database cache or materialized views handle heavy joins with event-driven refresh
Stack the layers. Keep truths about keys and invalidations in one place so layers do not contradict each other.
📊 Observability that tells the truth
If you only add one panel this week, make it Staleness Distribution. Seeing “% of responses by age bucket” calms a lot of debates. Then add:
• Hit ratio (overall + per key family)
• Coalesced vs. parallel regenerations
• Refresh failures and retry counts
• Evictions by cause (TTL vs invalidation vs capacity)
• SLO overlay: Is the cache helping you stay inside budget or hiding trouble?
Tiny nit: put the “as of” timestamp in the UI for any tile that uses stale data. Users are smarter than we think.
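If your metrics stack does not give you age buckets for free, the Staleness Distribution panel boils down to something like this. The bucket edges (1, 5, 30, and 120 seconds) are one choice, and the names are mine. It assumes you record the age of each response in seconds.

```python
from bisect import bisect_right
from collections import Counter

EDGES = [1, 5, 30, 120]  # bucket boundaries in seconds
LABELS = ["<1s", "1-5s", "5-30s", "30-120s", ">120s"]

def bucket(age_s):
    """Label for one response age. An age exactly on an edge goes to the higher bucket."""
    return LABELS[bisect_right(EDGES, age_s)]

def distribution(ages):
    """Percent of responses per age bucket, in bucket order."""
    counts = Counter(bucket(age) for age in ages)
    total = len(ages)
    if total == 0:
        return {label: 0.0 for label in LABELS}
    return {label: 100.0 * counts[label] / total for label in LABELS}
```

Plot the result per key family and the "is it too stale" debate turns into pointing at a bar.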
🧩 Mini-challenge
Pick one flow your users can feel (search, product detail, dashboard tile).
Fill a Cache Decision Card for it. Write the staleness numbers.
Choose SWR or TTL and set keys explicitly.
Add single-flight to prevent stampedes.
Negative-cache misses for 10s (or less) to quiet hot 404s.
Create a Staleness Distribution panel (buckets: <1s, 1–5s, 5–30s, 30–120s, >120s).
Ship behind a flag. Review after 3 days: did SLOs improve? Did origin CPU drop?
If the chart looks odd, it’s working; you’re finally seeing what users feel.
✅ Action step
Paste the Cache Decision Card into your service README.
Fill it for one endpoint and share a screenshot in your team channel.
Schedule a 20-minute check-in to adjust the numbers (don’t chase perfection).
If support gets fewer “why is it slow now?” tickets, keep going.
I sometimes start with shorter TTLs than I think I need, then lengthen once the dashboard looks sane. Other times, I do the opposite. Honest answer.
👋 Wrapping Up
• Write the staleness budget in numbers
• Pick SWR where it is harmless and TTL or events where honesty matters
• Key to reality
• Coalesce regenerations and use brief negative caching
• Watch staleness, not just hit ratio
Caching is not magic. It is a promise with a clock. Make the promise explicit, and you will not regret it.
Thanks for reading.
See you next week,
Bogdan Colța
Tech Architect Insights