👋 Hey {{first_name|there}},
Most teams implement the happy path of a saga and skip the seventeen ways it can fail silently. This issue is about building compensations that don't quietly make things worse.
Why this matters / where it hurts
You've probably seen this one play out. A multi-service order flow: validate inventory, reserve stock, charge payment, schedule fulfillment, send confirmation. Five steps, five services, each one owns its own data. The first four succeed. Step five fails because the notification service is down. No big deal, right? Retry and move on.
Except the retry also fails. Meanwhile, the payment captured real money. The stock reservation is holding units other customers need. The saga coordinator logged the transaction as "in progress" forty minutes ago, and nobody's watching. You now have a zombie saga: not failed, not complete, just sitting there quietly making your data more wrong with every passing minute. When someone finally spots it, the compensation logic to unwind the payment runs twice because nobody made it idempotent. Now you've double-refunded the customer, and your finance team is asking hard questions.
In Lesson #35 on data consistency, we covered the outbox pattern for getting events reliably out of a service boundary. That solves the "did my event actually publish" problem. This week, we tackle what happens after publish: coordinating a chain of steps across services when any link can snap, and making sure the unwind path is as robust as the happy path.
🧭 The shift
From: "We need to handle the failure case" (a single rollback path sketched on a whiteboard)
To: "Every saga step has at least three failure modes, and each compensation is a distributed operation that can fail independently"
Most teams treat saga design as a happy path plus one rollback arrow. The actual production surface is much wider. A compensation is itself a distributed operation. It can time out, partially apply, run out of order, or execute more than once. If your compensations aren't designed with the same rigor as your forward steps, you haven't built a saga. You've built a system that makes messes faster than you can clean them up.
The moment that shifts your thinking is when you realize the compensation for step 3 might fail, while the compensation for step 4 has already succeeded. Now your data is in a state that nobody designed for. That's not an edge case. In high-throughput systems, it's a Tuesday.
Every compensation is its own distributed operation. Give it a dedicated timeout, a retry policy, and an idempotency key. Don't inherit these from the forward step.
Compensations won't arrive in the order you expect. Step 4 might unwind before step 3 does, especially under load. Design for that from day one, not after the first incident.
If you can't query "show me all sagas stuck longer than X minutes," you have no idea what's broken right now. Saga state needs to be queryable, not just loggable.
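That first bullet is easier to see in code. Here's a minimal sketch in Python — `CompensationSpec`, `run_compensation`, and the `action` callback signature are all illustrative names I'm inventing for this example, not any specific framework's API. The point is that the compensation owns its own timeout, retry budget, and idempotency key:

```python
import time
from dataclasses import dataclass


@dataclass
class CompensationSpec:
    """Policy owned by the compensation itself, not inherited from the forward step."""
    timeout_s: float = 2.0
    max_retries: int = 3
    backoff_s: float = 0.5


def run_compensation(action, saga_id: str, step: int, spec: CompensationSpec):
    # The idempotency key is stable across retries: same saga + step => same key,
    # so the downstream service can deduplicate repeated attempts.
    idempotency_key = f"{saga_id}:{step}:compensate"
    last_error = None
    for attempt in range(spec.max_retries + 1):
        try:
            return action(idempotency_key=idempotency_key, timeout=spec.timeout_s)
        except TimeoutError as exc:  # a timeout is an expected failure mode, not a crash
            last_error = exc
            time.sleep(spec.backoff_s * (2 ** attempt))  # exponential backoff
    # Retries exhausted: surface the failure for dead-letter routing, don't swallow it.
    raise RuntimeError(f"compensation failed for {idempotency_key}") from last_error
```

Notice that the key is derived from the saga instance and step, so every retry presents the same key downstream. That one design choice is what makes "run it twice" safe.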
📘 New Career Guide
I just finished a major update to the From Developer to Architect career guide. It now includes a self-assessment rubric, a week-by-week 90-day growth plan, architecture artifact templates, and interview prep frameworks. If you're actively working toward a Staff, Tech Lead, or Architect role, this is the structured roadmap.
Free download here: https://www.techarchitectinsights.com/from-developer-to-architect-free-career-guide
🧰 Tool of the week: Saga Health Checklist
Audit any saga before it hits production:
Timeout per step - Every forward step and every compensation has an explicit timeout. No step inherits a global default silently. Document the timeout value and what happens when it fires (retry? compensate? alert?).
Compensation idempotency - Each compensation can run two, three, or ten times with the same saga-instance ID and produce the same outcome. Test this explicitly: trigger the compensation twice in your integration suite and assert the end state matches.
Dead-letter routing - When a step or compensation exhausts its retries, it lands in a dead-letter queue with full saga context: saga ID, step number, timestamp, payload, and failure reason. Not just the raw message.
Zombie detection - A scheduled job or health check queries for sagas stuck in "in-progress" longer than your maximum expected duration. It alerts, not auto-resolves. Humans triage zombies until you trust your automation.
Out-of-order safety - Compensations are safe to execute regardless of which later steps have or haven't been completed. Validate this by running compensations in reverse order and in random order in tests.
Partial completion visibility - Your saga state store records which steps completed, which are compensated, and which are pending. An on-call engineer can look at one row and know exactly where the saga stopped.
Poison message isolation - A message that causes a compensation to crash repeatedly gets quarantined after N attempts. It doesn't block other sagas in the same queue. This is separate from dead-letter routing: dead letters are expected failures, poison messages are unexpected crashes.
Manual intervention runbook - For each saga type, a documented procedure exists for an engineer to manually complete or compensate a stuck saga. Include the exact database queries or API calls. Don't assume the person running it wrote the saga.
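The zombie-detection item is the one I'd implement first, because it's a single query. Here's a sketch against a relational saga state store — I'm using SQLite and an assumed `saga_state` table (`saga_id`, `current_step`, `status`, `started_at`) for a self-contained example; adapt the schema to whatever your store actually looks like:

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def find_zombies(conn: sqlite3.Connection, max_age_minutes: int):
    """Return sagas stuck 'in_progress' longer than max_age_minutes, oldest first."""
    cutoff = datetime.now(timezone.utc) - timedelta(minutes=max_age_minutes)
    return conn.execute(
        "SELECT saga_id, current_step, started_at FROM saga_state "
        "WHERE status = 'in_progress' AND started_at < ? "
        "ORDER BY started_at",
        (cutoff.isoformat(),),  # ISO-8601 strings compare correctly lexicographically
    ).fetchall()
```

Wire this into a scheduled job that alerts (not auto-resolves), and you've covered the "show me all sagas stuck longer than X minutes" question from the shift section.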
🔍 In practice: Three-service booking flow
Scenario: A travel platform processes bookings across three services: flight reservation, hotel reservation, and payment capture. Traffic spikes hard during holiday sale events. The saga coordinator is an orchestrator service using an outbox-backed event bus.
Scope: Flight → Hotel → Payment. Compensation path: reverse payment → cancel hotel → release flight.
Context: Team of six, ~2,000 bookings/hour during peaks, PostgreSQL-backed saga state store.
Step 1 (timeout per step): Flight reservation timeout set to 4 seconds (the downstream GDS API is slow). If it times out, no compensation is needed because nothing was reserved yet. Hotel and payment timeouts are 2 seconds each.
Step 2 (idempotency test): During load testing, we deliberately fired the hotel cancellation compensation three times for the same saga. The first call cancelled the reservation. The second and third returned 200 with no state change. We almost missed this, though, because our initial implementation checked "is this reservation active?" and threw a 404 on the second call. Had to switch to "cancel if active, acknowledge if already cancelled."
Step 3 (zombie detection): A cron job runs every 5 minutes and flags any saga that has sat in "in-progress" for more than 10 minutes. During the first holiday spike, we found 23 zombies in one hour. Most were hotel-service timeouts that never triggered the compensation path because the orchestrator's retry queue had backed up.
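The fix from step 2 — "cancel if active, acknowledge if already cancelled" — is worth spelling out, because the broken version looks so reasonable. A sketch (the in-memory store and status strings are illustrative stand-ins for the hotel service's real state):

```python
RESERVATIONS = {"res-42": "active"}  # illustrative stand-in for the reservation store


def cancel_reservation(reservation_id: str) -> dict:
    """Idempotent cancellation: the first call cancels, repeats acknowledge."""
    status = RESERVATIONS.get(reservation_id)
    if status is None:
        # An unknown ID is a real error, distinct from "already cancelled".
        return {"http_status": 404, "state": "unknown"}
    if status == "active":
        RESERVATIONS[reservation_id] = "cancelled"
        return {"http_status": 200, "state": "cancelled"}
    # Already cancelled: acknowledge instead of raising 404, so a retried
    # compensation converges to the same end state instead of erroring.
    return {"http_status": 200, "state": "cancelled"}
```

Our buggy first version collapsed the last two branches into "404 if not active" — which made every retried compensation look like a failure and kept the saga stuck.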
The tradeoff we accepted: We built manual intervention tooling before automated zombie resolution. That meant an engineer had to SSH in and run a script for each stuck saga during the first two spikes. Slow, but safe. We didn't trust automated resolution yet, and given what we'd seen with the idempotency bug, I think that was the right call.
Result: After the checklist audit and fixes, stuck sagas dropped from ~23/hour during peaks to 1-2/day. Mean time to resolve the remaining ones went from "whenever someone notices" to under 8 minutes, because the state store made them visible and the runbook told the on-call engineer exactly what to do.
✅ Do this / ❌ Avoid this
Do this:
Store saga state in a queryable store (a database table, not just message headers) with step-level granularity.
Test compensations under duplication. Your CI pipeline should include a test that fires every compensation twice and asserts convergence.
Set per-step timeouts that reflect actual downstream latency, not optimistic defaults. Measure first, then set.
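The duplication test from the second bullet can be a tiny reusable helper rather than a per-saga one-off. A sketch (the helper name and signature are mine, not from any test framework):

```python
def assert_compensation_converges(compensation, snapshot, runs: int = 2):
    """Fire the same compensation repeatedly and assert the end state converges.

    `compensation` is a zero-arg callable; `snapshot` returns a comparable
    copy of the relevant state after each run.
    """
    compensation()
    state_after_first = snapshot()
    for _ in range(runs - 1):
        compensation()  # simulate duplicate delivery
        assert snapshot() == state_after_first, "compensation is not idempotent"
    return state_after_first
```

A naive refund like `balance -= amount` fails this immediately, while "set refunded = true" passes — which is exactly the property you want CI to enforce.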
Avoid this:
Relying on the message broker's built-in retry as your only compensation trigger. Broker retries are for transient blips. Compensations are business logic, and they need real orchestration to run correctly.
Assuming compensations will execute in reverse order of the forward steps. They won't. Network delays, consumer lag, partition rebalances - ordering goes out the window under real traffic.
Logging a saga as "failed" with no record of which steps were completed and which were compensated. When your on-call engineer opens that row outside of business hours, a single status column tells them almost nothing.
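For the ordering point: you can test order independence directly by running the compensation set in every permutation and asserting the end state matches. A sketch (again, the helper and the state shape are illustrative):

```python
import itertools


def assert_order_independent(compensations, make_state, apply):
    """Run the compensation set in every order and assert the end state matches."""
    baseline = None
    for order in itertools.permutations(compensations):
        state = make_state()          # fresh state per ordering
        for comp in order:
            apply(state, comp)
        if baseline is None:
            baseline = state
        assert state == baseline, f"order {order} diverged from baseline"
    return baseline
```

Compensations that *set* a terminal state ("cancelled", "refunded") pass this naturally; compensations that *mutate* relative to current state (decrement, toggle) usually don't, and that's the design smell this test surfaces.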
🎯 This week's move
Pick one multi-service flow in your system that behaves like a saga (even if nobody calls it that). Map the forward steps and ask: does each one have an explicit compensation?
Check whether your compensations are idempotent. If there's no test proving it, assume they aren't. Write the test.
Try querying your system for long-running "in-progress" transactions right now. Can you? If the answer is "not really" or "I'd have to dig through logs," that's the first thing to fix.
Write a one-page runbook for manually resolving a stuck instance of your most common saga.
By the end of this week, aim to have a saga state query running that surfaces any transaction stuck longer than your expected maximum duration, with an alert attached.
👋 Wrapping up
Sagas break in production not because the forward path is hard to build, but because nobody gives the unwind path the same attention. Compensations time out. They run twice. They arrive out of order. And when they do, your system ends up in states nobody designed for.
Give your compensations the same rigor you'd give a payment flow. Honestly, that's what they are.
Help a friend think like an architect
Know someone making the jump from developer to architect? Forward this email or share your personal link. When they subscribe, you unlock rewards.
🔗 Your referral link: {{rp_refer_url}}
📊 You've referred {{rp_num_referrals}} so far.
Next unlock: {{rp_next_milestone_name}} referrals → {{rp_num_referrals_until_next_milestone}}
View your referral dashboard
P.S. I’m still working on two new rewards. If there’s something you’d like to see, let me know 😉
⭐ Good place to start
I just organized all 40 lessons into four learning paths. If you've missed any or want to send a colleague a structured starting point, here's the page.
Thanks for reading.
See you next week,
Bogdan Colța
Tech Architect Insights