👋 Hey there,
Most teams implement the happy path of a saga and skip the seventeen ways it can fail silently. This issue is about building compensations that don't quietly make things worse.
Why this matters / where it hurts
You've probably seen this one play out. A multi-service order flow: validate inventory, reserve stock, charge payment, schedule fulfillment, send confirmation. Five steps, five services, each one owns its own data. The first four succeed. Step five fails because the notification service is down. No big deal, right? Retry and move on.
Except the retry also fails. Meanwhile, the payment captured real money. The stock reservation is holding units other customers need. The saga coordinator logged the transaction as "in progress" forty minutes ago, and nobody's watching. You now have a zombie saga: not failed, not complete, just sitting there quietly making your data more wrong with every passing minute. When someone finally spots it, the compensation logic to unwind the payment runs twice because nobody made it idempotent. Now you've double-refunded the customer, and your finance team is asking hard questions.
In Lesson #35 on data consistency, we covered the outbox pattern for getting events reliably out of a service boundary. That solves the "did my event actually publish" problem. This week, we tackle what happens after publish: coordinating a chain of steps across services when any link can snap, and making sure the unwind path is as robust as the happy path.
🧭 The shift
From: "We need to handle the failure case" (a single rollback path sketched on a whiteboard)
To: "Every saga step has at least three failure modes, and each compensation is a distributed operation that can fail independently"
Most teams treat saga design as a happy path plus one rollback arrow. The actual production surface is much wider. A compensation is itself a distributed operation. It can time out, partially apply, run out of order, or execute more than once. If your compensations aren't designed with the same rigor as your forward steps, you haven't built a saga. You've built a system that makes messes faster than you can clean them up.
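To make "execute more than once" concrete, here is a minimal Python sketch of a refund compensation that is safe to run twice. Everything in it is an assumption for illustration: `RefundLedger` is a hypothetical in-memory stand-in for the payment service's refund store, and `RetryPolicy`, `compensate_payment`, and the key format are invented names, not part of any real library.

```python
import time
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RetryPolicy:
    """Retry settings owned by the compensation, not inherited from the forward step."""
    max_attempts: int = 3
    backoff_seconds: float = 0.5


@dataclass
class RefundLedger:
    """Hypothetical in-memory stand-in for the payment service's refund store."""
    applied: dict = field(default_factory=dict)  # idempotency key -> amount in cents

    def refund(self, idempotency_key: str, amount_cents: int) -> bool:
        """Apply the refund once. Returns False if this key was already applied."""
        if idempotency_key in self.applied:
            return False
        self.applied[idempotency_key] = amount_cents
        return True


def compensate_payment(ledger, saga_id, amount_cents, policy=RetryPolicy()):
    """Refund a captured payment; safe to call more than once for the same saga."""
    # The key is derived from the saga and the step, so every retry and every
    # duplicate delivery maps to the same refund.
    key = f"{saga_id}:charge_payment:refund"
    for attempt in range(1, policy.max_attempts + 1):
        try:
            return ledger.refund(key, amount_cents)
        except ConnectionError:
            if attempt == policy.max_attempts:
                raise  # surface the failure so the saga can be flagged, not lost
            time.sleep(policy.backoff_seconds * attempt)
```

Calling `compensate_payment` twice for the same saga moves money once; the second call returns `False`. In a real service the "already applied?" check and the write would need to be atomic (a unique constraint on the key, for example), and the per-call timeout would live on the actual client call, which this sketch does not model.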
The moment that shifts your thinking is when you realize the compensation for step 3 might fail, while the compensation for step 4 has already succeeded. Now your data is in a state that nobody designed for. That's not an edge case. In high-throughput systems, it's a Tuesday.
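One way to stop that state from being "undesigned" is to track each step's compensation on its own instead of keeping a single rolled-back flag. The sketch below is a rough illustration under assumptions: the class, status names, and step names are hypothetical, and a real coordinator would persist this rather than hold it in memory.

```python
from enum import Enum


class CompStatus(Enum):
    PENDING = "pending"
    DONE = "done"
    FAILED = "failed"


class SagaCompensationState:
    """Tracks each step's compensation separately, in whatever order results arrive."""

    def __init__(self, completed_steps):
        # Only steps that actually ran forward need unwinding.
        self.status = {step: CompStatus.PENDING for step in completed_steps}

    def record(self, step, succeeded):
        if step not in self.status:
            raise KeyError(f"unknown step: {step}")
        if self.status[step] is CompStatus.DONE:
            return  # DONE is final; ignore duplicate or late reports
        self.status[step] = CompStatus.DONE if succeeded else CompStatus.FAILED

    def outstanding(self):
        """Steps whose compensation still has to succeed."""
        return sorted(s for s, st in self.status.items() if st is not CompStatus.DONE)

    def overall(self):
        values = set(self.status.values())
        if values <= {CompStatus.DONE}:
            return "compensated"
        if CompStatus.FAILED in values:
            return "partially_compensated"
        return "compensating"
```

With this shape, "step 4 unwound, step 3 failed" is a named state (`partially_compensated`) with a concrete list of outstanding work, not something you reconstruct from logs during an incident.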
Every compensation is its own distributed operation. Give it a dedicated timeout, a retry policy, and an idempotency key. Don't inherit these from the forward step.
Compensations won't arrive in the order you expect. Step 4 might unwind before step 3 does, especially under load. Design for that from day one, not after the first incident.
If you can't query "show me all sagas stuck longer than X minutes," you have no idea what's broken right now. Saga state needs to be queryable, not just loggable.
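Here is roughly what "queryable, not just loggable" can look like, sketched with Python's built-in `sqlite3`. The table layout, status values, and the `find_stuck_sagas` helper are assumptions for illustration; your coordinator's schema will differ.

```python
import sqlite3
import time

# Hypothetical saga table: one row per saga, touched on every state transition.
SCHEMA = """
CREATE TABLE saga (
    saga_id      TEXT PRIMARY KEY,
    status       TEXT NOT NULL,   -- e.g. in_progress, compensating, completed
    current_step TEXT NOT NULL,
    updated_at   REAL NOT NULL    -- Unix epoch seconds of the last transition
);
CREATE INDEX saga_status_updated ON saga (status, updated_at);
"""


def find_stuck_sagas(conn, older_than_minutes, now=None):
    """Return (saga_id, status, current_step) for unfinished sagas idle past the threshold."""
    now = time.time() if now is None else now
    cutoff = now - older_than_minutes * 60
    return conn.execute(
        "SELECT saga_id, status, current_step FROM saga "
        "WHERE status IN ('in_progress', 'compensating') AND updated_at < ? "
        "ORDER BY updated_at",
        (cutoff,),
    ).fetchall()
```

Run something like `find_stuck_sagas(conn, 15)` on a schedule and alert on a non-empty result. The zombie saga from the opening story would have surfaced at minute fifteen, not whenever someone happened to look.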
📘 New Career Guide
I just finished a major update to the From Developer to Architect career guide. It now includes a self-assessment rubric, a week-by-week 90-day growth plan, architecture artifact templates, and interview prep frameworks. If you're actively working toward a Staff, Tech Lead, or Architect role, this is the structured roadmap.
Free download here: https://www.techarchitectinsights.com/from-developer-to-architect-free-career-guide