👋 Hey {{first_name|there}},
A strange thing happens in many systems. Two services talk to each other; both are well written, and still the integration feels chaotic. Clients do not know which errors deserve a retry. APIs invent a new status code shape for every feature. Support tickets repeat the same vague line you have seen before: "It failed again. Please try later."
I used to think this was just about better code. It rarely is. It is about the language your services use when they fail. If the language is random, everything downstream becomes guesswork. If the language is clear, even simple clients behave well. The system starts to feel steady. Not perfect, yet steady.
Today is a practical pass at that language. We will write an error contract that sits next to each service and becomes the rule everyone follows. It is one page. It connects to earlier issues in a very direct way. Timeouts and retries only make sense if errors tell the truth. Backpressure needs a clean way to say not now. SLOs and error budgets ask you to be honest about user impact. Dual run compares outputs and needs consistent failure shapes. Even caching is easier when errors are predictable. All of that starts with a small document and a few boring choices that you agree to in advance.
🧭 Mindset shift
From: Every service invents its own error style, and clients guess what to do
To: A simple shared contract where classes of errors carry clear behavior
An error should tell a client what action is safe. Try again with patience. Ask the user to fix something. Do not retry. That is the core idea. If you remember only that, you will already make better choices.
A few small inversions help
• Status codes matter less than semantics. A pretty code that does not guide behavior is not helpful
• Perfect detail in every response is impossible. A short taxonomy used everywhere is better than five exquisite schemas used once each
• Error bodies are for humans and for logs. Error classes are for machines. Keep both, but write them with different goals in mind
• Consistency across teams beats creativity inside one team. Boring is a feature here
🎯 Want to learn how to design systems that make sense, not just work?
If this resonated, the new version of my free 5-Day Crash Course – From Developer to Architect will take you deeper into:
Mindset Shift - From task finisher to system shaper
Design for Change - Build for today, adapt for tomorrow
Tradeoff Thinking - Decide with context, not dogma
Architecture = Communication - Align minds, not just modules
Lead Without the Title - Influence decisions before you’re promoted
It’s 5 short, focused lessons designed for busy engineers, and it’s free.
Now let’s continue.
🧰 Tool of the week: Error Contract Spec
Keep this as a single page in each service. Fill it once. Use it in reviews, client libraries, and incident debriefs. If a field feels hard, that is the conversation you needed.
1) Classes
Every error belongs to exactly one class. Clients act on the class without guessing
• Retryable
Temporary problem that may succeed later. Examples include timeouts, connection resets, partner flaps, and acknowledged overload. Safe action is a retry with exponential backoff and jitter within a small attempt budget
• User actionable
The caller or end user must change something. Examples include validation failures, insufficient permissions, missing required fields, and business rule denials. Safe action is to surface guidance and stop automatic retries
• Policy blocked
The system refuses on purpose. Examples include rate limits, quota exceeded, maintenance windows, regional blocks, and feature flags off. Safe action is to wait or downgrade. Automatic retries are not helpful unless a retry after is present
That is it. Three classes. No more. If you find yourself inventing a fourth, check if one of these would work with a clearer message
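To keep clients from comparing raw strings, the three classes can be pinned down in code. A minimal Python sketch; the enum name is illustrative, not part of the spec, while the values match the envelope below:

```python
from enum import Enum

class ErrorClass(str, Enum):
    """The three error classes. The value is exactly what appears in the envelope."""
    RETRYABLE = "retryable"
    USER_ACTIONABLE = "user_actionable"
    POLICY_BLOCKED = "policy_blocked"
```

Because the enum inherits from `str`, serializing it into a JSON body needs no extra work.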
2) Shape
All errors share the same envelope. Keep it small so you will actually use it
{
  "class": "retryable | user_actionable | policy_blocked",
  "code": "ERROR_KEY",
  "message": "short description",
  "retry_after_ms": 0,
  "correlation_id": "uuid-or-trace-id",
  "details": {
    "optional_details": "only if truly needed"
  }
}
Notes to keep us honest
• The field named class drives client logic
• The code is stable and short. Examples include RATE_LIMITED, VALIDATION_FAILED, PARTNER_TIMEOUT
• Message is for humans and logs. Keep it brief and safe
• Retry after in milliseconds is zero when not applicable. When present, it means please back off at least this long
• Correlation ID must thread through logs and traces
• Details are optional and should stay small
3) Mapping from status code to class
Do not rely only on status codes. Map them to classes inside the contract so clients do not have to reverse engineer intent
• 408, 429 with retry after, 500, 502, 503, 504 map to retryable
• 400, 401, 403, 404, 409, 412 map to user actionable
• 429 without retry after, 423, 451 map to policy blocked
If your platform uses only a subset, keep the mapping explicit anyway. You can adjust later without changing the envelope
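The mapping above can be captured in one function so clients never reverse engineer intent from raw status codes. A sketch under the assumption that the caller knows whether a retry after hint was present:

```python
def status_to_class(status: int, retry_after_present: bool = False) -> str:
    """Map an HTTP status code to an error class per the contract above."""
    if status == 429:
        # 429 with a retry after hint is safe to retry; without one it is a policy signal
        return "retryable" if retry_after_present else "policy_blocked"
    if status in (408, 500, 502, 503, 504):
        return "retryable"
    if status in (400, 401, 403, 404, 409, 412):
        return "user_actionable"
    if status in (423, 451):
        return "policy_blocked"
    # Conservative default for anything unmapped: do not auto-retry the unknown
    return "user_actionable"
```

The fallback choice is deliberate: an unmapped code should stop automatic retries rather than amplify an incident.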
4) Client behavior
State the default behavior once, so it is not improvised in a panic
• Retryable
Two attempts total per request path, owned by the caller. Exponential backoff with jitter. Budget per route documented in the service readme. If a retry after is present, wait at least that amount. Respect the timeout ladder from the Timeouts and Retries issue, so inner hops fail sooner than outer hops
• User actionable
Do not retry automatically. Surface the message. If an error code has a known remediation, link it in the docs. Consider structured hints in the details field for first-class clients
• Policy blocked
If a retry after is present, a single delayed attempt is allowed. Otherwise, downgrade or shed. This is where brownouts live. You can return a partial or cached response when safe
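All three defaults can live in one small caller-owned helper, so no client improvises them under pressure. A minimal Python sketch; `send` is a stand-in for your transport call, and the returned strings are placeholders for real client actions:

```python
import random
import time

MAX_ATTEMPTS = 2  # total attempts per request path, owned by the caller

def call_with_contract(send, base_delay_s: float = 0.2) -> str:
    """Call send() (returns the error envelope dict, or None on success)
    and act on the error class exactly as the contract prescribes."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        error = send()
        if error is None:
            return "ok"
        cls = error["class"]
        wait_s = error.get("retry_after_ms", 0) / 1000.0
        if cls == "user_actionable":
            # Surface guidance; never auto-retry
            return f"show_user:{error['code']}"
        if cls == "policy_blocked":
            if wait_s > 0 and attempt < MAX_ATTEMPTS:
                time.sleep(wait_s)                 # one delayed attempt is allowed
                continue
            return f"downgrade:{error['code']}"    # otherwise shed or degrade
        if cls == "retryable" and attempt < MAX_ATTEMPTS:
            backoff = base_delay_s * (2 ** (attempt - 1))
            # Jittered exponential backoff, never shorter than the server's hint
            time.sleep(max(wait_s, backoff * random.uniform(0.7, 1.3)))
            continue
        return f"give_up:{error['code']}"
    return "give_up"
```

Note that the helper waits at least `retry_after_ms` even on retryable errors, so a server hint always wins over the local backoff schedule.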
5) Downgrade and fallback guidance
Write one or two lines for the few places that deserve a softer landing
• Search results may omit personal recommendations when the downstream is policy blocked
• Checkout may capture intent now and attempt settlement later when the partner is retryable but rate-limited
• Analytics tiles should display an as-of timestamp and use cached data when the downstream is policy blocked
6) Observability
Errors that cannot be seen will be reinvented
• A panel that shows counts by class and by code
• A second panel that shows caller behavior next to errors. Retries made. Attempts dropped. Shed count
• A trace tag for error class and code
• Alerts that trigger on sudden spikes in retryable or on bursts of user actionable errors for a single endpoint or tenant
7) Versioning and governance
An error contract is a living document
• Changes require a short note in the service readme and a version bump in the client library
• Removal of a code requires a deprecation window and a migration note
• The spec owner is a person with a name, not the air around the service
📔 Defaults that play nicely with the rest of your toolkit
• Timeouts and retries only at one layer by default. The caller owns the budget, and other layers pass through
• Backoff with jitter always. Even a modest thirty percent jitter prevents synchronized spikes
• Retry budgets tied to SLOs. If the error budget is burning quickly, spend fewer attempts and shed earlier
• Rate limits that return policy blocked with a retry after when possible. Clients behave better when the server is explicit
• Brownout switches for non-core features that flip when policy-blocked spikes. Protect the user-visible core
• Idempotency for write paths. If a write has any chance of retried delivery, it requires an idempotency key and stores outcomes
I realize this sounds rigid. In practice, it creates room to move quickly because teams stop renegotiating basic behavior every week
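The idempotency default is the one most worth sketching, because retried delivery of writes is where duplicate charges come from. A toy Python illustration; the in-memory dict stands in for a durable outcome store, and all names are hypothetical:

```python
import hashlib
import json

_outcomes: dict[str, dict] = {}  # in practice a durable store, not process memory

def idempotency_key(order_id: str, action: str) -> str:
    """Derive a stable key from the business identity of the write."""
    raw = json.dumps({"order_id": order_id, "action": action}, sort_keys=True)
    return hashlib.sha256(raw.encode()).hexdigest()

def charge_once(order_id: str, charge_fn) -> dict:
    """Run the charge at most once per key; replays return the stored outcome."""
    key = idempotency_key(order_id, "charge")
    if key in _outcomes:
        return _outcomes[key]  # retried delivery: return the recorded result
    result = charge_fn(order_id)
    _outcomes[key] = result
    return result
```

With this in place, a retryable error on the response path is safe: the retry hits the stored outcome instead of charging twice.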
✍ A worked example
Consider an Order API that calls a payment provider and an inventory service. Users care that order creation completes within the SLO and that charges are not duplicated. The Order API exposes an error contract with the three classes and a small set of codes
• PAYMENT_PROVIDER_TIMEOUT
class retryable, message short and plain, retry after present during known partner incidents
• INVENTORY_CONFLICT
class user actionable, message describes the out of stock condition, no automatic retry, client shows a prompt
• RATE_LIMITED
class policy blocked, retry after set to a small window, clients may queue locally or offer an email when ready path
• PARTNER_DOWN_FOR_MAINTENANCE
class policy blocked, message human-friendly, clients switch to a degraded flow or simply say not now
Client behavior becomes obvious. The frontend will not hammer during maintenance. The backend will not stack retries on top of timeouts. Operations can see in one chart that retryable spiked because the partner flapped, and that retry after was respected. If a partial path is allowed, such as authorize now and capture later, the contract says so in the downgrade guidance. Nothing fancy. Just agreed behavior that stops guesswork.
🔄 Anti-patterns with gentler alternatives
• Rich status codes without a class. The client cannot act with confidence. Add the class and keep the code as flavor
• Hidden retries at multiple layers. The same request turns into six attempts without anyone noticing. Choose a single owner for retries and turn the others off
• Overloaded policy blocked that hides the difference between rate limits and hard blocks. If every policy event looks the same, teams cannot tune behavior. Split the codes and keep the classes stable
• Text-only errors. Logs may be readable, but clients still guess. Keep a short machine field for class and code
• Blanket retries on 4xx. Most of these are user-actionable. Retries waste time and money and increase noise
• Silence about retry after. If you know a safe window, say it. Clients are kinder when you are explicit
🧪 Mini challenge
Goal: replace guesswork with one clear contract on a noisy path today.
Pick scope
• One service with frequent on-call noise
• Two endpoints that cause most of the tickets or alerts
Snapshot current behavior
• Pull the last seven days for those endpoints
• Note top status codes, retry counts, and p95 latency
• Save one example trace with a correlation ID
Draft the Error Contract Spec
• Assign each endpoint two or three codes and exactly one class per code
Retryable, User actionable, or Policy blocked
• Write the shared envelope fields you will return
class, code, message, retry_after_ms, correlation_id, details
• Map current status codes to classes so clients never guess
Implement in a test environment
• Return the envelope for those endpoints
• Set retry_after_ms when you mean “wait”
• Thread correlation_id through logs and traces
Add a tiny client helper
• Retryable → attempts 2 with exponential backoff and jitter
• User actionable → surface guidance, no automatic retry
• Policy blocked → respect retry_after once or downgrade
Make it visible
• Dashboard tiles: counts by class and by code
• Caller behavior: retries made, attempts dropped, shed count
Run one scenario
• Simulate a partner timeout or a rate-limit event
• Observe retries, total wait time, and resulting user path
Share the outcome
• Post a screenshot of the dashboard
• Write one observation and one change you will keep
Timebox: 40 minutes end-to-end. Good enough beats perfect.
✅ Action step for this week
• Copy the template
Add the Error Contract Spec to the service README and your API docs
• Set the defaults once
Decide the three classes, status-to-class mapping, and the retry budget
Name the single layer that owns retries
• Ship the envelope
Return the shared structure from two endpoints in production behind a flag
• Publish a client helper
A small function that handles classes exactly once for consumers of this service
• Wire observability
Add a panel for counts by class and code
Add an alert for sudden spikes in Retryable or User actionable per endpoint or tenant
• Assign ownership
Write the spec owner’s name and a review date in 14 days
Note how you will deprecate or rename codes
• Announce the change
Post in the team channel with links to the README, helper, and dashboard
Ask client teams to adopt the helper by a specific date and offer help if needed
👋 Wrapping Up
• Use three classes. Retryable. User actionable. Policy blocked
• Share one small envelope for every error
• Map status codes to classes so clients do not guess
• Let one layer own retries with backoff and jitter
• Tie budgets to SLOs and let policy signals drive brownouts
• Make the dashboard show classes and codes so everyone can see the story
It will not remove every surprise. It will remove a lot of noise. And perhaps that is enough for a calmer week.
Thanks for reading.
See you next week,
Bogdan Colța
Tech Architect Insights