SLOs & Error Budgets - Ship With Guardrails

👋 Hey there,

You Can’t Manage What You Don’t Define

Have you ever noticed how teams argue forever about “Is the system healthy?”
One person points to CPU graphs. Another points to error counts. Someone else says users are fine. Meanwhile, release trains slip because nobody knows when it’s actually safe to ship, or when to slow down.

Here’s the truth architects learn the hard way:

❝

Speed without guardrails burns trust.
Guardrails without speed burn momentum.

You need both. And the way you get both is SLOs (Service Level Objectives) and error budgets: simple, explicit rules about the reliability you promise and the risk you can spend to improve the product.

In the previous issue (#20), we learned to protect the core path with backpressure: admission control, brownouts, and fast failure. SLOs and error budgets are the steering wheel for those tools: they tell you when to throttle, what to brown out, and when to freeze releases so you don’t burn the house down.

Let’s make reliability a number your team can actually use.

🧭 The Mindset Shift

From: “Ship until it breaks; fix it when it does.”
To: “Ship within the reliability budget so we stay fast and trustworthy.”

Most teams think of reliability as a vibe: “We’re mostly up,” “Latency feels okay.” Architects turn vibes into contracts:

SLI (Service Level Indicator): how you measure user experience (e.g., request success rate, p95 latency).
SLO (Service Level Objective): the target (e.g., 99.9% success over 30 days).
Error Budget: the allowed failure that remains (e.g., with 99.9% success, you can “spend” 0.1% failure).

That budget is not a shame metric; it’s fuel. You spend it on risky changes, learn fast, and back off before users notice. When you’re overspending, you slow down and stabilize. Clear. Honest. Actionable.

📔 Why SLOs + Error Budgets Change Behavior

Clarity for product: We agree on which user outcomes matter and how to trade speed for quality explicitly.
Focus for engineering: We optimize what counts (checkout success, search latency), not vanity metrics.
Release discipline: Green budget? Move. Burning hot? Pause, fix, then go.
Incident sanity: Debates stop. The question becomes: “How much budget did we burn?” and “What knob do we turn now?”

🧰 Tool: The Error-Budget Planner

Use this planner to define SLOs that matter, compute your budget, and wire it into daily decisions. It’s deliberately short so you’ll actually use it.

1) Pick One User-Visible SLI

Choose a signal that users feel:

Availability SLI: successful_requests / total_requests (e.g., HTTP 2xx/3xx, or business-level “checkout succeeded”).
Latency SLI: % of requests below X ms (e.g., 95% under 300 ms).
Freshness / Staleness SLI: % of reads fresher than Y seconds.
Task Success SLI: % of transactions that complete (cart→payment→receipt).

Tip: If you’re not sure, start with the core path from Issue #20 (login, checkout, pay). Reliability is your brand.

Write it down:

❝

SLI: “Checkout success = successful checkouts / checkout attempts over rolling 30 days.”
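A minimal sketch of that formula in Python, assuming you can already count attempts and successes for the window. The function name and the choice to treat an empty window as healthy are illustrative assumptions, not something the planner prescribes.

```python
def checkout_success_sli(successful: int, attempts: int) -> float:
    """Return the availability SLI as a fraction between 0.0 and 1.0."""
    if attempts < 0 or successful < 0 or successful > attempts:
        raise ValueError("counts must satisfy 0 <= successful <= attempts")
    if attempts == 0:
        # Assumption: no traffic means no failures, so the window counts as healthy.
        return 1.0
    return successful / attempts


# Hypothetical counts for a rolling 30-day window.
sli = checkout_success_sli(successful=99_950, attempts=100_000)
print(f"Checkout success SLI: {sli:.3%}")
```

Whatever you count as "successful" (HTTP status or a business-level receipt) is the real design decision; the arithmetic stays this small.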

2) Set the SLO (Be Honest)

Pick the objective you can hold, not the one that flatters you. Common targets:

99.0% (two nines) → lenient, early-stage, or internal tools
99.9% (three nines) → most SaaS core flows
99.99% (four nines) → critical payments, auth, or APIs with strong ops

Availability example:

SLO: 99.9% checkout success over 30 days.

❝

30 days = 43,200 minutes.
0.1% of 43,200 = 43.2 minutes of allowed failure (your error budget).

If you choose 99.95%, the budget = 21.6 minutes.
At 99.99%, budget = 4.32 minutes.
Pick a bar you can meet reliably and improve later.
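The arithmetic above fits in a few lines. This sketch treats the budget as minutes of total outage, which is the simplification used here; a request-based SLI would budget failed requests instead. The function name is an illustrative assumption.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full failure the SLO tolerates over the window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    window_minutes = window_days * 24 * 60  # 30 days -> 43,200 minutes
    return (1.0 - slo) * window_minutes


for slo in (0.999, 0.9995, 0.9999):
    print(f"{slo:.2%} over 30 days -> {error_budget_minutes(slo):.2f} minutes")
```

Running it for a few candidate targets before you commit makes the cost of each extra nine concrete.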

3) Break the Budget Into Guardrails

Your error budget is the maximum allowed pain before you must slow down. Make it operational:

Freeze Rule:

❝

If we burn 50% of the monthly budget in 7 days, freeze releases until we claw it back.

Brownout Rule:

❝

If p95 latency ≥ target for 10 minutes, degrade non-core features (recommendations, heavy widgets).
If burn rate stays high, broaden brownout (more features dimmed or disabled).

Rollback Rule:

❝

Any release that spikes failures > X% for Y minutes auto-rolls back.

Release Cadence:

❝

Budget healthy and trending green? Accelerate risky changes (canaries, flags).
Budget hot? Stabilize: bug-fix focus, test coverage, on-call runbooks.

Write it down: Yes/no rules. No vibes.
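One way to keep the rules yes/no is to write each one as a function that returns a boolean. This is a sketch under assumptions: the function and parameter names are made up for illustration, and the rollback thresholds are left as arguments because X and Y are yours to choose.

```python
def should_freeze(burned_last_7_days_min: float, monthly_budget_min: float,
                  freeze_fraction: float = 0.5) -> bool:
    """Freeze rule: half the monthly budget burned within 7 days."""
    return burned_last_7_days_min >= freeze_fraction * monthly_budget_min


def should_brownout(minutes_p95_at_or_over_target: float,
                    sustain_minutes: float = 10.0) -> bool:
    """Brownout rule: p95 latency at or above target for 10 minutes."""
    return minutes_p95_at_or_over_target >= sustain_minutes


def should_rollback(failure_rate: float, minutes_elevated: float,
                    max_failure_rate: float, sustain_minutes: float) -> bool:
    """Rollback rule: failures above X% for Y minutes."""
    return failure_rate > max_failure_rate and minutes_elevated >= sustain_minutes


# Hypothetical readings; 2% and 5 minutes stand in for the unspecified X and Y.
print(should_freeze(24.0, 43.2))   # 24 minutes burned vs. a 21.6-minute threshold
print(should_brownout(12.0))
print(should_rollback(0.035, 6.0, max_failure_rate=0.02, sustain_minutes=5.0))
```

If a rule cannot be written this way, it is probably still a vibe.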

4) Calculate & Communicate With Examples

Example A - Availability SLO

SLO: 99.9% success over 30 days → 43.2 minutes budget.
Incident 1: 12 minutes of user-visible checkout errors.
Incident 2: 6 minutes of intermittent failures.
→ Budget used: 18 minutes (42%).
Status: Healthy, proceed with canary releases.

Example B - Latency SLO

SLO: 95% requests under 300 ms over 7 days.
Monday’s launch pulls the SLI down to 94% for 3 hours.
Tuesday stabilizes at 96%.
Decision: Spend a small slice of budget on rollout v2 (green). If another dip occurs, trigger brownout per rules.

Giving product these numbers turns “feels slow” into “we spent 20% of the budget; next risky change waits 48 hours.”
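Example A can be replayed as a tiny budget ledger. This is a sketch, assuming incidents are logged as minutes of user-visible failure; the function name and the dictionary keys are illustrative.

```python
def budget_status(budget_minutes: float, incident_minutes: list[float]) -> dict[str, float]:
    """Summarize how much of the error budget the logged incidents consumed."""
    used = sum(incident_minutes)
    return {
        "used_minutes": used,
        "used_pct": 100.0 * used / budget_minutes,
        "remaining_minutes": budget_minutes - used,
    }


# Example A: 43.2-minute budget, incidents of 12 and 6 minutes.
status = budget_status(43.2, [12, 6])
print(f"Used {status['used_minutes']} min ({status['used_pct']:.0f}%), "
      f"{status['remaining_minutes']:.1f} min remaining")
```

The same three numbers are what you would put in front of product during planning.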

5) Wire SLOs Into Observability (Decisions, Not Just Dashboards)

Golden Signals per SLI: success rate, p95 latency, traffic, saturation.
Budget Widgets: remaining minutes (availability) or remaining percentile slack (latency) front-and-center.
Burn Alerts: “At current burn rate, budget exhausted in 72 hours.”
Per-Feature Telemetry: core vs. non-core—so brownouts target the right features first.
Per-Tenant/Region Views: avoid one big customer/region burning everyone’s budget unnoticed.

You don’t need fancy tooling to start. A single Grafana or Datadog panel that says “Error Budget Remaining (30d)” changes behavior overnight.
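A burn alert like the one above is a linear projection: divide what is left by how fast you are spending it. The sketch below assumes you can read the minutes burned over a recent lookback window; the function names and the sample numbers are hypothetical.

```python
from typing import Optional


def hours_until_exhausted(remaining_minutes: float, burned_minutes: float,
                          observed_hours: float) -> Optional[float]:
    """Project hours until the budget hits zero at the recently observed burn rate."""
    if observed_hours <= 0:
        raise ValueError("observed_hours must be positive")
    burn_per_hour = burned_minutes / observed_hours
    if burn_per_hour <= 0:
        return None  # nothing is burning, so there is no exhaustion time to report
    return max(remaining_minutes, 0.0) / burn_per_hour


def burn_alert(remaining_minutes: float, burned_minutes: float,
               observed_hours: float) -> str:
    hours = hours_until_exhausted(remaining_minutes, burned_minutes, observed_hours)
    if hours is None:
        return "Budget is not burning right now."
    return f"At current burn rate, budget exhausted in {hours:.0f} hours."


# Hypothetical reading: 25.2 minutes left, 2.1 minutes burned in the last 6 hours.
print(burn_alert(25.2, 2.1, 6))
```

A short lookback reacts quickly but is noisy; a longer one is calmer but slower. Which window you alert on is a judgment call this sketch does not make for you.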

6) Make Release & Incident Rules Boring and Visible

Add a living section to your runbook/README:

If remaining budget > 70% → green lane: canaries, risky features allowed.
If 30–70% → cautious lane: staged rollouts only, extra eyes on on-call.
If < 30% → stabilize lane: freeze risky changes; brownout non-core if burn persists.
If < 10% → hard freeze; postmortem & reliability sprint.

Hook these to your feature flag and deploy tools so the lane is visible at deploy time. Every engineer should see the lane before clicking “release.”
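The lanes translate directly into a lookup your deploy tool could call. A minimal sketch: the thresholds come from the list above, while the enum, the function name, and the decision to put exactly 30% and 70% in the cautious lane are assumptions.

```python
from enum import Enum


class Lane(Enum):
    GREEN = "green"
    CAUTIOUS = "cautious"
    STABILIZE = "stabilize"
    HARD_FREEZE = "hard freeze"


def lane_for(remaining_pct: float) -> Lane:
    """Map the remaining error budget (as a percentage) to a release lane."""
    # Check the strictest threshold first: anything under 10% is also under 30%.
    if remaining_pct < 10:
        return Lane.HARD_FREEZE
    if remaining_pct < 30:
        return Lane.STABILIZE
    if remaining_pct <= 70:
        return Lane.CAUTIOUS
    return Lane.GREEN


for pct in (85, 58, 22, 5):
    print(f"{pct}% remaining -> {lane_for(pct).value} lane")
```

Printing the lane name in the deploy confirmation prompt is usually enough to make engineers look before they click.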

7) Tie SLOs to Backpressure (Issue #20)

Backpressure was your mechanical safety net. SLOs are your policy brain:

Admission Control: When the budget is low, tighten per-tenant limits or shrink concurrency on non-core paths.
Brownouts: Use SLO burn to decide when to dim/disable expensive features.
Circuit Breakers: Open them earlier if the burn is hot; serve cached or partial responses.
Retry Policy: Back off aggressively when SLOs dip to avoid retry storms.

SLOs tell you when to use each lever. Backpressure is how you use it.
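To make "policy brain" concrete, here is a sketch of a table that maps each lane to backpressure settings. Every number in it is a hypothetical placeholder, and the field names are invented for illustration; the only point is that the knobs tighten as the budget shrinks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackpressurePolicy:
    non_core_concurrency: int       # max in-flight requests on non-core paths
    brownout_non_core: bool         # dim or disable expensive non-core features
    breaker_error_threshold: float  # error rate at which circuit breakers open
    max_retries: int                # fewer retries when the budget is hot


# Hypothetical values only; tune them against your own traffic.
POLICIES = {
    "green": BackpressurePolicy(100, False, 0.50, 3),
    "cautious": BackpressurePolicy(50, False, 0.30, 2),
    "stabilize": BackpressurePolicy(20, True, 0.15, 1),
    "hard_freeze": BackpressurePolicy(10, True, 0.10, 0),
}


def policy_for(lane: str) -> BackpressurePolicy:
    """Return the backpressure settings for a lane, failing loudly on typos."""
    try:
        return POLICIES[lane]
    except KeyError:
        raise ValueError(f"unknown lane: {lane!r}") from None


print(policy_for("stabilize"))
```

Keeping the table in one reviewed file means a lane change flips several levers at once instead of relying on someone remembering each one mid-incident.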

8) Handle Common SLO Traps

Trap: Vanity SLIs (CPU, disk).
Fix: Only SLIs users feel (success, latency, freshness).
Trap: Too Many SLOs.
Fix: Start with one core-path SLO; add only when enforced.
Trap: Setting Four Nines Day One.
Fix: Choose a bar you can hold. Earn more nines with practice.
Trap: Silent Multi-Tenant Burn.
Fix: Segment SLOs by tenant/region; noisy neighbors shouldn’t torch everyone’s budget.
Trap: SLOs With No Consequence.
Fix: Pre-commit freeze/brownout/rollback rules. Announce them. Follow them.
Trap: Ignoring Latency Distribution.
Fix: Percentiles (p95/p99) and tail health matter more than averages.
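The last trap is easy to see with numbers. In this sketch (synthetic latencies, a hand-rolled nearest-rank percentile, and helper names that are all assumptions), the average sits under a 300 ms target while one request in ten is painfully slow.

```python
import math
from statistics import mean


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def fraction_under(samples: list[float], threshold_ms: float) -> float:
    """Share of requests strictly faster than the threshold (the latency SLI)."""
    return sum(1 for s in samples if s < threshold_ms) / len(samples)


# Synthetic data: 90 fast requests and 10 very slow ones.
latencies_ms = [100.0] * 90 + [2000.0] * 10

print(f"mean = {mean(latencies_ms):.0f} ms")
print(f"p95  = {percentile(latencies_ms, 95):.0f} ms")
print(f"under 300 ms = {fraction_under(latencies_ms, 300):.0%}")
```

A dashboard showing only the mean would call this healthy; the latency SLI shows the 95% target being missed.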

9) Make It Real in 60 Minutes (Action Plan)

In your next engineering/product sync:

Pick one SLI for your core path (availability or latency).
Set the SLO you can hold this quarter (e.g., 99.9% or 95%<300ms).
Compute the error budget (e.g., 43.2 minutes for 99.9%/30d).
Publish three rules: freeze, brownout, rollback (copy/paste from above).
Add one budget widget to your main dashboard and a lane banner in your deploy tool.
Tell support what brownout messaging to use when budgets run hot.
Review weekly: green? speed up; yellow? stage carefully; red? stabilize.

If you do only this, you’ll already be ahead of the many teams still arguing about “health” on Slack.

✅ Mini-Challenge (30–45 minutes)

Goal: Turn reliability from vibes into numbers your team uses this week.

Name one SLI users actually feel
Pick a single core-path outcome (e.g., “checkout success” or “95% requests <300ms”). Write the exact formula in one sentence.
Set the SLO you can hold this quarter
Choose a realistic bar (e.g., 99.9% success / 30 days). Calculate the error budget in minutes (e.g., 43.2 min).
Publish three guardrails (copy/paste into your team channel)
- Freeze: Burn >50% of budget in 7 days → freeze risky releases.
- Brownout: If p95 or success dips for 10 min → dim/disable non-core features.
- Rollback: Any release that spikes failures >X% for Y min auto-rolls back.
Put the number where everyone can see it
Add a simple dashboard tile: Error Budget Remaining (30d) and pin it to your deploy tool or home dashboard.
Practice one decision with the number
Look at today’s budget:
- Green (>70%) → approve one canary.
- Yellow (30–70%) → stage rollout; add extra eyes.
- Red (<30%) → stabilize; run a brownout drill on one non-core widget.

❝

Share a screenshot of your SLI/SLO + budget widget with your team. If it sparks a debate, good: you’re replacing opinion with an operating model.

🔍 A Concrete Story: The Checkout SLO That Stopped Whiplash Releases

A retail team’s checkout SLO: 99.9% success/30 days (budget 43.2 minutes).

Week 1: Two incidents burn 16 minutes in total. Remaining budget: 27.2 minutes (63%).
Product wants a major pricing-engine change. The team checks the lane: Cautious (63% is under the 70% green line), so staged rollouts only. They canary to 5%, watch the success SLI, then roll forward safely.
Week 3: An upstream tax API flaps; they burn another 12 minutes. Remaining: 15.2 minutes (35%), close to the 30% stabilize line. Per the rules (Issue #20), they brown out a heavy “You may also like” widget, which reduces latency, and defer another risky feature until next week.
End of the month: Budget remaining 12 minutes. Users happy. Velocity preserved. No heroics.

The difference wasn’t magic; it was guardrails everyone agreed to.

👋 Wrapping Up

Define an SLI that the user feels.
Pick an SLO you can hold (earn nines over time).
Spend your error budget consciously—speed up when green, stabilize when burning.
Let SLOs drive your backpressure levers (Issue #20).
Publish the rules. Follow them. Repeat.

That’s how you move fast and keep trust.

Thanks for reading.

See you next week,
Bogdan Colța
Tech Architect Insights