👋 Hey there,

A worker can pass every health check while doing exactly zero work. Here's how to stop that from happening to a service you own.

Why this matters / where it hurts

We had a worker running in production for months. It pulled events off a queue and pushed them into a downstream system. Nobody touched it because it just worked, until one Tuesday it didn't.

A thread inside it died and the pod stayed up. The /health endpoint kept returning 200, CPU and memory looked normal, the restart count never moved. We didn't have a queue depth chart because we'd never gotten around to building one. Six hours later a support engineer pinged me asking where some customer's thing was.

That's what I'd call a dark service. Technically up. Every standard signal green. And it has quietly stopped doing the one job it was built to do. You don't find out from your monitoring stack. You find out from a customer, or from a downstream team wondering why their data hasn't moved in a while, or, more often than I'd like, from luck.
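To make the failure mode concrete, here is a minimal sketch of a health check tied to the work loop itself rather than to the process being alive. The class name `ProgressHeartbeat`, its methods, and the 30-second threshold are illustrative assumptions, not the code from the worker described above.

```python
import time


class ProgressHeartbeat:
    """Tracks when the work loop last ran (hypothetical helper)."""

    def __init__(self, max_silence_seconds: float, clock=time.monotonic):
        self._max_silence = max_silence_seconds
        self._clock = clock  # injectable so the behavior can be tested
        self._last_progress = clock()

    def record_progress(self) -> None:
        # Called from inside the processing loop, by the thread doing the work.
        self._last_progress = self._clock()

    def seconds_since_progress(self) -> float:
        return self._clock() - self._last_progress

    def health_status(self) -> int:
        # 200 only if the loop ran recently; 503 if it has gone silent.
        if self.seconds_since_progress() <= self._max_silence:
            return 200
        return 503


# Usage with a fake clock: the loop stops, and the status flips to 503.
fake_now = [0.0]
heartbeat = ProgressHeartbeat(max_silence_seconds=30, clock=lambda: fake_now[0])
heartbeat.record_progress()
fake_now[0] = 45.0  # 45 seconds pass with no call from the work loop
print(heartbeat.health_status())
```

The design choice that matters is who calls `record_progress`. If the thread that does the work is the only caller, a dead thread shows up as a failing health check. Calling it on every loop iteration, including empty polls, keeps an idle queue from looking like a dead worker, though it still needs a queue depth signal next to it to catch a loop that spins without processing anything.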

In Lesson #38 on distributed tracing, I argued that traces are how you cut MTTR once you're already inside an incident. This lesson sits upstream of that. It's about not walking into the incident blind in the first place. The idea isn't particularly novel: every service should ship with an observability contract, in the same spirit as its API contract.

🧭 The shift

From: instrumentation is something you bolt on when things break

To: instrumentation is part of the contract the service signs by existing

If you can't observe a service, you don't actually know it's working. You're trusting that the lack of complaints means the lack of problems, and that holds until the day it doesn't. On that day you'll build a dashboard, which is the wrong time to build a dashboard.

I think the move that helps most is making the observability contract a deliverable that lives in the ADR, alongside the API design. That puts it in front of reviewers before anyone writes a line of business logic, which is roughly the only time anyone has the patience to argue about telemetry.

  • Treat the observability contract as something the ADR has to answer, not something the runbook fills in later.

  • A new service doesn't ship until whoever's going to be on-call has signed off on what they'll see when it misbehaves.

  • If a metric only lives in someone's head, it doesn't exist.
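One way to make those three rules checkable is to write the contract down as data that a review can inspect. The sketch below is an assumption about what such a record might hold; the field names, the `gaps` method, and the example service and metric names are all hypothetical, not a standard format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ObservabilityContract:
    """A hypothetical record of what a service promises to expose."""

    service: str
    # Metrics that prove the service is doing its one job, not just running.
    work_signals: List[str] = field(default_factory=list)
    # Alerts that fire when that work stops.
    alerts: List[str] = field(default_factory=list)
    # Name of the on-call engineer who reviewed and accepted the above.
    oncall_signoff: str = ""

    def gaps(self) -> List[str]:
        """Return the reasons this contract is not yet acceptable."""
        problems = []
        if not self.work_signals:
            problems.append("no metric proves the service is doing its job")
        if not self.alerts:
            problems.append("no alert fires when the work stops")
        if not self.oncall_signoff:
            problems.append("on-call has not signed off")
        return problems

    def ready_to_ship(self) -> bool:
        return not self.gaps()


# Usage: an empty contract fails review with an explicit list of gaps.
contract = ObservabilityContract(service="event-forwarder")
for problem in contract.gaps():
    print(problem)
```

Keeping a record like this next to the ADR means a reviewer reads a list of gaps instead of asking someone what they remember.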
