👋 Hey {{first_name|there}},
Why this matters / where it hurts
Picture this. A developer ships a small change to the Recommendations service. Nothing dramatic. Just a tweak to the "You might also like" carousel. Twenty minutes pass. Then the alerts fire.
Login is down.
Customers hammer the sign-in button. Nothing happens. Revenue flatlines. The war room fills with engineers, and everyone asks the same question: how did a carousel update break authentication?
The post-mortem tells the story. Recommendations and Login shared a database connection pool. The new code held connections too long, starved the pool dry, and Login couldn't verify a single credential. A feature nobody would miss took down the one flow nobody can live without.
This is the blast radius problem, and it spreads through shared infrastructure like fire through dry brush. Last issue, we explored circuit breakers and how they stop retry storms from cascading across your system. This week, we go deeper. We're not just stopping the spread. We're building walls so the fire can never reach your critical paths in the first place.
🧭 Mindset shift
From: "All services share our common infrastructure for efficiency."
To: "Tier-1 paths get isolated resources. Nothing else touches them."
Efficiency sounds reasonable. Share the pools. Share the queues. Share the caches. Maximize utilization. Then your logging service has a bad day, and suddenly, checkout is unreachable. The efficiency argument collapses the moment you calculate the cost of that outage.
The bulkhead pattern comes from shipbuilding. Vessels have sealed compartments. Breach one, and water floods that section alone. The ship stays afloat because the damage stays contained. In software, bulkheads mean dedicated resources for your most critical paths, resources that nothing else can drain, saturate, or corrupt.
Defaults worth adopting:
Tier your services explicitly. Tier-1 means revenue dies if it dies. Tier-2 matters but won't stop the business. Tier-3 is nice to have.
Tier-1 flows get their own connection pools, their own thread pools, and, where budget allows, their own infrastructure.
Let Tier-3 services share with each other. Never with Tier-1.
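Those defaults are easy to state and easy to lose track of. One way to keep them honest is to make the tiering explicit in code. A minimal sketch — service names and tier assignments here are illustrative, not prescriptive:

```python
from enum import IntEnum

class Tier(IntEnum):
    T1 = 1  # revenue dies if it dies
    T2 = 2  # matters, but won't stop the business
    T3 = 3  # nice to have

# Hypothetical service registry for illustration.
SERVICE_TIERS = {
    "checkout": Tier.T1,
    "login": Tier.T1,
    "inventory": Tier.T2,
    "recommendations": Tier.T3,
    "reporting": Tier.T3,
}

def may_share(a: str, b: str) -> bool:
    """The default in one rule: Tier-1 resources are dedicated;
    lower tiers may share with each other, never with Tier-1."""
    if Tier.T1 in (SERVICE_TIERS[a], SERVICE_TIERS[b]):
        return a == b
    return True
```

A check like this can run in CI against your service manifest, so a new resource dependency that violates the tiering rule fails the build instead of failing in production.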
🧰 Tool of the week: Bulkhead Design Checklist
Use this when designing new critical paths or auditing existing ones.
Identify your Tier-1 flows. Login. Checkout. Payment confirmation. Core API endpoints. If they stop, money stops.
Map every shared resource. Database connection pools. HTTP client pools. Message queues. Cache clusters. Thread pools. Write down which services touch each one.
Flag cross-tier sharing. Any place a Tier-3 service shares a resource with Tier-1 is a blast radius waiting to detonate.
Separate connection pools first. Database connections cause most pool starvation incidents. Create dedicated pools for Tier-1 services. Set hard limits. Monitor them independently.
Isolate thread pools for async work. Background job backlogs should never starve your request-handling threads. Separate pools. Separate limits. Separate fates.
Consider dedicated infrastructure for Tier-1. Dedicated read replicas. Separate Redis clusters. Isolated load balancer pools. Costs rise. So does your ability to sleep through the night.
Cap Tier-3 consumption. Full isolation isn't always possible. When it isn't, constrain what lower-tier services can take. Connection pool maximums. Rate limits. CPU and memory quotas.
Monitor each bulkhead. Track pool utilization, queue depth, and latency for every isolated resource. Set alerts that fire before saturation, not after.
Test the boundaries. Chaos engineering isn't optional here. Kill Tier-3 resources. Verify Tier-1 stays healthy. If it doesn't, your bulkheads have holes.
Document the isolation map. A simple diagram showing which services use which pools. Update it during every architecture review. Stale documentation is almost as dangerous as no documentation.
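The thread-pool isolation item above is the cheapest one to demonstrate. Here is a minimal Python sketch — pool sizes and the simulated workloads are assumptions, not recommendations:

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Separate pools. Separate limits. Separate fates.
# (Sizes are hypothetical; tune to your workload.)
request_pool = ThreadPoolExecutor(max_workers=8, thread_name_prefix="tier1-req")
job_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="tier3-job")

def handle_request(user_id: int) -> str:
    # Fast, latency-sensitive work.
    return f"ok:{user_id}"

def slow_job() -> None:
    # Simulated background backlog item.
    time.sleep(0.2)

# Flood the background pool far beyond its capacity...
for _ in range(50):
    job_pool.submit(slow_job)

# ...and the request pool still answers immediately, because
# the two pools do not share a single thread.
result = request_pool.submit(handle_request, 42).result(timeout=1)
print(result)  # prints "ok:42"

job_pool.shutdown(wait=False, cancel_futures=True)
```

Run the same experiment with one shared pool and the `result(timeout=1)` call times out: the backlog has taken every worker. That's the whole argument for the bulkhead in ten lines.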
🔍 Example: E-commerce platform with a shared database pool
Scope: Isolate checkout from a recommendations service that keeps starving the connection pool.
Context: One PostgreSQL cluster. One connection pool with a maximum of 100 connections. Four services share it: Checkout, Recommendations, Inventory, and Reporting. During traffic spikes, Recommendations runs expensive queries that hog connections for seconds at a time.
The fix, step by step:
Tier the services. Checkout is Tier-1. Inventory is Tier-2. Recommendations and Reporting are Tier-3.
Create a dedicated pool for Checkout. Thirty connections. Separate credentials. Its own monitoring dashboard.
Create a shared pool for everything else. Fifty connections total, with per-service caps. Recommendations gets fifteen max. Reporting gets ten.
Reserve twenty connections for operations work and migrations.
Build dashboards showing utilization for each pool.
Set alerts at 70% for the Tier-1 pool. You want to know before it hurts.
Run chaos tests. Saturate the Tier-3 pool completely. Watch checkout latency. It should stay flat.
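In practice, these caps live in PgBouncer or your driver's pool settings, but the mechanics are worth seeing up close. Here is a sketch of the shared pool with hard per-service caps, modeled with semaphores. The Recommendations and Reporting caps match the numbers above; the Inventory cap of 25 is my assumption, since the example leaves Inventory in the shared pool without a stated limit:

```python
import threading
from contextlib import contextmanager

class CappedPool:
    """A shared pool with hard per-service caps, sketched with semaphores.
    A stand-in for what PgBouncer or driver pool settings do for real."""

    def __init__(self, total: int, caps: dict[str, int]):
        self._total = threading.BoundedSemaphore(total)
        self._caps = {svc: threading.BoundedSemaphore(n) for svc, n in caps.items()}

    @contextmanager
    def connection(self, service: str, timeout: float = 1.0):
        cap = self._caps[service]
        if not cap.acquire(timeout=timeout):
            raise TimeoutError(f"{service} hit its per-service cap")
        if not self._total.acquire(timeout=timeout):
            cap.release()
            raise TimeoutError("shared pool exhausted")
        try:
            yield f"conn-for-{service}"  # stand-in for a real DB connection
        finally:
            self._total.release()
            cap.release()

# Mirrors the example: 50 shared connections, Recommendations capped
# at 15, Reporting at 10. (Inventory's cap of 25 is hypothetical.)
shared = CappedPool(total=50, caps={"recommendations": 15, "reporting": 10, "inventory": 25})
```

The sixteenth simultaneous Recommendations checkout raises `TimeoutError` instead of draining the pool. Checkout never notices, because Checkout isn't in this pool at all.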
Small confession: We left Inventory in the shared pool. Separating it required code changes we didn't have bandwidth for. It's on the backlog now, and we'll get there. Sometimes, 80% isolation ships, and you iterate toward 100%.
Success signals: Checkout p99 latency holds steady during Recommendations traffic spikes. Alerts fire for Tier-3 pool pressure. Tier-1 stays quiet.
✅ Do this / avoid this
Do:
Tier services explicitly. Write it down. Make it official.
Give Tier-1 dedicated connection pools with hard, enforced limits.
Monitor each bulkhead on its own dashboard.
Test isolation with controlled failures before production tests it for you.
Start with database connections. They're the usual suspect.
Avoid:
Telling yourself you'll isolate later. You won't. Not until the outage forces your hand.
Sharing thread pools between request handling and background jobs. They compete. One loses. It's usually requests.
Giving Tier-3 services unlimited access to anything shared.
Assuming cloud auto-scaling eliminates the need for bulkheads. It doesn't. Scaling takes time. Pool starvation is instant.
Treating every service as equally critical. They aren't. Pretending otherwise is expensive.
🧪 Mini challenge
Goal: Map the blast radius risks hiding in one critical flow. Thirty to forty-five minutes.
Pick a Tier-1 flow. Login works. So does checkout. So does your core API.
List every shared resource it touches. Connection pools. Caches. Queues. Thread pools.
For each resource, identify every other service that uses it.
Find the riskiest cross-tier sharing: the Tier-3 service that sits in the same pool as your Tier-1 flow.
Sketch a simple isolation plan. What would you separate first? Why?
Bonus: estimate the cost and the effort.
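Once you've listed the resources and their consumers, the risky overlaps fall out mechanically. A small sketch of the mapping exercise — every resource name, service name, and tier below is hypothetical:

```python
# Hypothetical inventory: shared resource -> services that touch it.
RESOURCES = {
    "pg-main-pool": ["checkout", "recommendations", "inventory", "reporting"],
    "redis-cache": ["checkout", "inventory"],
    "jobs-queue": ["reporting", "recommendations"],
}

# Hypothetical tier assignments.
TIERS = {"checkout": 1, "inventory": 2, "recommendations": 3, "reporting": 3}

def blast_radius_risks(resources: dict, tiers: dict) -> list:
    """Flag every resource where a Tier-1 service shares with a lower tier."""
    risks = []
    for resource, services in resources.items():
        levels = {tiers[s] for s in services}
        if 1 in levels and len(levels) > 1:
            risks.append(resource)
    return sorted(risks)

print(blast_radius_risks(RESOURCES, TIERS))  # prints ['pg-main-pool', 'redis-cache']
```

Each flagged resource is a candidate for the isolation plan in step 5. Sort them by how much Tier-1 traffic they carry, and you have your priority order for free.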
Reply when you're done. Most teams discover at least one "wait, that shares a pool with checkout?" moment. I'd love to hear yours.
🎯 Action step for this week
Run the bulkhead checklist against your top two Tier-1 flows.
Create a shared document listing every cross-tier resource dependency.
Propose one concrete isolation improvement. Bring it to your team.
Add a "resource isolation" section to your architecture review template.
If bulkheads already exist, schedule a chaos test. Verify they hold.
By Friday: Have a documented map of your Tier-1 blast radius risks. Have one isolation improvement sitting in your backlog, prioritized.
👋 Wrapping up
Your Recommendations service should never break login. Ever. That's not a goal. That's a requirement.
Shared resources mean shared risk. Bulkheads transform shared risk into contained risk. Start with connection pools. They're the most common blast radius multiplier, and they're usually the cheapest to fix.
Efficiency matters. But efficiency is not worth a revenue outage. Not once. Not ever.
Want to build the instincts that catch these risks before they reach production? The free 5-day course will help: From Dev to Architect.
⭐ Most read issues (good place to start)
If you’re new here, these are the five issues readers keep coming back to:
Thanks for reading.
See you next week,
Bogdan Colța
Tech Architect Insights