👋 Hey {{first_name|there}},
Why this matters / where it hurts
Picture this. A developer ships a small change to the Recommendations service. Nothing dramatic. Just a tweak to the "You might also like" carousel. Twenty minutes pass. Then the alerts fire.
Login is down.
Customers hammer the sign-in button. Nothing happens. Revenue flatlines. The war room fills with engineers, and everyone asks the same question: how did a carousel update break authentication?
The post-mortem tells the story. Recommendations and Login shared a database connection pool. The new code held connections too long, starved the pool dry, and Login couldn't verify a single credential. A feature nobody would miss took down the one flow nobody can live without.
This is the blast radius problem, and it spreads through shared infrastructure like fire through dry brush. Last issue, we explored circuit breakers and how they stop retry storms from cascading across your system. This week, we go deeper. We're not just stopping the spread. We're building walls so the fire can never reach your critical paths in the first place.
🧭 Mindset shift
From: "All services share our common infrastructure for efficiency."
To: "Tier-1 paths get isolated resources. Nothing else touches them."
Efficiency sounds reasonable. Share the pools. Share the queues. Share the caches. Maximize utilization. Then your logging service has a bad day, and suddenly, checkout is unreachable. The efficiency argument collapses the moment you calculate the cost of that outage.
The bulkhead pattern comes from shipbuilding. Vessels have sealed compartments. Breach one, and water floods that section alone. The ship stays afloat because the damage stays contained. In software, bulkheads mean dedicated resources for your most critical paths, resources that nothing else can drain, saturate, or corrupt.
Defaults worth adopting:
Tier your services explicitly. Tier-1 means revenue dies if it dies. Tier-2 matters but won't stop the business. Tier-3 is nice to have.
Tier-1 flows get their own connection pools, their own thread pools, and, where budget allows, their own infrastructure.
Let Tier-3 services share with each other. Never with Tier-1.
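Those defaults are easy to state and easy to lose track of. One way to keep them honest is to make the tiering explicit in code. A minimal sketch — service names and tier assignments here are illustrative, not prescriptive:

```python
from enum import IntEnum

class Tier(IntEnum):
    T1 = 1  # revenue dies if it dies
    T2 = 2  # matters, but won't stop the business
    T3 = 3  # nice to have

# Hypothetical service registry for illustration.
SERVICE_TIERS = {
    "checkout": Tier.T1,
    "login": Tier.T1,
    "inventory": Tier.T2,
    "recommendations": Tier.T3,
    "reporting": Tier.T3,
}

def may_share(a: str, b: str) -> bool:
    """The default in one rule: Tier-1 resources are dedicated;
    lower tiers may share with each other, never with Tier-1."""
    if Tier.T1 in (SERVICE_TIERS[a], SERVICE_TIERS[b]):
        return a == b
    return True
```

A check like this can run in CI against your service manifest, so a new resource dependency that violates the tiering rule fails the build instead of failing in production.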
🧰 Tool of the week: Bulkhead Design Checklist
Use this when designing new critical paths or auditing existing ones.
Identify your Tier-1 flows. Login. Checkout. Payment confirmation. Core API endpoints. If they stop, money stops.
Map every shared resource. Database connection pools. HTTP client pools. Message queues. Cache clusters. Thread pools. Write down which services touch each one.
Flag cross-tier sharing. Any place a Tier-3 service shares a resource with Tier-1 is a blast radius waiting to detonate.
Separate connection pools first. Database connections cause most pool starvation incidents. Create dedicated pools for Tier-1 services. Set hard limits. Monitor them independently.
Isolate thread pools for async work. Background job backlogs should never starve your request-handling threads. Separate pools. Separate limits. Separate fates.
Consider dedicated infrastructure for Tier-1. Dedicated read replicas. Separate Redis clusters. Isolated load balancer pools. Costs rise. So does your ability to sleep through the night.
Cap Tier-3 consumption. Full isolation isn't always possible. When it isn't, constrain what lower-tier services can take. Connection pool maximums. Rate limits. CPU and memory quotas.
Monitor each bulkhead. Track pool utilization, queue depth, and latency for every isolated resource. Set alerts that fire before saturation, not after.
Test the boundaries. Chaos engineering isn't optional here. Kill Tier-3 resources. Verify Tier-1 stays healthy. If it doesn't, your bulkheads have holes.
Document the isolation map. A simple diagram showing which services use which pools. Update it during every architecture review. Stale documentation is almost as dangerous as no documentation.
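The thread-pool isolation item above is the cheapest one to demonstrate. Here is a minimal Python sketch — pool sizes and the simulated workloads are assumptions, not recommendations:

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Separate pools. Separate limits. Separate fates.
# (Sizes are hypothetical; tune to your workload.)
request_pool = ThreadPoolExecutor(max_workers=8, thread_name_prefix="tier1-req")
job_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="tier3-job")

def handle_request(user_id: int) -> str:
    # Fast, latency-sensitive work.
    return f"ok:{user_id}"

def slow_job() -> None:
    # Simulated background backlog item.
    time.sleep(0.2)

# Flood the background pool far beyond its capacity...
for _ in range(50):
    job_pool.submit(slow_job)

# ...and the request pool still answers immediately, because
# the two pools do not share a single thread.
result = request_pool.submit(handle_request, 42).result(timeout=1)
print(result)  # prints "ok:42"

job_pool.shutdown(wait=False, cancel_futures=True)
```

Run the same experiment with one shared pool and the `result(timeout=1)` call times out: the backlog has taken every worker. That's the whole argument for the bulkhead in ten lines.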
🔍 Example: E-commerce platform with a shared database pool
Scope: Isolate checkout from a recommendations service that keeps starving the connection pool.
Context: One PostgreSQL cluster. One connection pool with a maximum of 100 connections. Four services share it: Checkout, Recommendations, Inventory, and Reporting. During traffic spikes, Recommendations runs expensive queries that hog connections for seconds at a time.
The fix, step by step:
Tier the services. Checkout is Tier-1. Inventory is Tier-2. Recommendations and Reporting are Tier-3.
Create a dedicated pool for Checkout. Thirty connections. Separate credentials. Its own monitoring dashboard.
Create a shared pool for everything else. Fifty connections total, with per-service caps. Recommendations gets fifteen max. Reporting gets ten.
Reserve twenty connections for operations work and migrations.
Build dashboards showing utilization for each pool.
Set alerts at 70% for the Tier-1 pool. You want to know before it hurts.
Run chaos tests. Saturate the Tier-3 pool completely. Watch checkout latency. It should stay flat.
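In practice, these caps live in PgBouncer or your driver's pool settings, but the mechanics are worth seeing up close. Here is a sketch of the shared pool with hard per-service caps, modeled with semaphores. The Recommendations and Reporting caps match the numbers above; the Inventory cap of 25 is my assumption, since the example leaves Inventory in the shared pool without a stated limit:

```python
import threading
from contextlib import contextmanager

class CappedPool:
    """A shared pool with hard per-service caps, sketched with semaphores.
    A stand-in for what PgBouncer or driver pool settings do for real."""

    def __init__(self, total: int, caps: dict[str, int]):
        self._total = threading.BoundedSemaphore(total)
        self._caps = {svc: threading.BoundedSemaphore(n) for svc, n in caps.items()}

    @contextmanager
    def connection(self, service: str, timeout: float = 1.0):
        cap = self._caps[service]
        if not cap.acquire(timeout=timeout):
            raise TimeoutError(f"{service} hit its per-service cap")
        if not self._total.acquire(timeout=timeout):
            cap.release()
            raise TimeoutError("shared pool exhausted")
        try:
            yield f"conn-for-{service}"  # stand-in for a real DB connection
        finally:
            self._total.release()
            cap.release()

# Mirrors the example: 50 shared connections, Recommendations capped
# at 15, Reporting at 10. (Inventory's cap of 25 is hypothetical.)
shared = CappedPool(total=50, caps={"recommendations": 15, "reporting": 10, "inventory": 25})
```

The sixteenth simultaneous Recommendations checkout raises `TimeoutError` instead of draining the pool. Checkout never notices, because Checkout isn't in this pool at all.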
Small confession: We left Inventory in the shared pool. Separating it required code changes we didn't have bandwidth for. It's on the backlog now, and we'll get there. Sometimes, 80% isolation ships, and you iterate toward 100%.
Success signals: Checkout p99 latency holds steady during Recommendations traffic spikes. Alerts fire for Tier-3 pool pressure. Tier-1 stays quiet.
✅ Do this / avoid this
Do:
Tier services explicitly. Write it down. Make it official.
Give Tier-1 dedicated connection pools with hard, enforced limits.
Monitor each bulkhead on its own dashboard.
Test isolation with controlled failures before production tests it for you.
Start with database connections. They're the usual suspect.
Avoid:
Telling yourself you'll isolate later. You won't. Not until the outage forces your hand.
Sharing thread pools between request handling and background jobs. They compete. One loses. It's usually requests.
Giving Tier-3 services unlimited access to anything shared.
Assuming cloud auto-scaling eliminates the need for bulkheads. It doesn't. Scaling takes time. Pool starvation is instant.
Treating every service as equally critical. They aren't. Pretending otherwise is expensive.
🧪 Mini challenge
Goal: Map the blast radius risks hiding in one critical flow. Thirty to forty-five minutes.
Pick a Tier-1 flow. Login works. So does checkout. So does your core API.
List every shared resource it touches. Connection pools. Caches. Queues. Thread pools.
For each resource, identify every other service that uses it.
Find the riskiest cross-tier sharing: the Tier-3 service that sits in the same pool as your Tier-1 flow.
Sketch a simple isolation plan. What would you separate first? Why?
Bonus: estimate the cost and the effort.
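Once you've listed the resources and their consumers, the risky overlaps fall out mechanically. A small sketch of the mapping exercise — every resource name, service name, and tier below is hypothetical:

```python
# Hypothetical inventory: shared resource -> services that touch it.
RESOURCES = {
    "pg-main-pool": ["checkout", "recommendations", "inventory", "reporting"],
    "redis-cache": ["checkout", "inventory"],
    "jobs-queue": ["reporting", "recommendations"],
}

# Hypothetical tier assignments.
TIERS = {"checkout": 1, "inventory": 2, "recommendations": 3, "reporting": 3}

def blast_radius_risks(resources: dict, tiers: dict) -> list:
    """Flag every resource where a Tier-1 service shares with a lower tier."""
    risks = []
    for resource, services in resources.items():
        levels = {tiers[s] for s in services}
        if 1 in levels and len(levels) > 1:
            risks.append(resource)
    return sorted(risks)

print(blast_radius_risks(RESOURCES, TIERS))  # prints ['pg-main-pool', 'redis-cache']
```

Each flagged resource is a candidate for the isolation plan in step 5. Sort them by how much Tier-1 traffic they carry, and you have your priority order for free.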
Reply when you're done. Most teams discover at least one "wait, that shares a pool with checkout?" moment. I'd love to hear yours.
🎯 Action step for this week
Run the bulkhead checklist against your top two Tier-1 flows.
Create a shared document listing every cross-tier resource dependency.
Propose one concrete isolation improvement. Bring it to your team.
Add a "resource isolation" section to your architecture review template.
If bulkheads already exist, schedule a chaos test. Verify they hold.
By Friday: Have a documented map of your Tier-1 blast radius risks. Have one isolation improvement sitting in your backlog, prioritized.
👋 Wrapping up
Your Recommendations service should never break login. Ever. That's not a goal. That's a requirement.
Shared resources mean shared risk. Bulkheads transform shared risk into contained risk. Start with connection pools. They're the most common blast radius multiplier, and they're usually the cheapest to fix.
Efficiency matters. But efficiency is not worth a revenue outage. Not once. Not ever.
Want to build the instincts that catch these risks before they reach production? The free 5-day course will help: From Dev to Architect.
⭐ Most read issues (good place to start)
If you’re new here, these are the five issues readers keep coming back to:
Thanks for reading.
See you next week,
Bogdan Colța
Tech Architect Insights