👋 Hey {{first_name|there}},
Why this matters / where it hurts
You’ve probably seen this one. The API is fine most of the day, then it starts timing out in bursts. CPU is not crazy. The database looks healthy enough. Yet p95 climbs and the on-call channel gets that familiar tone. Someone suggests adding retries. Someone else suggests bumping the timeout. You do it because you need the fire to die down.
Last week, we talked about pagination melting down on deep pages.
https://www.techarchitectinsights.com/p/pagination-that-doesn-t-die-at-page-2000
That pain looks like “the database is slow,” but it’s really wasted work. Locking incidents feel similar. The database is not always busy. Sometimes it’s just waiting behind one long transaction that blocks everything else.
I used to treat locks like a deep database topic. Something I would understand later, when I had time. Then we hit a week where “sometimes it hangs” became a daily incident. It was not random. It was waiting. A few long transactions held locks, and everything behind them queued until it couldn’t.
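If you’ve never watched it happen, it’s worth reproducing once on a scratch database. Here’s a rough sketch, assuming Postgres and psycopg2; the accounts table, the id = 1 row, and the connection string are all made up for the demo.

```python
import psycopg2
from psycopg2 import errors

DSN = "dbname=app user=oncall host=localhost"  # placeholder connection string

blocker = psycopg2.connect(DSN)  # session A: the long transaction
victim = psycopg2.connect(DSN)   # session B: everyone queued behind it

with blocker.cursor() as cur:
    # Take a row lock and then... do nothing. No commit, no rollback.
    cur.execute("UPDATE accounts SET balance = balance + 1 WHERE id = 1")

with victim.cursor() as cur:
    cur.execute("SET lock_timeout = '5s'")  # fail fast instead of hanging forever
    try:
        cur.execute("UPDATE accounts SET balance = balance - 1 WHERE id = 1")
    except errors.LockNotAvailable:
        print("blocked: session A still holds the row lock")

blocker.rollback()  # releasing the lock is all it takes to unblock session B
victim.rollback()
blocker.close()
victim.close()
```

Session B isn’t slow and the database isn’t busy. It’s just waiting on session A.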
We’ll treat locking as a first-class part of system behavior, like latency budgets and backpressure. This issue gives you one simple triage runbook, one recognizable example, and a few defaults that keep lock pain from surprising you.
🧭 Mindset shift
From: “The database is slow sometimes; we will tune it later.”
To: “Most ‘random’ DB slowness is waiting. Find the blocker, then reduce lock time.”
Lock incidents feel mysterious because waiting hides. Queries look idle. Threads sit. The system is not doing a lot of work. It is stuck behind a lock held by a transaction that is doing something slightly too long.
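The waiting is visible once you know where to look. On Postgres, for example, blocked sessions show up in pg_stat_activity with a Lock wait event even though they look like they’re doing nothing. A rough sketch (the connection string is a placeholder):

```python
import psycopg2

# Sessions that look "busy" but are actually just waiting on a lock.
WAITING_SQL = """
SELECT pid,
       wait_event_type,
       wait_event,
       now() - query_start AS waiting_for,
       query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock'
ORDER BY waiting_for DESC;
"""

def show_lock_waits(dsn: str) -> None:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(WAITING_SQL)
        for row in cur.fetchall():
            print(row)

if __name__ == "__main__":
    show_lock_waits("dbname=app user=oncall host=localhost")  # placeholder DSN
```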
Two rules that make this manageable
1. Do not optimize the victim first. Find the blocker (there’s a triage sketch right after these rules).
2. Reduce lock duration before you chase micro-optimizations.
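Here’s roughly what “find the blocker” looks like on Postgres (9.6+ for pg_blocking_pids). Treat it as a sketch, not a polished runbook; the connection string is a placeholder. The point is that you query for who is doing the blocking, not for the slow-looking victim.

```python
import psycopg2

# For every waiting session, show who is blocking it. The blocker is often
# 'idle in transaction': it already did its work and is just holding locks.
TRIAGE_SQL = """
SELECT waiting.pid                  AS waiting_pid,
       waiting.query                AS waiting_query,
       now() - waiting.query_start  AS waiting_for,
       blocker.pid                  AS blocking_pid,
       blocker.state                AS blocking_state,
       now() - blocker.xact_start   AS blocking_xact_age,
       blocker.query                AS blocker_last_query
FROM pg_stat_activity AS waiting
JOIN LATERAL unnest(pg_blocking_pids(waiting.pid)) AS b(pid) ON true
JOIN pg_stat_activity AS blocker ON blocker.pid = b.pid
ORDER BY blocking_xact_age DESC;
"""

def find_blockers(dsn: str) -> None:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(TRIAGE_SQL)
        for row in cur.fetchall():
            print(row)

if __name__ == "__main__":
    find_blockers("dbname=app user=oncall host=localhost")  # placeholder DSN
```

Once you have the blocking pid, you can decide: wait it out, fix the code path that left the transaction open, or, in a real fire, terminate it with pg_terminate_backend(pid).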
A small extra rule I like, even though it feels strict
If a transaction can run for seconds, treat it like a batch job. Give it a throttle and an escape hatch.
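Here’s a sketch of what that looks like, assuming Postgres, psycopg2, and a hypothetical events table being purged. Each batch is its own short transaction, lock_timeout is the escape hatch, and a small sleep is the throttle.

```python
import time
import psycopg2

# Delete old rows in small batches so no single transaction holds locks for long.
BATCH_SQL = """
DELETE FROM events
WHERE id IN (
    SELECT id FROM events
    WHERE created_at < now() - interval '90 days'
    LIMIT %s
)
"""

def purge_old_events(dsn: str, batch_size: int = 500, pause_s: float = 0.2) -> None:
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            # Escape hatch: give up instead of queueing behind someone else's lock.
            cur.execute("SET lock_timeout = '2s'")
        conn.commit()
        while True:
            with conn.cursor() as cur:
                cur.execute(BATCH_SQL, (batch_size,))
                deleted = cur.rowcount
            conn.commit()          # short transaction: locks released right here
            if deleted < batch_size:
                break              # nothing much left to purge
            time.sleep(pause_s)    # throttle: leave room for everyone else
    finally:
        conn.close()
```

If the lock_timeout ever fires, the batch fails fast and you retry later, instead of sitting at the front of a lock queue while everything else piles up behind you.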