👋 Hey {{first_name|there}},
Why this matters / where it hurts
You’ve probably seen this one. The API is fine most of the day, then it starts timing out in bursts. CPU is not crazy. The database looks healthy enough. Yet p95 climbs and the on-call channel gets that familiar tone. Someone suggests adding retries. Someone else suggests bumping the timeout. You do it because you need the fire to go out.
Last week, we talked about pagination melting down on deep pages.
https://www.techarchitectinsights.com/p/pagination-that-doesn-t-die-at-page-2000
That pain looks like “the database is slow,” but it’s really wasted work. Locking incidents feel similar. The database is not always busy. Sometimes it’s just waiting behind one long transaction that blocks everything else.
I used to treat locks like a deep database topic. Something I would understand later, when I had time. Then we hit a week where “sometimes it hangs” became a daily incident. It was not random. It was waiting. A few long transactions held locks, and everything behind them queued until it couldn’t.
We’ll treat locking as a first-class part of system behavior, like latency budgets and backpressure. One simple triage runbook, one recognizable example, and a few defaults that stop lock pain from surprising you.
🧭 Mindset shift
From: “The database is slow sometimes; we will tune it later.”
To: “Most ‘random’ DB slowness is waiting. Find the blocker, then reduce lock time.”
Lock incidents feel mysterious because waiting hides. Queries look idle. Threads sit. The system is not doing a lot of work. It is stuck behind a lock held by a transaction that is doing something slightly too long.
Two rules that make this manageable
Do not optimize the victim first. Find the blocker.
Reduce lock duration before you chase micro-optimizations.
A small extra rule I like, even though it feels strict
If a transaction can run for seconds, treat it like a batch job. Give it a throttle and an escape hatch.
🧰 Tool of the week: Lock Triage Runbook
Keep this one page next to your on-call docs. It works for most relational databases, even if the commands differ.
Confirm it is locked waiting
Check DB metrics for lock waits, blocked sessions, or rising queue times. If p95 rises without CPU rising, suspect waiting.
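On PostgreSQL, a quick confirmation can be a single query. A minimal sketch, assuming psycopg2 and a placeholder connection string; other engines have an equivalent blocked-sessions or lock-wait view.

```python
# Count sessions currently waiting on a lock and how long the longest has
# been stuck. Assumes PostgreSQL; "dbname=app" is a placeholder DSN.
import psycopg2

conn = psycopg2.connect("dbname=app")
cur = conn.cursor()
cur.execute("""
    SELECT count(*),
           coalesce(max(now() - query_start), interval '0')
    FROM pg_stat_activity
    WHERE wait_event_type = 'Lock'
""")
blocked, longest_wait = cur.fetchone()
print(f"blocked sessions: {blocked}, longest wait: {longest_wait}")
```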
Identify the victims
List the top blocked queries and their wait type. Capture query text, duration, and the table or index involved.
Find the blocker chain
For each victim, find the session holding the lock. If there is a chain, trace to the root blocker. Do not stop at the first hop.
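Here is one way to see victims and their direct blockers in a single query. A sketch assuming PostgreSQL 9.6+ and psycopg2; pg_blocking_pids returns only the sessions directly blocking a pid, so re-run it on the blocker if you need to walk further up the chain.

```python
# List each waiting session next to the session blocking it.
import psycopg2

conn = psycopg2.connect("dbname=app")  # placeholder DSN
cur = conn.cursor()
cur.execute("""
    SELECT waiting.pid                    AS victim_pid,
           now() - waiting.query_start    AS wait_time,
           waiting.query                  AS victim_query,
           blocking.pid                   AS blocker_pid,
           now() - blocking.xact_start    AS blocker_xact_age,
           blocking.query                 AS blocker_last_query
    FROM pg_stat_activity waiting
    JOIN pg_stat_activity blocking
      ON blocking.pid = ANY (pg_blocking_pids(waiting.pid))
    WHERE waiting.wait_event_type = 'Lock'
    ORDER BY wait_time DESC
""")
for row in cur.fetchall():
    print(row)
```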
Classify the blocker
Is it a long transaction, a migration, a bulk update, a hot row pattern, or an unexpected lock escalation? Write one sentence. This drives the fix.
Decide on the immediate action
Choose one, quickly
Let it finish if it is near completion and the business impact is tolerable
Cancel the victim if it is safe and will recover
Kill the blocker if it is safe and you must restore service
If you kill anything, record the session ID and what it was doing.
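On PostgreSQL the two levers look like this. A sketch assuming psycopg2, with placeholder session IDs taken from the blocker-chain query above.

```python
# Cancel a query or terminate a session by pid. Assumes PostgreSQL; the pids
# are placeholders from the triage queries. Record them before you act.
import psycopg2

conn = psycopg2.connect("dbname=app")  # placeholder DSN
conn.autocommit = True
cur = conn.cursor()

victim_pid = 12345    # placeholder: a blocked session you decided to cancel
blocker_pid = 12001   # placeholder: the root blocker, only if you must

# Gentler option: cancel the current query; the session itself survives.
cur.execute("SELECT pg_cancel_backend(%s)", (victim_pid,))

# Last resort: terminate the blocker's session so its locks are released.
# cur.execute("SELECT pg_terminate_backend(%s)", (blocker_pid,))
```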
Stabilize the system
Turn on guardrails. Shed load, reduce concurrency, pause backfills, and disable the feature that triggers the hot write. Do not just increase timeouts.
Reduce lock duration at the source
Common moves
Break a large update into smaller batches with commits
Move heavy work out of the transaction (see the sketch after this list)
Use optimistic patterns where possible
Add or fix an index so the update touches fewer rows and finishes faster
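For the second move, the shape matters more than the details: do the slow part with no transaction open, then write in a short one. A sketch assuming psycopg2 and a hypothetical enrich_order step standing in for whatever heavy work runs inside the transaction today.

```python
# Keep the transaction to just the write. enrich_order, orders, and
# needs_backfill are hypothetical names for this sketch.
import psycopg2

def enrich_order(row):
    # Hypothetical heavy work that must not run while locks are held.
    return {"status": "enriched"}

conn = psycopg2.connect("dbname=app")  # placeholder DSN
cur = conn.cursor()

cur.execute("SELECT id, status FROM orders WHERE needs_backfill LIMIT 1")
row = cur.fetchone()
conn.commit()                      # close the read transaction before working

if row is not None:
    result = enrich_order(row)     # slow part runs with no locks held
    cur.execute(                   # short transaction: lock, write, commit
        "UPDATE orders SET status = %s WHERE id = %s",
        (result["status"], row[0]),
    )
    conn.commit()
```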
Fix the hot spot
If a single row or a small set of rows is the hot lock, redesign that write path. Shard counters, use append-only logs, or queue writes.
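One concrete version of that redesign, as a sketch: instead of every request incrementing the same counter row, each request appends its own event and a periodic job rolls events up. The page_view_events table is an assumption for illustration.

```python
# Append-only writes avoid contending on one hot row lock. Assumes PostgreSQL
# and a hypothetical page_view_events(page_id, viewed_at) table.
import psycopg2

conn = psycopg2.connect("dbname=app")  # placeholder DSN
cur = conn.cursor()

# Hot-row pattern we are moving away from: every writer queues on one lock.
#   UPDATE page_counters SET views = views + 1 WHERE page_id = 42

# Append-only pattern: each writer inserts its own row, no shared lock.
cur.execute(
    "INSERT INTO page_view_events (page_id, viewed_at) VALUES (%s, now())",
    (42,),
)
conn.commit()

# A periodic job can aggregate page_view_events into a counters table
# outside the hot write path.
```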
Add observability for next time
Dashboard tiles for blocked sessions, lock wait time, and top blockers by query fingerprint. Alert on sustained lock waits.
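A query that could feed the top-blockers tile, as a sketch assuming PostgreSQL: group currently blocked sessions by the query holding them up. It groups by raw query text here; a real fingerprint would normalize parameters first.

```python
# Who is blocking the most sessions right now? Assumes PostgreSQL.
import psycopg2

conn = psycopg2.connect("dbname=app")  # placeholder DSN
cur = conn.cursor()
cur.execute("""
    SELECT blocking.query AS blocker_query,
           count(*)       AS blocked_sessions
    FROM pg_stat_activity waiting
    JOIN pg_stat_activity blocking
      ON blocking.pid = ANY (pg_blocking_pids(waiting.pid))
    WHERE waiting.wait_event_type = 'Lock'
    GROUP BY blocking.query
    ORDER BY blocked_sessions DESC
""")
for blocker_query, blocked in cur.fetchall():
    print(f"{blocked} blocked by: {blocker_query}")
```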
Add an escape hatch
Every operation that can hold locks should have a kill switch. Backfills, migrations, bulk jobs, and admin endpoints. Document who can use it.
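A minimal shape for the kill switch, as a sketch: the flag lives in a hypothetical ops_flags table (it could just as well be your feature-flag system), and the job checks it between units of work.

```python
# Check a kill-switch flag between batches and stop cleanly when it flips.
# ops_flags(name, enabled) is a hypothetical table; adapt to your flag system.
import psycopg2

def kill_switch_on(conn, name):
    cur = conn.cursor()
    cur.execute("SELECT enabled FROM ops_flags WHERE name = %s", (name,))
    row = cur.fetchone()
    return bool(row and row[0])

def run_job(conn, batches, name="orders_backfill"):
    for run_one_batch in batches:
        if kill_switch_on(conn, name):
            print(f"{name}: kill switch on, stopping cleanly")
            return
        run_one_batch(conn)  # each batch is one small, committed unit of work
```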
🔍 Example: Bulk backfill that deadlocks checkout
Scope
A backfill updates orders.status and touches related rows. It runs during business hours because someone wanted it done today.
Context
Checkout writes to orders and order_items. The backfill also updates orders and joins into order_items to compute a derived field. Two code paths touch the same tables in a different order.
Step by step using the runbook
Confirm lock waiting. p95 rises, CPU stays modest, blocked sessions climb.
Identify victims. Checkout requests are waiting on row locks in orders.
Find the blocker chain. Root blocker is the backfill transaction holding locks for a long time.
Classify blocker. Long transaction plus inconsistent lock order between backfill and checkout. Deadlock risk is real.
Immediate action. Pause the backfill job and let existing checkout transactions drain.
Stabilize. Reduce checkout concurrency briefly and watch lock waits fall.
Reduce lock duration. Change the backfill to process batches of 500 rows, commit each batch, and add a small sleep when lock waits rise (see the sketch after this list).
Fix the hot spot. Ensure both checkout and backfill touch tables in the same order. Add an index to make the backfill join selective.
Observability. Add a panel for top blockers and a metric for backfill batch duration.
Escape hatch. Add a kill switch for the backfill worker and require a feature flag for production runs.
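Putting the batching and throttle pieces together for this example, as a sketch assuming PostgreSQL, psycopg2, and a hypothetical needs_backfill flag on orders: batches of 500, a commit per batch, rows processed in a consistent primary-key order, and a short sleep whenever other sessions start waiting on locks.

```python
# Reworked backfill: small committed batches, consistent row ordering, and a
# throttle driven by the blocked-sessions check from the runbook. The
# needs_backfill column and the DSN are assumptions for this sketch.
import time
import psycopg2

BATCH_SIZE = 500
BACKOFF_SECONDS = 2

def blocked_sessions(cur):
    cur.execute(
        "SELECT count(*) FROM pg_stat_activity WHERE wait_event_type = 'Lock'"
    )
    return cur.fetchone()[0]

conn = psycopg2.connect("dbname=app")
cur = conn.cursor()

while True:
    cur.execute("""
        UPDATE orders
           SET status = 'backfilled', needs_backfill = false
         WHERE id IN (
               SELECT id FROM orders
                WHERE needs_backfill
                ORDER BY id          -- touch rows in a consistent order
                LIMIT %s)
    """, (BATCH_SIZE,))
    conn.commit()                    # release row locks after every batch
    if cur.rowcount == 0:
        break                        # nothing left to backfill
    if blocked_sessions(cur) > 0:
        time.sleep(BACKOFF_SECONDS)  # back off while others are waiting
```

The same loop is where the kill-switch check from the runbook would go, right before each batch.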
What success looks like
Checkout p95 returns to normal. Lock waits drop to near zero. The backfill completes more slowly, but predictably. No one has to restart the database.
Small confession
Sometimes I stop a backfill even when it is not the root cause, just to shrink the problem space. It is not elegant. It buys clarity.
✅ Do this / avoid this
Do
Look for waiting signals before tuning queries
Find the root blocker and shorten its transaction
Batch large writes with frequent commits
Keep a consistent lock order across code paths
Give backfills and migrations throttles and kill switches
Avoid
Increasing timeouts as the first response
Retrying blindly while locks are held
Running large migrations at peak traffic without safeguards
Holding transactions open while doing network calls or heavy computation
Assuming deadlocks are rare and therefore ignorable
🧪 Mini challenge
Goal: make one lock-heavy operation reversible today.
Pick one operation that writes many rows. Backfill, admin job, or migration step.
Measure how long its transaction stays open right now.
Change it to run in batches with a commit per batch.
Add a simple throttle: reduce batch size when lock waits rise.
Add a kill switch and document who can use it.
Add one dashboard tile: blocked sessions or lock wait time.
Hit reply and tell me what you changed and what surprised you.
🎯 Action step for this week
Inventory the top three write-heavy jobs that can run in production
Add a standard wrapper with batching, a throttle, and a kill switch to each
Standardize lock order for the tables touched by checkout, billing, and user updates
Add a lock health section to your main dashboard: blocked sessions, lock wait time, and top blockers
Create a policy: no bulk write jobs during peak hours without a feature flag and an on-call owner
By the end of this week, aim to have one backfill path and one migration path that are explicitly safe to stop, throttle, and resume.
👋 Wrapping up
Most database slowness during incidents is waiting, not CPU.
Find the blocker, then shorten the transaction.
Batch large writes and keep lock order consistent.
Add throttles and kill switches before you need them.
⭐ Most read issues (good place to start)
If you’re new here, these are the five issues readers keep coming back to:
Thanks for reading.
See you next week,
Bogdan Colța
Tech Architect Insights