👋 Hey {{first_name|there}},

Why this matters / where it hurts

You’ve probably seen this one. The API is fine most of the day, then it starts timing out in bursts. CPU is not crazy. The database looks healthy enough. Yet p95 climbs and the on-call channel gets that familiar tone. Someone suggests adding retries. Someone else suggests bumping the timeout. You do both, because you need the fire to die down.

Last week, we talked about pagination melting down on deep pages.
https://www.techarchitectinsights.com/p/pagination-that-doesn-t-die-at-page-2000

That pain looks like “the database is slow,” but it’s really wasted work. Locking incidents feel similar. The database is not always busy. Sometimes it’s just waiting behind one long transaction that blocks everything else.

I used to treat locks like a deep database topic. Something I would understand later, when I had time. Then we hit a week where “sometimes it hangs” became a daily incident. It was not random. It was waiting. A few long transactions held locks, and everything behind them queued until it couldn’t.

We’ll treat locking as a first-class part of system behavior, like latency budgets and backpressure. One simple triage runbook, one recognizable example, and a few defaults that stop lock pain from surprising you.

🧭 Mindset shift

From: “The database is slow sometimes; we will tune it later.”
To: “Most ‘random’ DB slowness is waiting. Find the blocker, then reduce lock time.”

Lock incidents feel mysterious because waiting hides. Queries look idle. Threads sit. The system is not doing a lot of work. It is stuck behind a lock held by a transaction that is doing something slightly too long.

Two rules that make this manageable

  • Do not optimize the victim first. Find the blocker.

  • Reduce lock duration before you chase micro-optimizations.

A small extra rule I like, even though it feels strict

  • If a transaction can run for seconds, treat it like a batch job. Give it a throttle and an escape hatch.

🧰 Tool of the week: Lock Triage Runbook

Keep this one page next to your on-call docs. It works for most relational databases, even if the commands differ.

  1. Confirm it is locked waiting
    Check DB metrics for lock waits, blocked sessions, or rising queue times. If p95 rises without CPU rising, suspect waiting.

  2. Identify the victims
    List the top blocked queries and their wait type. Capture query text, duration, and the table or index involved.

  3. Find the blocker chain
    For each victim, find the session holding the lock. If there is a chain, trace to the root blocker. Do not stop at the first hop. (A query sketch for steps 1 through 3 follows this runbook.)

  4. Classify the blocker
    Is it a long transaction, a migration, a bulk update, a hot row pattern, or an unexpected lock escalation? Write one sentence. This drives the fix.

  5. Decide on the immediate action
    Choose one, quickly

    • Let it finish if it is near completion, and the business impact is tolerable

    • Cancel the victim if it is safe and will recover

    • Kill the blocker if it is safe, and you must restore service
      If you kill anything, record the session ID and what it was doing.

  6. Stabilize the system
    Turn on guardrails. Shed load, reduce concurrency, pause backfills, and disable the feature that triggers the hot write. Do not just increase timeouts.

  7. Reduce lock duration at the source
    Common moves

    • Break a large update into smaller batches with commits

    • Move heavy work out of the transaction

    • Use optimistic patterns where possible

    • Add or fix an index so the update touches fewer rows and finishes faster

  8. Fix the hot spot
    If a single row or a small set of rows is the hot lock, redesign that write path. Shard counters, use append-only logs, or queue writes.

  9. Add observability for next time
    Dashboard tiles for blocked sessions, lock wait time, and top blockers by query fingerprint. Alert on sustained lock waits.

  10. Add an escape hatch
    Every operation that can hold locks should have a kill switch. Backfills, migrations, bulk jobs, and admin endpoints. Document who can use it.
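
If you run Postgres, steps 1 through 3 collapse into one small script. This is a minimal sketch, assuming Postgres and the psycopg2 driver; pg_stat_activity and pg_blocking_pids() do the heavy lifting, and the connection string is a placeholder. Other databases have their own equivalents (sys.dm_exec_requests on SQL Server, performance_schema on MySQL), but the shape of the triage is the same.

```python
# Minimal lock triage sketch, assuming Postgres and psycopg2.
# Lists blocked sessions, then walks pg_blocking_pids() to the root blocker(s).
import psycopg2

DSN = "dbname=app host=db.internal"  # placeholder, adjust for your environment

VICTIMS_SQL = """
SELECT pid,
       now() - query_start AS waiting_for,
       wait_event_type,
       wait_event,
       left(query, 120) AS query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock'
ORDER BY waiting_for DESC;
"""

BLOCKERS_SQL = "SELECT pg_blocking_pids(%s);"

def root_blockers(cur, pid, seen=None):
    """Follow the blocking chain for one pid until we reach sessions that nothing blocks."""
    seen = seen or set()
    cur.execute(BLOCKERS_SQL, (pid,))
    blockers = cur.fetchone()[0]
    if not blockers:
        return {pid}          # nothing blocks this pid, so it is a root
    roots = set()
    for b in blockers:
        if b not in seen:     # the seen set guards against deadlock cycles
            seen.add(b)
            roots |= root_blockers(cur, b, seen)
    return roots

with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
    cur.execute(VICTIMS_SQL)
    for pid, waiting_for, wtype, wevent, query in cur.fetchall():
        roots = root_blockers(cur, pid)
        print(f"victim pid={pid} waiting={waiting_for} on {wtype}/{wevent}")
        print(f"  query: {query}")
        print(f"  root blocker pid(s): {sorted(roots)}")
```

Record the root blocker’s pid and query before you cancel anything; on Postgres, pg_cancel_backend and pg_terminate_backend are the usual escalation path once you decide to act.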

🔍 Example: Bulk backfill that deadlocks checkout

Scope
A backfill updates orders.status and touches related rows. It runs during business hours because someone wanted it done today.

Context
Checkout writes to orders and order_items. The backfill also updates orders and joins into order_items to compute a derived field. Two code paths touch the same tables in a different order.

Step by step using the runbook

  • Confirm lock waiting. p95 rises, CPU stays modest, blocked sessions climb.

  • Identify victims. Checkout requests are waiting on row locks in orders.

  • Find the blocker chain. Root blocker is the backfill transaction holding locks for a long time.

  • Classify blocker. Long transaction plus inconsistent lock order between backfill and checkout. Deadlock risk is real.

  • Immediate action. Pause the backfill job and let existing checkout transactions drain.

  • Stabilize. Reduce checkout concurrency briefly and watch lock waits fall.

  • Reduce lock duration. Change the backfill to process batches of 500 rows, commit each batch, and add a small sleep when lock waits rise (sketched below, after this list).

  • Fix the hot spot. Ensure both checkout and backfill touch tables in the same order. Add an index to make the backfill join selective.

  • Observability. Add a panel for top blockers and a metric for backfill batch duration.

  • Escape hatch. Add a kill switch for the backfill worker and require a feature flag for production runs.
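
Here is roughly what the reworked backfill looks like. A minimal sketch, assuming Postgres and psycopg2; the job_flags table, the status_backfilled column, and the flag name are illustrative, not part of the original system.

```python
# Batched backfill sketch: small batches, a commit per batch, a throttle
# based on lock waits, and a kill switch checked between batches.
# Assumes Postgres + psycopg2; job_flags and status_backfilled are illustrative names.
import time
import psycopg2

DSN = "dbname=app host=db.internal"   # placeholder
BATCH_SIZE = 500
SLEEP_BASE = 0.2                      # seconds between batches
SLEEP_WHEN_CONTENDED = 2.0            # back off harder when others wait on locks

def job_enabled(cur):
    # Kill switch: a hypothetical flag row; could be a feature flag service or an env var.
    cur.execute("SELECT enabled FROM job_flags WHERE name = 'orders_status_backfill';")
    row = cur.fetchone()
    return bool(row and row[0])

def lock_waits(cur):
    cur.execute("SELECT count(*) FROM pg_stat_activity WHERE wait_event_type = 'Lock';")
    return cur.fetchone()[0]

def run_backfill():
    conn = psycopg2.connect(DSN)
    with conn, conn.cursor() as cur:
        while True:
            if not job_enabled(cur):
                print("kill switch engaged, stopping cleanly")
                break
            # Claim one small batch, ordered by primary key so the lock order
            # stays predictable, and skip rows checkout currently holds.
            cur.execute("""
                WITH batch AS (
                    SELECT id FROM orders
                    WHERE status_backfilled = false
                    ORDER BY id
                    LIMIT %s
                    FOR UPDATE SKIP LOCKED
                )
                UPDATE orders o
                SET status = 'migrated', status_backfilled = true
                FROM batch WHERE o.id = batch.id;
            """, (BATCH_SIZE,))
            updated = cur.rowcount
            conn.commit()              # release row locks after every batch
            if updated == 0:
                break                  # nothing left to backfill
            contended = lock_waits(cur) > 0
            conn.commit()              # don't hold even a read txn open while sleeping
            time.sleep(SLEEP_WHEN_CONTENDED if contended else SLEEP_BASE)

if __name__ == "__main__":
    run_backfill()
```

SKIP LOCKED keeps the backfill from queueing behind checkout’s row locks, and the per-batch commit keeps each lock window to roughly one batch of work.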

What success looks like
Checkout p95 returns to normal. Lock waits drop to near zero. The backfill completes more slowly, but predictably. No one has to restart the database.

Small confession
Sometimes I stop a backfill even when it is not the root cause, just to shrink the problem space. It is not elegant. It buys clarity.

Do this / avoid this

Do

  • Look for waiting signals before tuning queries

  • Find the root blocker and shorten its transaction

  • Batch large writes with frequent commits

  • Keep a consistent lock order across code paths (see the sketch after this section)

  • Give backfills and migrations throttles and kill switches

Avoid

  • Increasing timeouts as the first response

  • Retrying blindly while locks are held

  • Running large migrations at peak traffic without safeguards

  • Holding transactions open while doing network calls or heavy computation

  • Assuming deadlocks are rare and therefore ignorable
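
Consistent lock order is easier to enforce when it lives in one helper that every write path calls. A minimal sketch, assuming Postgres and psycopg2, using the orders and order_items tables from the example; the helper and parameter names are illustrative.

```python
# Lock-order convention sketch: parent rows before child rows, ascending ids.
# Assumes Postgres + psycopg2; both checkout and the backfill should follow this order.
import psycopg2

DSN = "dbname=app host=db.internal"  # placeholder

def touch_order(cur, order_id, new_status):
    # 1. Lock the parent row first.
    cur.execute("SELECT id FROM orders WHERE id = %s FOR UPDATE;", (order_id,))
    # 2. Then lock the child rows, in a deterministic order.
    cur.execute(
        "SELECT id FROM order_items WHERE order_id = %s ORDER BY id FOR UPDATE;",
        (order_id,),
    )
    # 3. Only now do the writes. Another path that follows the same order can
    #    block briefly behind this one, but the two cannot deadlock.
    cur.execute("UPDATE orders SET status = %s WHERE id = %s;", (new_status, order_id))
    # ... any order_items writes happen here, after the locks above.

with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
    touch_order(cur, order_id=42, new_status="confirmed")
```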

🧪 Mini challenge

Goal: make one lock-heavy operation reversible today.

  • Pick one operation that writes many rows. Backfill, admin job, or migration step.

  • Measure how long its transaction stays open right now (a query sketch follows this list).

  • Change it to run in batches with a commit per batch.

  • Add a simple throttle: reduce batch size when lock waits rise.

  • Add a kill switch and document who can use it.

  • Add one dashboard tile: blocked sessions or lock wait time.
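
For the “measure how long its transaction stays open” step, one query usually answers it. A minimal sketch, assuming Postgres and psycopg2; xact_start in pg_stat_activity marks when each open transaction began.

```python
# How long are transactions staying open right now? Assumes Postgres + psycopg2.
import psycopg2

DSN = "dbname=app host=db.internal"  # placeholder

SQL = """
SELECT pid,
       now() - xact_start AS open_for,
       state,
       left(query, 120) AS last_query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY open_for DESC
LIMIT 10;
"""

with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
    cur.execute(SQL)
    for pid, open_for, state, last_query in cur.fetchall():
        print(f"pid={pid} open_for={open_for} state={state} query={last_query}")
```

Run it before and after the batching change; the entry for your job should drop from the full job runtime to roughly one batch.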

Hit reply and tell me what you changed and what surprised you.

🎯 Action step for this week

  • Inventory the top three write-heavy jobs that can run in production

  • Add a standard wrapper to each: batching, a throttle, and a kill switch

  • Standardize lock order for the tables touched by checkout, billing, and user updates

  • Add a lock health section to your main dashboard: blocked sessions, lock wait time, and top blockers (a metric sketch follows below)

  • Create a policy: no bulk write jobs during peak hours without a feature flag and an on-call owner

By the end of this week, aim to have one backfill path and one migration path that are explicitly safe to stop, throttle, and resume.
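
For the lock health tiles, two numbers cover most of it: how many sessions are blocked right now, and how long the worst one has been waiting. A minimal sketch, assuming Postgres and psycopg2; emit_gauge is a placeholder for whatever metrics client you already run.

```python
# Lock health metrics sketch: blocked session count and worst lock wait.
# Assumes Postgres + psycopg2; emit_gauge is a placeholder for your metrics client.
import psycopg2

DSN = "dbname=app host=db.internal"  # placeholder

SQL = """
SELECT count(*) AS blocked_sessions,
       coalesce(max(now() - query_start), interval '0') AS worst_lock_wait
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';
"""

def emit_gauge(name, value):
    # Placeholder: swap in statsd, Prometheus, CloudWatch, or whatever you use.
    print(f"{name}={value}")

def collect_lock_health():
    with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
        cur.execute(SQL)
        blocked, worst_wait = cur.fetchone()
        emit_gauge("db.blocked_sessions", blocked)
        emit_gauge("db.worst_lock_wait_seconds", worst_wait.total_seconds())

if __name__ == "__main__":
    collect_lock_health()
```

Alert when blocked sessions stay above zero for more than a minute or two; that is the “sustained lock waits” signal from the runbook.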

👋 Wrapping up

  • Most database slowness during incidents is waiting, not CPU.

  • Find the blocker, then shorten the transaction.

  • Batch large writes and keep lock order consistent.

  • Add throttles and kill switches before you need them.

⭐ Most read issues (good place to start)

If you’re new here, these are the five issues readers keep coming back to:

Thanks for reading.

See you next week,
Bogdan Colța
Tech Architect Insights
