Queries and Indexes With Predictable Latency

👋 Hey {{first_name|there}},

Why this matters / where it hurts

You add a filter and a sort to a “simple” list endpoint. Traffic is fine at first. Then, end users hit it in waves, and p95 jumps from 120 ms to 2.8 s. The ORM looks clean. The plan is not. A missing composite index forces a scan and an ugly sort. Caches hide it until they do not.

I have shipped that kind of change. It looked harmless. Later, we found slow dashboards, a surprise CPU bill, and a nightly batch that never finished. The query was not evil. It was unplanned.

Treat queries and indexes as part of the design. Tie them to an SLO. Check the plan before you merge. Give yourself an escape hatch if the plan flips in production.

🧭 Mindset shift

From: “Write the query, fix slow ones later.”
To: “Design the access path up front. Guard the plan and the SLO.”

Why it matters
Latency is mostly the access path. If the database can reach rows by an index that matches your predicates and sort, you win. If it must scan and sort, you pay. Plans also change over time due to stats and data skew. Treat plans as contracts you observe and, when needed, pin or guide.

Two keys

Design to the SLO and the user path, not to a vague “fast enough.”
Index for how you filter and order, then verify with an explain plan on production-like data.

🧰 Tool of the week: Query and Index Decision Sheet

Keep this one page per hot route or report. Fill it before you merge. Update it when the access path changes.

Scope and SLO
Route or job, expected QPS, p95 target, and max rows returned. Write a number.
Query shape
Predicates, joins, order by, and limit. Specify pagination model. Offset or keyset.
Data shape
Tables, row counts, cardinality on key columns. Note partitions or shards.
Index plan
List the exact index you need. For composite indexes, order columns by selectivity and by how you filter and sort. Add include columns for covering reads where available.
Explain and validate
Run EXPLAIN (ANALYZE, BUFFERS) (on production-like data. Confirm the index is used, join type is acceptable, estimated rows look sane, and no big sort or hash spills appear.
Write the overhead budget
Every index costs writes. State the acceptable penalty for inserts or updates. If the cost exceeds the budget, pick a narrower or partial index or move work to a materialized view.
Pagination strategy
Prefer keyset pagination for deep pages. If you must use offset, cap it and add a next-token path later.
Parameter rules
Use prepared statements. Be explicit about parameters that cause plan flips. If you pin a plan or add a hint, document why and the exit criteria.
Observability
Add tiles for p95, rows read vs rows returned, temp file usage or sort spill, cache hit ratio, and slow log counts. Alert on plan regressions.
Rollback and fallback
Feature flag the expensive path. Provide a coarse filter or cached summary if the plan degrades. Write the command to drop or revert an index safely.
Owner and cadence
Name the owner. Review this sheet quarterly or when row counts double.

🔍 Example: Product listing with filters and price sort

Scope
GET /products with category and availability filters, sort by price ascending, 50 per page. Target p95 250 ms at 200 RPS.

Context
Products table with 5M rows. Queries filter by category_id and in_stock = true. Sorts by price. Traffic is bursty around sales.

Using the sheet

Query shape. WHERE category_id = ? AND in_stock = true ORDER BY price LIMIT 50
Data shape. Category cardinality is high. in_stock mostly true but varies by event.
Index plan. The composite index (category_id, in_stock, price) includes small columns used in the list card.
Explain and validate. Confirm index range scan on category_id, in_stock and order by satisfied without extra sort. Estimated rows align with reality.
Write the overhead budget. Accept +8 percent on updates to price and stock during sales. If higher, consider a partial index on in_stock = true.
Pagination. Keyset using (category_id, in_stock, price, id) with a last_seen tuple.
Parameter rules. Prepared statements. Monitor for category hot spots that may skew the plan.
Observability. Tiles for p95, rows read vs returned, temp sort usage, and slow logs.
Rollback. Flag to switch to cached “popular items” card if p95 exceeds 400 ms for 10 minutes.
Owner and cadence. Merch team owns it. Review before major sales and when the product count grows by 50 percent.

Success
No external sort. No temp file spikes. Rows read stay close to rows returned. p95 holds under 250 ms during sale traffic.

Small confession
If the product card bloats, I sometimes drop “covering” for a narrower index and accept one bookmark lookup. That is a tradeoff. We keep the SLO.

✅ Do this / avoid this

Start from access patterns and SLOs.
Create composite indexes that match filters first, then sort.
Use keyset pagination for deep pages.
Explain plans on production-like data and store a baseline.
Watch rows read vs returned and temp sort usage.

Avoid

SELECT * on hot paths.
Leading wildcards or functions on indexed columns that block index use.
Deep offset pagination that forces the engine to walk past large ranges.
Blindly adding indexes without a write cost budget.
Trusting the ORM to pick efficient queries on its own.

🧪 Mini challenge

Goal: make one high-traffic query predictable today.

Pick one route with user-visible latency. Write its p95 target and QPS.
Capture the exact SQL the ORM emits. Paste it in your sheet.
Design one composite index for the filters and sort.
Run EXPLAIN on a production-like dataset. Confirm the index path and no external sort.
Switch to keyset pagination if you page deep. Keep offset capped for now.
Add two tiles. Rows read vs returned and temp sort usage. Share the screenshot.

Timebox 40 minutes. Good enough beats perfect.

🎯 Action step for this week

Inventory the top five queries by traffic and user impact.
Create a Query and Index Decision Sheet for each.
Add EXPLAIN checks to CI with a guardrail that fails when a key plan regresses.
Define a write overhead budget and a rollback path for each new index.
Put a “query health” section on your main dashboard. p95, reads vs returned, slow logs.

By the end of the week, aim to have two hot routes using composite indexes with verified plans and a visible baseline.

👋 Wrapping up

Design for the SLO. The access path is the design.
Index how you filter and order. Cover if it helps the SLO.
Prefer keyset over deep offsets. Keep rows read close to rows returned.
Verify with explanations and dashboards. Do not trust vibes.

If you liked this, you will probably enjoy my free 5-day email course, “From Developer to Architect.”
Five short lessons on mindset, tradeoffs, and communication you can use this week.
https://www.techarchitectinsights.com/from-dev-to-architect-5-day-email-crash-course

I would love your input. What database topic or tool would help you most right now
Hit reply and tell me.

Thanks for reading.

See you next week,
Bogdan Colța
Tech Architect Insights