👋 Hey {{first_name|there}},

Why this matters / where it hurts

You know this conversation. Support calls. "The order exists, but Search can't find it." Product frowns. "It looks broken." Engineering shrugs. "Give it a minute."

Nobody likes that answer.

I used to call this a bug. Something technical broke. The broker dropped a message. The consumer lagged behind. The index went stale. So we patched it: retries here, logs there, and a manual reindex button for emergencies. It worked until the next fire.

Here's the real problem: we're forcing ACID behavior across networks that weren't built for it. Some teams grab for 2PC. Others want "one big transaction." The system fights back with locks that freeze, latency that climbs, and coupling so brittle it shatters during deployments. Other teams accept eventual consistency but implement it carelessly, losing events when load spikes or brokers fail, and suddenly you're drowning in the worst of both worlds: stale reads everywhere, and missing facts you'll never recover.

This lesson offers calm. Not perfection. Practicality. We'll treat consistency as an architecture choice you make deliberately for each user flow. Then we'll implement eventual consistency the right way with the Outbox Pattern, guaranteeing that an event is never "maybe published," not even when your broker decides to take an unscheduled vacation.

🧭 Mindset shift

From: "Data mismatch is a bug. Make systems agree instantly."
To: "Consistency is a product and architecture choice. Make propagation reliable."

Want Search and Orders to agree instantly? You're asking for distributed transactions. That means coordinating locks across services or emulating coordination with synchronous calls that pretend to be atomic. Latency rises. Failures amplify. Deployments slow to a crawl.

Eventual consistency isn't surrender. It's a tradeoff. You choose it. The trick is making the propagation path so reliable that "eventual" never becomes "sometimes never."

Two rules that keep you out of trouble

Decide the consistency level per flow. Design the UX and API contract around it.

Publish events transactionally with the write. Otherwise, you'll lose them at the worst possible moment.

🧰 Tool of the week: Outbox Implementation Sheet

Keep this single page beside any service publishing domain events.

Scope the flow and consistency goal

Name the writer who creates truth. Name the readers who can lag. Example: "OrderCreated drives Search indexing within 30 seconds."

Define the domain event contract

Event name. Version. Required fields. Idempotency key. Include an aggregate ID and a monotonically increasing version if your system supports it.
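A minimal sketch of such a contract in Python. The `OrderCreated` shape and field names here are illustrative, not a standard; the point is that the event carries its own idempotency key (`event_id`), its aggregate ID, and a version.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class OrderCreated:
    """Versioned domain event; event_id doubles as the idempotency key."""
    order_id: str                 # aggregate ID
    customer_id: str
    aggregate_version: int        # monotonically increasing per order, if supported
    event_type: str = "OrderCreated"
    schema_version: int = 1
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    occurred_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
```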

Write to the Outbox in the same DB transaction

When you commit the business change, insert an outbox row in the same transaction, same atomic operation, same fate. If the transaction commits, the event exists. If it rolls back, the event doesn't.
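Here's what "same fate" looks like in code, a sketch with SQLite standing in for your relational database (the table and function names are illustrative). Both inserts live inside one transaction, so there is no window where the order exists but the event doesn't.

```python
import json
import sqlite3
import uuid

def create_order_with_outbox(conn: sqlite3.Connection,
                             order_id: str, customer_id: str) -> None:
    """Insert the business row and its outbox row in ONE transaction:
    if the commit succeeds the event exists; if it rolls back, it doesn't."""
    with conn:  # sqlite3's context manager commits on success, rolls back on error
        conn.execute(
            "INSERT INTO orders (order_id, customer_id) VALUES (?, ?)",
            (order_id, customer_id),
        )
        conn.execute(
            "INSERT INTO outbox_events (event_id, aggregate_id, event_type, payload, status)"
            " VALUES (?, ?, 'OrderCreated', ?, 'pending')",
            (str(uuid.uuid4()), order_id,
             json.dumps({"order_id": order_id, "customer_id": customer_id})),
        )

# Minimal in-memory schema just for the sketch
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, customer_id TEXT)")
conn.execute("CREATE TABLE outbox_events (event_id TEXT PRIMARY KEY, aggregate_id TEXT,"
             " event_type TEXT, payload TEXT, status TEXT)")
create_order_with_outbox(conn, "o-1001", "c-42")
```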

Outbox table schema defaults

Minimum columns: event_id, aggregate_id, event_type, payload, occurred_at, status, attempt_count, next_attempt_at. Add schema_version and trace_id if your observability stack demands them.
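The column list above translates into DDL like this. It's SQLite-flavored for the sketch; swap in your database's types (UUID, JSONB, TIMESTAMPTZ on Postgres, for instance). The partial index on pending rows is what keeps the relay's poll query cheap.

```python
import sqlite3

# Illustrative schema matching the column list above; adjust types per database.
OUTBOX_DDL = """
CREATE TABLE IF NOT EXISTS outbox_events (
    event_id        TEXT    PRIMARY KEY,              -- idempotency key
    aggregate_id    TEXT    NOT NULL,
    event_type      TEXT    NOT NULL,
    payload         TEXT    NOT NULL,                 -- serialized event body
    occurred_at     TEXT    NOT NULL,
    status          TEXT    NOT NULL DEFAULT 'pending',  -- pending | sent | quarantined
    attempt_count   INTEGER NOT NULL DEFAULT 0,
    next_attempt_at TEXT,
    schema_version  INTEGER NOT NULL DEFAULT 1,       -- optional
    trace_id        TEXT                              -- optional
);
CREATE INDEX IF NOT EXISTS idx_outbox_pending
    ON outbox_events (status, next_attempt_at);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(OUTBOX_DDL)
```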

Relay worker publishes from Outbox

A background worker reads pending rows. It publishes to the broker. It marks rows as sent only after the broker acknowledges receipt. If the broker is down, rows stay pending, patient, persistent, waiting.
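One relay pass might look like this sketch. `publish` is a stand-in for whatever broker client you use (assumed to raise on failure); the only rule that matters is that a row becomes `sent` strictly after the broker acknowledges, never before.

```python
import sqlite3

def relay_once(conn: sqlite3.Connection, publish) -> int:
    """One relay pass: publish pending rows oldest-first, marking a row
    'sent' only after `publish` returns (i.e. the broker acknowledged).
    On failure the row simply stays pending for the next pass."""
    rows = conn.execute(
        "SELECT event_id, event_type, payload FROM outbox_events"
        " WHERE status = 'pending' ORDER BY occurred_at LIMIT 100"
    ).fetchall()
    sent = 0
    for event_id, event_type, payload in rows:
        try:
            publish(event_type, payload)   # broker client; raises on failure
        except Exception:
            continue                       # left pending — retried on next pass
        conn.execute("UPDATE outbox_events SET status = 'sent'"
                     " WHERE event_id = ?", (event_id,))
        conn.commit()
        sent += 1
    return sent

# Demo with a broker stub that rejects one of two events
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE outbox_events (event_id TEXT PRIMARY KEY,"
             " event_type TEXT, payload TEXT, occurred_at TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO outbox_events VALUES (?, 'OrderCreated', ?, ?, 'pending')",
    [("e1", '{"n":1}', "t1"), ("e2", '{"n":2}', "t2")],
)

def flaky_publish(event_type, payload):
    if '"n":2' in payload:
        raise ConnectionError("broker unavailable")

published = relay_once(conn, flaky_publish)
```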

Retry policy with backoff and jitter

Retries live in the relay. Use exponential backoff with jitter to avoid thundering herds. Cap attempts. Route poison events to a quarantine table or dead-letter topic, and trigger an alert so someone with coffee and context can investigate.
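The backoff itself is a few lines. This sketch uses the "full jitter" variant: the delay is drawn uniformly from zero up to the exponential cap, so a fleet of relays retrying after an outage doesn't hammer the broker in lockstep. The constants are illustrative defaults, not recommendations.

```python
import random

MAX_ATTEMPTS = 8  # past this, quarantine the event and page a human

def next_delay_seconds(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full-jitter exponential backoff: a random delay in
    [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```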

Consumer idempotency rule

Consumers must be idempotent. Per event_id. Or per aggregate_id plus version. Store processed IDs in a tracking table, or design your updates to be safe when reapplied, upserts instead of inserts, calculations that tolerate duplicates.
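A consumer-side sketch combining both techniques, again with SQLite standing in and hypothetical table names: claim the `event_id` in a tracking table first (the primary key rejects duplicates), then upsert into the read model so even a missed dedupe can't corrupt it.

```python
import sqlite3

def handle_order_created(conn: sqlite3.Connection, event_id: str, order: dict) -> bool:
    """Process each event at most once. Returns False for a duplicate delivery."""
    try:
        # Claim the event_id; the PRIMARY KEY rejects a second delivery.
        conn.execute("INSERT INTO processed_events (event_id) VALUES (?)", (event_id,))
    except sqlite3.IntegrityError:
        return False  # duplicate — already handled, do nothing
    # Upsert, not insert: safe to reapply even if dedupe ever misses.
    conn.execute(
        "INSERT INTO search_index (order_id, summary) VALUES (?, ?)"
        " ON CONFLICT(order_id) DO UPDATE SET summary = excluded.summary",
        (order["order_id"], order["summary"]),
    )
    conn.commit()
    return True

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE search_index (order_id TEXT PRIMARY KEY, summary TEXT)")
event = {"order_id": "o-1001", "summary": "2 items"}
first = handle_order_created(conn, "e-abc", event)
second = handle_order_created(conn, "e-abc", event)  # redelivered by the broker
```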

Ordering and partitioning decision

Does order matter? Partition by aggregate_id. Preserve ordering within that partition. If order doesn't matter, state it explicitly in your documentation so future engineers don't waste days chasing phantom race conditions.

Observability and alerts

Dashboards need these metrics: outbox depth, oldest pending age, publish error rate, retry attempts, consumer lag, and dedupe hits. Alert on "oldest pending age" when it crosses your consistency goal; that's when users start noticing something feels off.

Escape hatch

Build a safe replay mechanism. Re-publish from the outbox by time range or aggregate ID. Document who can run it. Explain how you prevent duplicate side effects. Test it before the incident, not during.

🔍 Example: Orders vs Search mismatch

Scope

Order service writes orders. Search service indexes order summaries for support and internal tooling. Users expect orders to appear in Search within 30 seconds. Instant isn't required.

Context/architecture

Order service uses a relational database. Kafka carries events (any broker works). Search service consumes and writes to its index store. Broker outages happen occasionally, maybe monthly, maybe weekly. Deploys happen every week.

Step-by-step using the sheet

  • Define the goal: "Search shows new orders within 30 seconds, 99 percent of the time."

  • Event contract: OrderCreated v1. Include event_id, order_id, customer_id, created_at, and the minimal fields Search needs for its job. The idempotency key is event_id.

  • Transactional write: In the same transaction that inserts the orders row, insert an outbox_events row for OrderCreated, one atomic unit, one destiny.

  • Relay worker: Poll pending rows every second. Publish each one. Mark it sent only after the broker confirms. If the publish fails, leave it pending. Try again.

  • Retry: Exponential backoff with jitter. If the broker goes dark for 10 minutes, outbox depth grows like a patient queue, but nothing vanishes into the void.

  • Consumer idempotency: Search stores processed event_id values in a tracking table, or it uses an "upsert by order_id and version" strategy that makes duplicate deliveries harmless.

  • Ordering: Partition events by order_id if future updates must apply in sequence, because out-of-order updates to the same aggregate can create bizarre states that confuse users and make debugging feel like archaeology.

  • Observability: Watch "oldest pending outbox age" like a hawk. If it crosses 30 seconds, you know Search looks stale to users right now. That's an operational signal you can act on, not a mystery you have to divine from scattered complaints and angry Slack messages.

  • Escape hatch: A replay endpoint that republishes outbox rows for the last hour. Use it when a consumer bug gets fixed, and you need to reprocess clean data. Document the runbook. Rehearse it quarterly.

What success looks like

The broker hiccups. Orders keep accepting writes. Search goes stale briefly, 30 seconds, maybe a minute. Then it catches up. No manual "resend event" scripts typed frantically in production. No missing orders forever. No midnight postmortems about data that evaporated.

Small confession

When teams first add an outbox, they forget to alert on "oldest pending age." They watch depth instead. Depth is noisy. Age tells you what users actually experience.

Do this / avoid this

Do

  • Choose consistency per flow. Write down the expectation in seconds or minutes.

  • Insert the outbox row in the same DB transaction as the business write.

  • Publish asynchronously. Use a relay that retries with backoff and jitter.

  • Make consumers idempotent. Make them safe under duplicates.

  • Alert on the oldest pending outbox age. Not only queue depth.

Avoid

  • Trying to force ACID across services. 2PC-style coordination for core flows kills velocity.

  • Publishing events "after commit" in a best-effort way from the request thread.

  • Treating event loss as acceptable. "We can reindex later" is a lie you tell yourself before the incident.

  • Writing consumers that break on duplicates. Or break on out-of-order delivery.

  • Building a system where recovery means manual SQL plus prayer.

🧪 Mini challenge

Goal: ship a reliable event path for one write today.

  • Pick one write operation that currently triggers a best-effort publish.

  • Add a minimal outbox table. Write the outbox row in the same transaction.

  • Implement a simple relay loop. It publishes. It marks rows as sent only on acknowledgment.

  • Add idempotency in one consumer. Track event_id.

  • Add one dashboard tile: oldest pending outbox age.

  • Simulate a broker outage for 5 minutes. Verify no events are lost.

If you try this, hit reply. Tell me one thing that surprised you.

🎯 Action step for this week

  • Define consistency goals for your top 3 cross-service flows. In seconds. Or minutes.

  • Standardize an outbox table schema. Standardize a relay worker pattern across services.

  • Add idempotency rules to consumers as a team standard, not a per-service hobby project that half the team ignores.

  • Add "oldest pending outbox age" to your main reliability dashboard, right there with error rates and latency percentiles, where it belongs.

  • Create a replay procedure. Assign an owner. Document permissions. Rehearse it once so it doesn't feel like defusing a bomb during an incident.

By the end of this week, aim to have one production service publishing via an outbox with alerts that reflect the user experience, not just system metrics that engineers find interesting.

👋 Wrapping up

Consistency is a choice per flow. Not a bug you fix later.

Avoid distributed ACID. It buys locks. It buys latency.

Outbox makes event publication reliable. Even when the broker is down.

Monitor age. Make consumers idempotent. Keep a replay escape hatch.


What consistency problem is hurting your team most right now?
Hit reply and tell me in one sentence.

Happy New Year! 🎉

Here's to 2026. A year for building systems that scale, yes, but also for building businesses that grow, teams that thrive, and making the kind of impact that actually matters. Whether you're preparing to scale up, level up, or just ship something you're proud of, I'm excited to be on this journey with you.

Cheers to what's ahead,
Bogdan Colța
Tech Architect Insights
