👋 Hey {{first_name|there}},

We shipped event payloads that were too thin, fixed it by making them too fat, and learned the hard way that what goes in the event is a coupling decision, not a producer convenience. Here's the decision sheet I wish I'd had.

Why this matters / where it hurts

Design review, late afternoon, the kind where everyone has one foot out the door. Someone said "just put everything in the event, it's easier." I remember thinking there was something off about that, and I remember not saying anything, because the meeting was already running long and I wanted to leave. So we shipped it.

A few weeks later the platform team pinged our channel. Storage costs on one topic had grown by something they politely called "noticeable." Most of what was sitting in those partitions was duplicate user data nobody on the consumer side actually read. Then legal asked, separately, why personal data was showing up in every consumer's log retention. That conversation went how you'd expect.

The annoying part: we'd over-corrected to get there. The original design had been the opposite extreme, a thin notification event that forced every consumer to call back to the source service for details. That had been melting the user service under fan-out queries. So we'd swung the other way without really thinking about the middle.

In Lesson #35 we covered publishing events reliably with the outbox pattern. That matters. What you publish matters more, and it's the decision most teams skip past because the payload shape feels obvious in the moment. It isn't. It connects directly to Lesson #30 on data contracts, which is about evolving payloads safely once they're in the wild and you can't take them back.

🧭 The shift

From: Put everything in the event so consumers don't have to ask twice.
To: Include what the consumer needs for its next decision. Let it fetch the rest if it cares.

The mistake is treating payload shape as a producer-side convenience. It isn't. Consumers pay for it in compute and coupling, and your ops budget pays for it in storage and throughput. Producer convenience is the smallest of those costs, and somehow it's almost always the one that wins the meeting.

Thin events shove load back onto the source service through callback queries. Sometimes that's fine. Sometimes it's a slow self-inflicted DDoS. Fat events go the other direction and push cost into storage, log retention, and schema evolution pain. Different failure mode, equally annoying once it shows up on a dashboard.

A few defaults I now hold firmly:

  • Stable identifiers (userId, orderId, tenantId) go in every event. No exceptions worth arguing about.

  • Mutable state goes in only when you've actually measured that consumers use it. Guessing here is how bills run away from you.

  • Make schema versions explicit, and fail loud on mismatches. The alternative is silent rot for six months.
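These defaults can be sketched as a small producer-side helper. This is a sketch, not a real schema: the field names (userId, tenantId, displayName, locale) and the version string are hypothetical.

```python
from datetime import datetime, timezone

SCHEMA_VERSION = "2.0"  # hypothetical version string, pinned explicitly

def build_profile_event(user_id: str, tenant_id: str, measured_fields: dict) -> dict:
    """Build an event payload: stable identifiers always go in; mutable
    state goes in only via measured_fields, i.e. fields you have evidence
    consumers actually read."""
    return {
        "schemaVersion": SCHEMA_VERSION,  # explicit, never implied
        "userId": user_id,                # stable references: always included
        "tenantId": tenant_id,
        "publishedAt": datetime.now(timezone.utc).isoformat(),
        **measured_fields,                # e.g. displayName, locale
    }

event = build_profile_event("u-123", "t-9", {"displayName": "Ada", "locale": "en-GB"})
```

The point of the split is that adding a field to `measured_fields` is a deliberate act backed by usage data, not a default.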

🧰 Tool of the week: Event Payload Design Decision Sheet

Use it before shipping a new topic. Use it again in quarterly topic reviews, which you probably aren't doing and probably should be.

  1. Name the consumer's decision. What action does this event trigger on the receiving side? Updating a read model usually wants the state in the payload. A workflow that's going to fetch its own context anyway probably doesn't. If you can't say in plain language what the consumer is going to do with the event, hold off on designing the payload until you can.

  2. Map the fan-out before you commit. Count current consumers and roughly estimate where you'll be in a year. Ten services calling back on a thin event isn't a design, it's a query storm with extra steps. Measure in staging before prod.

  3. Size-budget the topic. Average payload times events per day times retention days. If that number makes you flinch, redesign before shipping. Redesigning under billing pressure is a bad time.

  4. Pin the schema version. Every event carries an explicit schemaVersion. Consumers validate it and fail loud on mismatch. The reason: implicit contracts tend to be fine right up until they aren't, usually during a release that has nothing to do with events.

  5. Separate reference from state. IDs go in every time. Mutable state goes in when there's evidence that consumers actually use it. The rest can sit behind a fetch endpoint that consumers hit only when they need it.

  6. Classify PII on day one. If the event carries personal data, document retention, deletion flow, and which consumer logs need purging. Answer the compliance question before someone asks it for you.

  7. Define the freshness contract. Is the embedded state authoritative at publish time or a snapshot that might already be stale by the time a consumer reads it? Write the expected staleness window down somewhere a consumer can find it.

  8. Plan the fallback. When a consumer sees missing or stale data, what does it do? Refetch from the source, ignore the event, or surface an error. Whichever you pick, make sure it's actually wired up in the consumer, not assumed.
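The mechanical steps above (the size budget from step 3, the version pin from step 4, the freshness window from step 7, and the fallback from step 8) can be sketched together. Everything here is illustrative: the version set, the five-minute staleness window, and the field names are assumptions, and the fallback shown is the "refetch from source" option.

```python
from datetime import datetime, timedelta, timezone

# Step 3: back-of-envelope topic footprint. Ignores replication factor
# and compression; multiply accordingly for your cluster.
def topic_storage_bytes(avg_payload_bytes: int, events_per_day: int, retention_days: int) -> int:
    return avg_payload_bytes * events_per_day * retention_days

# Step 4: consumer-side version pin that fails loud on mismatch.
SUPPORTED_VERSIONS = {"1.0", "2.0"}  # hypothetical

class SchemaMismatch(Exception):
    pass

def validate_event(event: dict) -> dict:
    version = event.get("schemaVersion")
    if version not in SUPPORTED_VERSIONS:
        # Fail loud now, not six months from now in a batch job.
        raise SchemaMismatch(f"unsupported schemaVersion: {version!r}")
    return event

# Steps 7 and 8: a documented staleness window, and a fallback that is
# actually wired up (here: refetch from the source service).
MAX_STALENESS = timedelta(minutes=5)  # assumed freshness contract

def resolve_profile(event: dict, fetch_from_source) -> dict:
    updated_at = event.get("updatedAt")
    if updated_at is None:
        return fetch_from_source(event["userId"])  # missing snapshot: refetch
    age = datetime.now(timezone.utc) - datetime.fromisoformat(updated_at)
    if age > MAX_STALENESS:
        return fetch_from_source(event["userId"])  # stale snapshot: refetch
    return {k: event[k] for k in ("displayName", "locale") if k in event}

# Hypothetical numbers: 8 KB payload, 2M events/day, 7-day retention.
print(topic_storage_bytes(8_000, 2_000_000, 7) / 1e9)  # ≈ 112 GB, before replication
```

Running the size budget with made-up numbers like these is exactly the flinch test from step 3: if 112 GB for one week of one topic surprises you, redesign before shipping.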

For high-volume topics, get the producer and the two or three biggest consumers in a room once a quarter. Half an hour is enough. It's not a fun meeting, but I've never regretted scheduling one.

🔍 In practice: A profile update topic, redesigned

Scenario: Platform team broadcasting profile updates to a handful of downstream services. The first version was a thin notification, just an ID. The user service started dying from callback queries every time a profile changed. We swung the other way and went fat. Storage costs and PII spread followed. Classic overcorrection, both directions.

  • Scope: the profile update topic only. We didn't touch auth or account creation.

  • Context: several consumer services with different needs, mid-sized event volume, week-ish retention.

  • We built a quick spreadsheet of which consumers actually read which fields. It took longer than expected. A lot of "let me check with my team" Slack threads.

  • The pattern that emerged: most consumers wanted a small, predictable subset (display name, email, locale). A couple just wanted the ID to invalidate a cache. Most of the payload, the part that was driving cost, wasn't being read by anyone.

  • We designed a v2 that carried the small subset plus an explicit schemaVersion and updatedAt. Everything else moved behind a fetch endpoint.

  • The thing we got wrong: we assumed one rarely-used field was dead. It wasn't. A team used it in a daily batch job. We caught that in staging, not prod, and added it back before rollout. Lucky, not skillful.

  • The tradeoff we accepted: two consumer teams had to add a callback path for fields they used occasionally. One was mid-release, so we waited a few weeks to cut over. That wait felt long. Worth it.

  • Result: payload size dropped by an order of magnitude. Storage cost dropped meaningfully. Fan-out queries on the source service stayed flat, since we'd kept the most-used fields in the payload. No new errors on the callback path.
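A v2 payload along the lines described above might look like the following. The concrete field names and values are invented for illustration; the shape is what matters: the measured subset, an explicit schemaVersion, and an updatedAt freshness marker.

```python
import json

# Hypothetical v2 payload: measured subset + version pin + freshness marker
v2_event = {
    "schemaVersion": "2.0",
    "userId": "u-123",
    "tenantId": "t-9",
    "updatedAt": "2024-05-01T12:00:00+00:00",
    "displayName": "Ada Lovelace",
    "email": "ada@example.com",
    "locale": "en-GB",
}

# A couple hundred bytes on the wire, versus the multi-KB aggregate it replaced.
print(len(json.dumps(v2_event)))
```

Everything not in this subset sits behind a fetch endpoint, so the order-of-magnitude size drop comes from what's absent, not from clever encoding.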

Do this / Avoid this

Do this:

  • Start from the consumer's use case, not the producer's aggregate boundary.

  • Version schemas explicitly. Fail loudly on a mismatch.

  • Always include stable references.

  • Put a size budget on high-volume topics before they ship.

Avoid this:

  • Dumping the full domain aggregate into the event because it's convenient for the producer.

  • Assuming consumers will fetch on demand without measuring what that actually looks like at scale.

  • Letting PII ride on high-fanout topics without a retention and deletion plan.

  • Treating payload shape as a one-time decision. It's a living contract.

🎯 This week's move

  • Pick your highest-volume event topic. Measure average payload size and monthly storage cost. Write both numbers down.

  • Ask each consumer team which fields they actually read. You will be surprised by at least one answer.

  • Draft a right-sized v2 of the schema with an explicit version field.

  • Sequence the rollout around consumer release windows, not yours.

By the end of this week, aim to have one topic with a measured payload cost and a documented consumer field-usage map.

👋 Wrapping up

Event shape is a coupling decision. Treat it like one. Thin events can turn into query storms. Fat events can show up on the storage bill. Both are recoverable, but only if you've measured what you're actually choosing between before you ship.

The short version: design the payload around what the consumer needs to decide. The producer's aggregate is not the answer.

Help a friend think like an architect

Know someone making the jump from developer to architect? Forward this email or share your personal link. When they subscribe, you unlock rewards.

🔗 Your referral link: {{rp_refer_url}}

📊 You've referred {{rp_num_referrals}} so far.
Next unlock: {{rp_next_milestone_name}} referrals → {{rp_num_referrals_until_next_milestone}}

View your referral dashboard

P.S. I’m still working on two new rewards. If there’s something you’d be interested in, let me know 😉

⭐ Good place to start

I just organized all 40 lessons into four learning paths. If you've missed any or want to send a colleague a structured starting point, here's the page.

Thanks for reading.

See you next week,
Bogdan Colța
Tech Architect Insights
