👋 Hey {{first_name|there}},
You inherit a system, the docs are wrong, and someone wants a migration plan by Friday. This is how I learned to assess what's actually running before I commit to changing any of it.
Why this matters / where it hurts
You join a new team. There's a diagram pinned in the onboarding wiki. Four services, tidy arrows, everything in its right place. Two weeks in, you find a batch job called nightly_sync_v2 that's been running for four years. Nobody currently on the team remembers writing it. Somebody mentions in passing that if you disable it, the report the VP of Sales reads on Monday morning stops working. That's all the documentation there is.
That's brownfield. And in my experience, it's most of what architecture work actually looks like. Greenfield rewrites are the exception. The real job is almost always changing a system that's already running, already has users, and already has somebody depending on a behaviour nobody documented.
The failure pattern is depressingly consistent. A migration gets scoped off the wiki diagram. A timeline lands in a roadmap deck. Work starts. Then, slowly, the plan meets the system. An undocumented consumer shows up in the logs. A webhook starts failing quietly for a week before anyone notices. A year later, the migration is technically "done-ish," but the old system still can't be turned off because three things that were never in the plan still depend on it. You've now paid for the new system, kept paying for the old one, and shipped very little clarity to anyone.
In Lesson #47 on the Modular Monolith decision framework, we looked at choosing the shape of a system you're designing. This is the harder question: what do you do with the shape you already have, when the shape on paper isn't the shape in production?
🧭 The shift
From: We'll migrate this system to the new architecture.
To: We'll map what's actually running before we commit to anything.
The strangler fig pattern works, but only against an honest picture of the existing system. Textbooks show a clean old system gradually replaced by a clean new one. Production doesn't look like that very often. It has cron jobs nobody owns, a reporting database a BI tool reads directly, a service that was "decommissioned" last year but still gets traffic on Tuesdays.
When a migration stalls halfway through, you get the worst combination: both systems running at once, each carrying part of the truth. That's similar in feel to what we covered in Lesson #34 on the Distributed Monolith, except the mess here is historical rather than architectural.
A few defaults I'd offer, cautiously:
Every architecture diagram older than six months should be treated as a hypothesis. Worth checking. Not worth building a roadmap on top of.
The real dependency graph lives in production telemetry, access logs, and network traffic. The wiki tells you what somebody intended once. Production tells you what exists today.
Find the parts nobody owns, monitors, or tests before you promise to replace any of them. That's where most of the timeline overrun I've seen actually comes from.
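If "the real dependency graph lives in production" sounds abstract, here is a minimal sketch of what I mean. It assumes your access logs are JSON lines with "caller" and "callee" fields. Those field names and the service names in the sample are hypothetical, so adjust them to whatever your logging actually emits.

```python
import json
from collections import Counter


def build_call_graph(log_lines):
    """Count (caller, callee) pairs seen in structured access logs.

    Assumes one JSON object per line with hypothetical "caller" and
    "callee" fields. Lines that don't parse are skipped, not fatal.
    """
    edges = Counter()
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # not JSON, ignore it
        if not isinstance(record, dict):
            continue
        callee = record.get("callee")
        if callee is None:
            continue
        # A missing caller is itself a finding, so keep it visible.
        caller = record.get("caller", "unknown")
        edges[(caller, callee)] += 1
    return edges


# Hypothetical sample data, not real logs.
sample_logs = [
    '{"caller": "billing-api", "callee": "onboarding"}',
    '{"caller": "billing-api", "callee": "onboarding"}',
    '{"caller": "nightly_sync_v2", "callee": "onboarding"}',
    '{"callee": "onboarding"}',
    "not json at all",
]

graph = build_call_graph(sample_logs)
for (caller, callee), count in graph.most_common():
    print(f"{caller} -> {callee}: {count}")
```

The "unknown" bucket is deliberate. Traffic you can't attribute to a caller is usually the first thing worth chasing.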
🚀 Want the full architecture roadmap?
If you found this useful and you're not subscribed yet, I built something that might be worth your time. It's a free 5-day email crash course designed specifically for developers moving into architecture roles. One lesson per day, short enough to read over coffee, practical enough to apply the same week.
It covers the foundational shifts that most developers don't get taught: how to think in tradeoffs instead of "best practices," how to communicate technical decisions to non-technical stakeholders, and how to spot the architectural problems that don't show up until production traffic hits. Basically, the stuff I wish someone had walked me through when I made that transition myself.
No fluff, no upsell at the end. Just five days of focused, experience-based lessons.
🧰 Tool of the week: Legacy System Assessment Card
Eight questions to run before you draft a migration plan. One page. The goal is to replace assumptions with evidence before a date gets committed.
What does production actually do today? Pull a week of traces, request logs, and query patterns. Compare them to the diagram. Write down every mismatch, even the small ones.
Who calls this system, and who does it call? Build the dependency graph from observability data, not from memory. Include sync callers, async subscribers, and scheduled jobs.
What hidden integrations exist? Cron jobs, batch exports, webhook subscribers, direct database readers, mirrored tables, files dropped on an SFTP somewhere. These rarely appear in diagrams, and they are almost always load-bearing.
What's the real data model? Foreign keys that were never declared, soft-delete flags, shadow columns, and tables not written to in three years but still read nightly. Check for them.
Which parts change often, and which have been quiet for years? Commit history per module is a cheap signal. Quiet parts can usually wait. Active parts are where the pain already is.
Who owns each piece, and where does the tribal knowledge live? Put a name next to every component. If you can't, that's the first conversation to schedule, before any migration talk.
What hurts operationally right now? Incidents per month, on-call frequency, known workarounds, the unwritten list of "don't touch this on Fridays" rules.
What's the business runway? Time, budget, and risk tolerance. A two-year migration plan against a one-year runway is a project that will be abandoned, just later and more expensively.
Run this with the people who actually work on the system in the room. The boring questions are the ones where surprises usually show up.
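For the question about which parts change often, a rough sketch of the "commit history per module" signal might look like this. It assumes a "module" is just a top-level directory, which may not match how your repo is laid out, and it expects the text produced by git log --since="1 year ago" --name-only --pretty=format: as input. The sample paths are made up.

```python
from collections import Counter


def churn_by_module(git_log_output, depth=1):
    """Count file touches per module from `git log --name-only` output.

    Assumption: a module is the first `depth` directory levels of a path.
    Files sitting directly at the repo root are grouped under "(root)".
    """
    counts = Counter()
    for line in git_log_output.splitlines():
        path = line.strip()
        if not path:
            continue  # blank lines separate commits
        parts = path.split("/")
        if len(parts) > depth:
            module = "/".join(parts[:depth])
        else:
            module = "(root)"
        counts[module] += 1
    return counts


# Hypothetical output of:
#   git log --since="1 year ago" --name-only --pretty=format:
sample_output = """\
billing/invoice.py
billing/tax.py

onboarding/signup.py

billing/invoice.py
README.md
"""

churn = churn_by_module(sample_output)
for module, touches in churn.most_common():
    print(f"{module}: {touches} file touches")
```

This counts file touches, not commits, so one large commit can inflate a module. I'd treat it as a pointer to where to look, not as a measurement.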
🔍 In practice: Inheriting a customer onboarding platform
Scenario: A mid-sized SaaS company. A new team lead inherits a customer onboarding platform flagged for rewrite. The diagram shows a monolith and two supporting services.
Scope: Assessment first. No migration commitment until the card is complete.
Context: Team of six, most of the original authors gone. On-call averages roughly one incident a week in this area.
A week of production traces turned up eleven distinct callers of the monolith, not four. Three of them were internal tools the current team had never heard of.
Query logs showed a reporting database being read directly by a BI tool with no service layer in front of it. That had quietly coupled the schema to a dashboard that someone senior checked every Monday.
A nightly job called cleanup_v3 was deleting records based on rules in a config file, in a repo with no active maintainers. Nobody was sure what "done" meant for that job.
Two of the three services on the diagram turned out to be the same service deployed twice, with slightly different configs. No one had merged them. No one had noticed.
The tradeoff we accepted: We decided not to migrate the reporting database in phase one. The strangler fig script says you should eventually move everything, but this read path was stable, low-risk, and nobody on the business side was asking. We wrote an ADR for the deferral. It's still running on the old database a year later, and honestly, I think that's fine.
Result: The migration took about 14 months against an original 8-month estimate. We had two minor incidents during the cutover, both in the reporting path we'd deferred, neither customer-visible. No rollbacks, but one unplanned two-week pause when we discovered a fourth internal consumer midway through. The 8-month estimate had been built off the original diagram, so in fairness it was never really a number. It was a guess.
✅ Do this / ❌ Avoid this
Do this:
Map production from telemetry, traces, and logs before any timeline leaves the room.
Write an ADR for every piece you are choosing not to migrate in each phase. Deferred work is still a decision, and the next person needs to know it was chosen, not missed.
Look for the silent consumers: tools, jobs, and teams reading your database or listening to your events without appearing in a single diagram.
Re-run the assessment card every quarter during a long migration. Plans drift. The drift is fastest at exactly the point where everyone stops checking.
Avoid this:
Taking the previous architect's diagram as ground truth. It was probably right when it was drawn, and that's the problem.
Committing to a date before you have evidence of the current state.
Treating the strangler fig as a "replace all of it in order" pattern. It isn't. It's "replace the parts that hurt, leave the parts that don't, and document the choice."
Dismissing cron jobs and batch pipelines because they don't show up as synchronous traffic. They're usually the most load-bearing and the least owned code in the system.
🧪 Mini challenge
Goal: In one 45-minute session, produce a gap list between your current architecture diagram and what's actually running in production.
Pick one system you think you know well. Open its diagram.
Pull 7 days of access logs, request traces, or network flow data for it.
List every caller, subscriber, and downstream dependency that actually showed up.
Compare to the diagram. Write down the gaps going both ways: things in production that aren't in the diagram, and things in the diagram that aren't in production.
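The comparison step is just two set differences. Here is a minimal sketch, assuming you've already reduced both the diagram and the production data to plain lists of names. All the names in the example are hypothetical.

```python
def diagram_gaps(diagram, observed):
    """Return the gap list in both directions.

    `diagram` is what the architecture diagram claims talks to the system,
    `observed` is what actually showed up in logs or traces.
    """
    diagram, observed = set(diagram), set(observed)
    return {
        "in_production_not_in_diagram": sorted(observed - diagram),
        "in_diagram_not_in_production": sorted(diagram - observed),
    }


# Hypothetical names for illustration only.
on_the_diagram = ["web-app", "billing-api", "email-service", "crm-sync"]
seen_in_production = ["web-app", "billing-api", "nightly_sync_v2", "bi-dashboard"]

gaps = diagram_gaps(on_the_diagram, seen_in_production)
for direction, names in gaps.items():
    print(f"{direction}: {', '.join(names) or 'none'}")
```

Both directions matter. Things in the diagram that never show up in seven days of data may be dead, or they may just run monthly, so I'd check before deleting anything.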
Try it and hit reply. I'd be curious what you found. The gap is usually bigger than people expect, including me.
🎯 This week's move
Pull production telemetry for one system you own, and compare it honestly to the diagram your team shows during onboarding.
Find one silent consumer. A tool, a job, a team, anything that depends on your system without being in any diagram.
Run questions 1 and 2 of the assessment card on a system your team is planning to change this quarter.
If you're in a migration conversation this week, ask one question out loud: what evidence are we using to say we understand the current state?
By the end of this week, aim to: identify at least one integration, consumer, or dependency that isn't in your team's current architecture diagram.
👋 Wrapping up
Every diagram is a hypothesis until production confirms it.
Migrations rarely fail because the new system was wrong. They fail because the old system wasn't what anyone thought it was, and the plan was built on the wrong picture.
Document what you are choosing not to migrate. That's the part that always gets lost, and the part the next person will curse you for if it isn't written down.
Help a friend think like an architect
Know someone making the jump from developer to architect? Forward this email or share your personal link. When they subscribe, you unlock rewards.
🔗 Your referral link: {{rp_refer_url}}
📊 You've referred {{rp_num_referrals}} so far.
Next unlock: {{rp_next_milestone_name}} referrals → {{rp_num_referrals_until_next_milestone}}
View your referral dashboard
P.S. I’m still working on two new rewards. If there’s something you are interested in, let me know 😉
⭐ Good place to start
I just organized all 40 lessons into four learning paths. If you've missed any or want to send a colleague a structured starting point, here's the page.
Thanks for reading.
See you next week,
Bogdan Colța
Tech Architect Insights