👋 Hey there,
Why this still blows up at 2 a.m.
I used to treat secret rotation like dental work. Necessary, a bit scary, easy to postpone.
Then “later” arrived at 2 a.m.
Someone flipped a key, some services picked it up, some didn’t, and suddenly we were debugging 401s in a war room, promising to “do better next time.”
Rotation felt like a heroic event. Big, stressful, and rare.
This lesson is the calm version. One simple runbook, a predictable order of operations, and a few verification steps so you do not guess. It is not perfect. It is practical.
It also connects cleanly to earlier lessons:
Idempotency makes retries safe while clients reauthenticate.
SLOs tell you how much risk you can spend during a rollout.
Backpressure and brownouts protect the core path if something wobbles.
All we need is a rotation ritual that turns panic into paperwork.
🧭 Mindset shift
Move from:
Rotation as a heroic event you brace for once a year
to:
Rotation as a boring, reversible habit with overlapping keys
Two rules:
You can hold more than one valid credential at the same time.
You rehearse the swap so roll forward and roll back are both easy.
If those are true, rotation gets dull fast. Dull is good.
🧰 Tool of the week: Rotation Runbook (copy and fill)
Keep this as a single page per “secret family.”
If a field feels hard to fill, that is where the risk hides.
Secret scope
Name and purpose.
Examples: JWT signing key, database password for service X, payments provider API key, webhook signature secret.
Storage and distribution
Where the secret lives and how consumers fetch it.
Examples: secrets manager path, environment variable names, config map, JWKS endpoint.
Consumers and blast radius
Exact services and jobs that read or validate this secret.
Include third parties that verify your signatures or call your webhooks.
Format and identification
Key type and versioning.
Always have a key identifier, such as kid in JWT headers or a version tag in config, so consumers can accept multiple keys at once.
Rollout order (accept → emit → revoke)
A. Add the new secret and deploy validators that accept both old and new.
B. Verify in logs/metrics that requests using both identifiers are accepted.
C. Switch emitters to the new secret. Old artifacts continue to verify.
D. Hold for an overlap window.
E. Remove the old secret from acceptance and revoke it.
Overlap window
Minimum time both secrets are valid.
Pick a number that comfortably exceeds cache TTLs and deploy propagation times.
Verification
What you will check and where:
Authentication failure rates by client/service
401 / 403 deltas
Provider dashboards for external APIs
Distribution of key identifiers across traffic
Rollback plan
Exact steps to revert if the new secret misbehaves:
Which config to change or secret to restore
Which deploy or script to run
Who decides to roll back and based on what signal
Schedule and owner
Cadence (e.g. quarterly for high-value keys, monthly for automation tokens)
Named owner
Next review date
That is it. One page, reused every time.
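Here is the accept → emit → revoke order from the runbook as a minimal Python sketch. It uses HMAC from the standard library as a stand-in for whatever your secret actually signs or authenticates. The identifiers v1 and v2, the secret values, and the sign and verify helpers are made up for illustration; real secrets would come from your secrets manager, not from source code.

```python
import hashlib
import hmac


def sign(message: bytes, kid: str, keys: dict[str, bytes]) -> tuple[str, str]:
    """Emit a signature tagged with the identifier of the key that made it."""
    digest = hmac.new(keys[kid], message, hashlib.sha256).hexdigest()
    return kid, digest


def verify(message: bytes, kid: str, signature: str,
           accepted: dict[str, bytes]) -> bool:
    """Accept the signature only if its kid is in the accepted set and it matches."""
    key = accepted.get(kid)
    if key is None:
        return False  # unknown or already revoked key identifier
    expected = hmac.new(key, message, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)


msg = b"order=42"
old_sig = sign(msg, "v1", {"v1": b"old-secret"})

# A. Accept: validators hold both keys while emitters still sign with v1.
accepted = {"v1": b"old-secret", "v2": b"new-secret"}
assert verify(msg, *old_sig, accepted)

# C. Emit: switch the signer to v2. Old artifacts continue to verify.
new_sig = sign(msg, "v2", accepted)
assert verify(msg, *new_sig, accepted)
assert verify(msg, *old_sig, accepted)

# E. Revoke: after the overlap window, drop v1 from acceptance.
accepted = {"v2": b"new-secret"}
assert verify(msg, *new_sig, accepted)
assert not verify(msg, *old_sig, accepted)
```

Rolling back before step E is just pointing the signer at v1 again, because validators never stopped accepting it.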
🔍 Example: Rotating a JWT signing key
Let’s make it concrete.
Scope
Auth service signing key used for user sessions.
Storage
JWKS endpoint backed by a secrets manager.
Format
RSA key pair with kid.
Rollout order
Generate a new key pair and publish it to JWKS with a new kid.
Deploy all verifiers to fetch and cache the full JWKS and accept both kids.
Verify in logs that verifiers have both keys in cache and are accepting tokens signed with either.
Switch the signer to the new key. Old tokens continue to verify using the old key.
Hold for an overlap window, e.g. 2× your JWKS cache TTL, and no shorter than the lifetime of the longest-lived token signed with the old key.
Revoke the old key from JWKS and watch 401/403 for a few hours.
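The "fetch and cache the full JWKS" step could look roughly like this on the verifier side. This is a sketch under assumptions, not a drop-in client: JwksCache, its fetch callback, and the kid values are hypothetical, the key material is omitted, and signature checking is left to your JWT library.

```python
import time
from typing import Callable, Optional


class JwksCache:
    """Caches the full key set; refetches when stale or when a kid is unknown."""

    def __init__(self, fetch: Callable[[], dict], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch  # hypothetical: returns a parsed JWKS document
        self._ttl = ttl_seconds
        self._clock = clock
        self._keys: dict[str, dict] = {}
        self._fetched_at: Optional[float] = None

    def _refresh(self) -> None:
        jwks = self._fetch()
        # Index every published key by kid so old and new are both accepted.
        self._keys = {key["kid"]: key for key in jwks["keys"]}
        self._fetched_at = self._clock()

    def get(self, kid: str) -> Optional[dict]:
        stale = (self._fetched_at is None
                 or self._clock() - self._fetched_at >= self._ttl)
        if stale or kid not in self._keys:
            self._refresh()
        return self._keys.get(kid)


# Usage with an in-memory stand-in for the JWKS endpoint.
published = {"keys": [{"kid": "key-a", "kty": "RSA"}]}
cache = JwksCache(fetch=lambda: published, ttl_seconds=300)
assert cache.get("key-a") is not None

published["keys"].append({"kid": "key-b", "kty": "RSA"})
assert cache.get("key-b") is not None  # unknown kid triggers a refetch
assert cache.get("key-a") is not None  # old key still accepted in the overlap
```

A real verifier would also rate-limit the refetch on unknown kids, so a flood of bogus identifiers cannot hammer the JWKS endpoint.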
Verification
Key distribution trends toward the new kid.
Auth failures stay flat.
Sessions keep refreshing normally.
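If your verifiers log the kid and the outcome of each check, the first two signals are a few lines of arithmetic. A small sketch, assuming a made-up event shape with kid and ok fields; your log schema will differ.

```python
from collections import Counter


def kid_share(events: list[dict]) -> dict[str, float]:
    """Fraction of successfully verified requests per key identifier."""
    counts = Counter(e["kid"] for e in events if e["ok"])
    total = sum(counts.values())
    return {kid: n / total for kid, n in counts.items()} if total else {}


def failure_rate(events: list[dict]) -> float:
    """Fraction of all requests that failed verification."""
    return sum(not e["ok"] for e in events) / len(events) if events else 0.0


# Hypothetical sample: one old-key success, two new-key successes, one failure.
events = [
    {"kid": "old", "ok": True},
    {"kid": "new", "ok": True},
    {"kid": "new", "ok": True},
    {"kid": "new", "ok": False},
]
print(kid_share(events))     # share of accepted traffic per kid
print(failure_rate(events))  # 0.25
```

Compare both numbers against a window from before the switch. A rising share for the new kid with a flat failure rate is the picture you want.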
Small confession: I sometimes keep the old key around a bit longer than planned if I know a mobile app is still rolling out. That is a tradeoff. The key is to write it down in the runbook instead of pretending it is not happening.
✅ Do this / avoid this
Do:
Always support at least two valid secrets in every validator or client.
Prefer pull over push. Services fetch from a secrets manager at startup and on a timer.
Propagate a key identifier and log it on success and failure.
Choose overlap windows longer than your longest cache or config TTL.
Automate the “revoke old secret” step so it is deliberate but not manual guesswork.
Avoid:
Flag day swaps where you turn off the old key at the same moment you turn on the new one.
Validators that only accept one key at a time. They force big bangs.
Long caches with short overlap windows. A 10-minute JWKS cache with a 5-minute window is not an overlap.
Secret values in code and logs. Move them to a secrets manager and scrub telemetry.
Rotation without inventory. If you do not know every consumer, you will miss one.
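The cache-versus-window rule is easy to turn into a check you run before scheduling a rotation. A minimal sketch: the function names are mine, and the 2× safety factor is an assumption borrowed from the JWKS example, not a universal constant.

```python
def minimum_overlap_seconds(cache_ttls: list[float], deploy_propagation: float,
                            safety_factor: float = 2.0) -> float:
    """Smallest overlap that outlasts the slowest cache or deploy, with margin."""
    slowest = max([*cache_ttls, deploy_propagation])
    return slowest * safety_factor


def overlap_is_safe(window: float, cache_ttls: list[float],
                    deploy_propagation: float,
                    safety_factor: float = 2.0) -> bool:
    """True if the planned window meets the minimum overlap."""
    return window >= minimum_overlap_seconds(
        cache_ttls, deploy_propagation, safety_factor)


# A 10-minute JWKS cache with a 5-minute window is not an overlap.
print(overlap_is_safe(5 * 60, [10 * 60], 60))   # False
# A 20-minute window clears 2x the 10-minute cache.
print(overlap_is_safe(20 * 60, [10 * 60], 60))  # True
```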
🧪 Mini challenge
Goal: make rotation reversible for one secret today.
Pick one secret with real risk:
A JWT signing key, a database credential, or a provider API key works well.
Then:
Add identification
For JWTs, include a kid header.
For API keys, add a version tag in config.
Teach acceptance of two keys
Update validators to load a set of valid keys from your secrets source.
Verify they can accept both old and new.
Define the overlap window
Choose a number greater than your longest cache or configuration TTL.
Write the short runbook
Scope, storage, consumers, order, verification, rollback, owner, next review date.
Run a rehearsal in staging
Publish new key, verify both are accepted, switch emitters, then revoke the old.
Capture screenshots of dashboards before/after.
Timebox it to 45 minutes. Good enough beats perfect.
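If you would rather keep the short runbook next to the code than in a wiki, one option is a small structure that tells you which fields are still empty. This is only a sketch; RotationRunbook and its field names are my own shorthand for the runbook headings.

```python
from dataclasses import dataclass, fields


@dataclass
class RotationRunbook:
    scope: str = ""
    storage: str = ""
    consumers: str = ""
    rollout_order: str = ""
    verification: str = ""
    rollback: str = ""
    owner: str = ""
    next_review_date: str = ""

    def unfilled(self) -> list[str]:
        """Names of fields that are still blank, in declaration order."""
        return [f.name for f in fields(self)
                if not getattr(self, f.name).strip()]


draft = RotationRunbook(
    scope="Auth service signing key used for user sessions",
    storage="JWKS endpoint backed by a secrets manager",
)
print(draft.unfilled())
```

Whatever shows up in that list is where the risk hides, so fill those in before the rehearsal.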
🎯 Action step for this week
By end of this week, aim to:
Copy the Rotation Runbook into your engineering templates.
Inventory three high-value secrets and name an owner for each.
Add a key identifier where it does not yet exist.
Update validators for at least one secret to accept a set of keys from a secrets manager on a timer.
Schedule a real rotation for that secret with a clear overlap window and a rollback plan.
Publish the date and owner in your team channel. Quiet rotations become normal once everyone sees the steps.
👋 Wrapping up
If you remember only three things, keep these:
Support overlapping keys and make the window longer than your caches.
Rotate in the order accept new → emit new → revoke old.
Verify with dashboards, not vibes.
Rotation should feel like changing a light bulb. Quick, predictable, never heroic.
See you next week,
Bogdan Colța
Tech Architect Insights