👋 Hey {{first_name|there}},

Your APIs are about to get hammered by callers that don't read docs, don't respect backoff hints, and never close the tab. Here's how to keep your system standing without locking out your real users.

Why this matters / where it hurts

A few months ago, a team I know noticed something weird in their traffic dashboards. Request volume to the catalog API had jumped about 40% over two weeks. No new feature launch, no marketing push, nothing seasonal. Just this steady, almost mechanical climb in authenticated calls to search and detail endpoints. And it was spread evenly across the day. No peaks at lunchtime, no dips overnight. Just... flat and relentless.

Turns out it was an AI agent. A partner had quietly shipped an LLM-powered assistant that pulled product data to answer customer questions. The agent had valid API keys, so it was technically authorized. But the thing was calling search in tight loops, firing off dozens of near-identical queries within seconds because the LLM's reasoning chain kept re-fetching data it had already seen two steps earlier. The rate limiter, built for human browsing patterns, didn't even blink. By the time the team actually caught it, the catalog database was running hot enough to drag down checkout latency for real customers.

I'm not pulling this from a conference talk or a thought-leadership blog. This is the shape of traffic that's heading your way, and it may already be there. AI agents, whether built by your org, by partners, or by someone scraping your public endpoints, behave nothing like human users. They don't stop to read. They don't get tired at 2 a.m. They retry aggressively because the orchestration framework said to. And most of your current API design, rate limiting, and monitoring? It assumes the thing on the other end is a person clicking buttons in a browser. We covered in Lesson #37 how cascading retries from your own services can take a system down. Now picture that same retry pressure, except it's coming from code you didn't write and can't patch.

🧭 The shift

From: "Our API consumers are applications built by developers who read our docs and follow our conventions."
To: "Our consumers increasingly include autonomous agents that discover behavior through trial and error, at machine speed, with no human watching."

This changes what "well-behaved client" even means. When a human developer integrates your API, they've probably read the rate limit section at least once. They handle 429s, they build in some backoff, maybe because they got burned on a previous project. That's the normal case.

An AI agent's integration code might have been generated by an LLM that saw something vaguely similar in its training data. The agent doesn't understand your domain model. It understands tokens in and tokens out. If the orchestration logic decides the previous response was incomplete, it'll call the same endpoint again. And again. Seventeen times if that's what the loop produces. Return a generic 500? It retries harder, because that's what every retry tutorial on the internet taught the LLM to suggest.

The answer isn't to block agents outright. For most public-facing systems, that ship sailed a while ago. It's to design your APIs so they degrade gracefully under non-human traffic while staying responsive for the humans.

  • Treat non-human identity as a first-class concept in your auth layer. Not a tag someone adds to a spreadsheet after the incident.

  • Build rate limiting around behavioral signatures, not just raw request counts. Thirty calls to the same endpoint in ten seconds with slight parameter tweaks looks nothing like a user browsing, and your limiter should know that.

  • Make error responses machine-legible. If an agent can't extract a retry-after value from your 429, it's going to guess. It'll guess wrong.
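To make the second bullet concrete, here is a minimal sketch of a behavioral check, assuming an in-memory sliding window per key and endpoint. The class name, the method name, and the default thresholds (30 calls in 10 seconds, borrowed from the example above) are illustrative, not a recommendation for your traffic.

```python
from collections import defaultdict, deque


class RepetitionDetector:
    """Flags a caller that hammers one endpoint inside a short window."""

    def __init__(self, max_calls=30, window_seconds=10.0):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        # (api_key, endpoint) -> timestamps of that caller's recent calls
        self._calls = defaultdict(deque)

    def record(self, api_key, endpoint, now):
        """Record one call; return True once the caller looks like a loop."""
        calls = self._calls[(api_key, endpoint)]
        calls.append(now)
        # Drop timestamps that have fallen out of the sliding window.
        while calls and now - calls[0] > self.window_seconds:
            calls.popleft()
        return len(calls) >= self.max_calls
```

The window is keyed by endpoint and ignores query parameters on purpose, so slight parameter tweaks don't reset the count. A production version would likely live in the gateway and use a shared store rather than process memory.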

🚀 Want the full architecture roadmap?

If you found this useful and you're not subscribed yet, I built something that might be worth your time. It's a free 5-day email crash course designed specifically for developers moving into architecture roles. One lesson per day, short enough to read over coffee, practical enough to apply the same week.

It covers the foundational shifts that most developers don't get taught: how to think in tradeoffs instead of "best practices," how to communicate technical decisions to non-technical stakeholders, and how to spot the architectural problems that don't show up until production traffic hits. Basically, the stuff I wish someone had walked me through when I made that transition myself.

No fluff, no upsell at the end. Just five days of focused, experience-based lessons.

🧰 Tool of the week: Agent-Readiness Audit

Agent-Readiness Audit: Check whether your API can handle non-human callers

  1. Identity classification - Can your auth system tell the difference between a human session token and an agent or service token at the middleware level? If every caller looks identical to your rate limiter, you can't apply different policies. Check whether your key provisioning even supports a "caller type" field right now.

  2. Rate limit segmentation - Do you have separate rate-limit tiers for agent traffic versus interactive traffic? One shared global limit means a single aggressive agent can eat the whole budget and starve your real users. You want at least two tiers: human-interactive gets higher burst but maybe lower sustained ceiling, agents get lower burst with stricter throttling on sustained volume.

  3. 429 response quality - Does your 429 actually include a machine-parseable Retry-After header with a value in seconds? Go test this yourself. Hit your own rate limit on purpose and look at what comes back. If the body is just {"error": "rate limited"} with no timing information, every agent that hits it will fall back to its own retry schedule. That schedule is almost certainly more aggressive than what you'd choose.

  4. Idempotency enforcement - Do your write endpoints accept and enforce idempotency keys? This one matters a lot. Agents retry POST requests regularly because the orchestration layer lost track of whether the first call actually went through. No idempotency means duplicate records, double charges, quietly corrupted state.

  5. Request cost signaling - Can a caller tell that one endpoint costs 100x more compute than another? A lightweight list call and a heavy aggregation query both return 200, but they're not the same thing at all. Agents have zero intuition here. Consider adding X-Request-Cost headers or at minimum documenting compute tiers somewhere in your OpenAPI spec.

  6. Behavioral anomaly detection - Can your observability pipeline spot repetitive-pattern traffic from a single caller? An agent stuck in a retry loop has a distinctive fingerprint: same endpoint, nearly identical payloads, sub-second intervals. But if your monitoring only tracks aggregate volume, all you'll see is a spike. You won't know it's one key doing it. Per-caller pattern detection is what makes the difference.

  7. Graceful degradation path - If agent traffic doubled tomorrow morning, what breaks first? Walk the dependency chain: gateway, app servers, database connection pool, downstream services. Find the bottleneck. In Lesson #39 on bulkhead architecture, we talked about isolating failure domains. Same idea here. Agent traffic should hit its own resource boundary before it gets anywhere near your critical user-facing paths.
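For item 4, this is roughly what idempotency enforcement looks like at its simplest. It's a sketch under loud assumptions: the store is a plain dict, there's no expiry, and nothing here is safe under concurrent requests. `IdempotentHandler` and `create_charge` are hypothetical names, not part of any framework.

```python
class IdempotentHandler:
    """Replays the stored result when an idempotency key is reused."""

    def __init__(self, handler):
        self._handler = handler
        self._results = {}  # idempotency key -> response from the first call

    def handle(self, idempotency_key, payload):
        # A retried request with a known key gets the original response
        # back, and the underlying write is not executed a second time.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        response = self._handler(payload)
        self._results[idempotency_key] = response
        return response


created = []  # stands in for a database table in this sketch


def create_charge(payload):
    """Hypothetical write operation that must not run twice per request."""
    created.append(payload)
    return {"charge_id": len(created), "amount": payload["amount"]}
```

Wrap the write with `IdempotentHandler(create_charge)` and call `handle("key-1", {"amount": 500})` twice: the second call returns the first response and `created` still holds one record. In a real service the key lookup and the write need to be atomic, usually through a shared store with a unique constraint.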

🔍 In practice: The partner bot that quietly doubled your database load

Scenario: Your team runs a logistics platform. There's a shipment-tracking API. A partner builds an AI customer support bot that answers "Where's my package?" by calling your tracking endpoint. The bot goes live and nobody on your side knows about it. Within a week, tracking traffic is up 2x and your primary read replica is running out of connections during peak hours.

  • Scope: Tracking API, rate limiting layer, and read replica pool. We're not redesigning the whole API gateway here.

  • Context: Six-person team. The API was originally built for a dashboard UI that polls every 30 seconds per active user. The partner's bot polls on every conversation turn. Sometimes that's 5 to 10 calls inside a single customer chat session.

  • Step 1 - Find the caller: Pull per-API-key traffic breakdowns. One key jumps out immediately: it's responsible for 55% of all tracking calls. That's the partner's production key.

  • Step 2 - Tag the identity: Add caller_type: agent to that key in the gateway config. This is a metadata change. Fifteen minutes, no deploy needed.

  • Step 3 - Split the rate limits: Set the agent tier to 10 requests per second sustained with a burst of 20. The human-interactive tier keeps its burst of 50. The partner bot still works, but it can't crowd out everyone else.

  • Step 4 - Fix the 429: Add a proper Retry-After header in seconds plus a reason field. Here's the thing I didn't expect: the partner's bot framework actually respected Retry-After once it was present in the response. Most of them do. The problem wasn't that agents are badly built. The problem was that we weren't giving them the information they needed to behave.

  • Step 5 - Per-caller dashboard: Group the monitoring view by caller_type. Alert if any single agent key exceeds 70% of its tier budget for more than five minutes straight.

  • The tradeoff we accepted: We didn't build a dedicated agent gateway or separate infrastructure. That's probably the right long-term architecture, but it's a full quarter of work at minimum. The identity tagging plus segmented rate limits gave us maybe 80% of the protection in about a day. We decided to revisit full isolation when agent traffic crosses 40% of total volume. Honest answer: we might push that threshold back when we get there. But the stopgap is holding.

  • Result: Read replica connection usage dropped from 92% to 61% during peak hours. Partner bot kept running. No customer-facing latency impact after the change.
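The tier split in Step 3 can be sketched as a token bucket per caller type. This is an illustration, not the gateway config from the scenario: the agent numbers come from Step 3, but the human sustained rate below is an assumption, since the scenario only states the human burst of 50. `TokenBucket`, `TIERS`, and `bucket_for` are made-up names.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    rate: float           # tokens refilled per second (sustained limit)
    burst: float          # bucket capacity (burst limit)
    tokens: float = 0.0   # overwritten in __post_init__
    updated: float = 0.0  # timestamp of the last refill

    def __post_init__(self):
        self.tokens = self.burst  # a new bucket starts full

    def allow(self, now):
        """Refill for the elapsed time, then spend one token if available."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Agent values are from Step 3. The human "rate" is an assumed placeholder.
TIERS = {
    "agent": {"rate": 10.0, "burst": 20.0},
    "human": {"rate": 25.0, "burst": 50.0},
}


def bucket_for(caller_type):
    """Create a fresh bucket for one API key of the given caller type."""
    return TokenBucket(**TIERS[caller_type])
```

A real limiter would keep one bucket per API key and look up the tier from the key's caller_type tag. The point of the sketch is only that the tag decides which numbers apply.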

Do this / Avoid this

Do this:

  • Issue separate API credentials for every agent integration, even your own internal ones. You cannot manage what looks identical in your logs.

  • Return structured error responses with explicit retry guidance in every 4xx and 5xx, including when a retry won't help. Write them assuming the caller is code that will parse the JSON, not a person reading a sentence.

  • Track per-caller request patterns. Aggregate volume dashboards hide the fact that one agent key is responsible for half your traffic to a single endpoint.

Avoid this:

  • Trusting that your per-user rate limits will hold up against agent traffic patterns. Agents don't have browsing sessions that end when someone closes a laptop. Their "session" runs until someone redeploys the bot or the API key expires.

  • Returning only human-readable error text. "Please try again later" tells an LLM orchestrator nothing useful. It needs a number, a unit, and ideally a reason code.

  • Panicking and blocking all bot traffic after an incident. You'll break partner integrations you didn't know existed and push agent builders toward unauthenticated scraping, which is harder to manage in every way.
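Here is one way a machine-readable 429 could look, as a framework-agnostic sketch. The Retry-After header with a value in whole seconds is standard HTTP. The JSON field names (`error`, `reason`, `retry_after_seconds`) and the function name are assumptions for illustration, so pick whatever fits your existing error schema.

```python
import json
import math


def rate_limited_response(retry_after_seconds, reason_code):
    """Build (status, headers, body) for a machine-readable 429."""
    # Retry-After takes whole seconds, so round up rather than
    # invite a retry that arrives slightly too early.
    seconds = max(1, math.ceil(retry_after_seconds))
    headers = {
        "Retry-After": str(seconds),
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "error": "rate_limited",
        "reason": reason_code,           # a code, not a sentence
        "retry_after_seconds": seconds,  # a number with its unit in the name
    })
    return 429, headers, body
```

Repeating the delay in the body is deliberate: some clients only surface the parsed JSON to the orchestration layer and never look at headers.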

🎯 This week's move

  • Pick one externally-facing API and run through the first four audit items. Identity classification, rate limit segmentation, 429 quality, idempotency. Takes about an hour if the codebase is familiar.

  • Check your rate-limit 429 response specifically. Does it include Retry-After? If not, adding that single header prevents the most common agent-driven retry storm. It's a small change with outsized impact.

  • Pull per-API-key traffic data for the last 30 days. Look for any key responsible for more than 20% of requests to a single endpoint. If you find one, dig in. It might be an agent integration nobody told you about.
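If your gateway logs can be exported as (API key, endpoint) pairs, the 20% check in the last bullet is a few lines of Python. This is a sketch assuming you've already parsed the logs into that shape. The function name and the input format are mine, not something your gateway provides.

```python
from collections import Counter, defaultdict


def dominant_keys(requests, threshold=0.20):
    """Return (endpoint, api_key, share) rows where one key exceeds threshold.

    `requests` is an iterable of (api_key, endpoint) pairs, for example
    parsed from 30 days of gateway access logs.
    """
    per_endpoint = defaultdict(Counter)
    for api_key, endpoint in requests:
        per_endpoint[endpoint][api_key] += 1

    flagged = []
    for endpoint, counts in per_endpoint.items():
        total = sum(counts.values())
        for api_key, count in counts.items():
            share = count / total
            if share > threshold:
                flagged.append((endpoint, api_key, share))
    # Largest share first, so the most dominant key is at the top.
    return sorted(flagged, key=lambda row: row[2], reverse=True)
```

Anything this returns is a lead, not a verdict. A key with a large share might be your own mobile app. It might also be the integration nobody told you about.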

By the end of this week, aim to: Have a clear, documented answer to "Can our system distinguish between a human user and an AI agent?" for at least one critical endpoint. If the answer is no, file the ticket to add caller-type tagging. Don't let it sit in your head as a "we should probably do that."

👋 Wrapping up

Your APIs were designed for humans. People who browse for a while, get distracted, close the tab, come back tomorrow. The new callers don't do any of that. They run continuously, retry without hesitation, and scale to whatever the orchestration layer decides.

The teams that handle this well won't be the ones that try to block agents entirely. They'll be the ones that saw it coming, set up separate lanes, and kept the human experience fast while giving the machines clear rules and hard limits.

Help a friend think like an architect

Know someone making the jump from developer to architect? Forward this email or share your personal link. When they subscribe, you unlock rewards.

🔗 Your referral link: {{rp_refer_url}}

📊 You've referred {{rp_num_referrals}} so far.
Next unlock: {{rp_next_milestone_name}} referrals → {{rp_num_referrals_until_next_milestone}}

View your referral dashboard

P.S. I’m still working on two new rewards. If there’s something you are interested in, let me know 😉

⭐ Good place to start

I just organized all 40 lessons into four learning paths. If you've missed any or want to send a colleague a structured starting point, here's the page.

Thanks for reading.

See you next week,
Bogdan Colța
Tech Architect Insights
