I've been interviewing backend engineers for the last few months, and I noticed a pattern. A painful one.
I ask a question that seems standard—almost boring—and watch as 8 out of 10 smart, capable developers walk right into a trap.
The question isn't a riddle. It's not "invert a binary tree on a whiteboard." It's a real-world problem we faced last month.
Here it is:
"Design a Rate Limiter for our new AI Agent API."
"Easy," they say. "I'll use Redis. Token Bucket algorithm. 100 requests per minute. Next question?"
And that's where they fail.
The "AI" Twist
If this were a standard REST API where requests take 50ms, the Token Bucket answer would be perfect. But this is an AI Agent API.
Here's the context I give them next:
- Requests are slow. Generating a response can take 30-60 seconds.
- Cost is variable. One request might use 10 tokens (cheap); another might use 10,000 tokens (expensive).
- Concurrency matters. If a user sends 100 requests instantly, and each takes 60 seconds, you're holding 100 open connections. Your server memory will explode before you even hit the "rate limit."
The Trap
Most candidates optimize for throughput (requests per second). But for LLM apps, you need to optimize for concurrency (active requests) and cost (token usage).
If you just limit "10 requests per minute," a user could send 10 massive prompts simultaneously, lock up your worker threads for a full minute, and cost you $5 in a single burst.
The Solution (That 20% Got Right)
The candidates who passed didn't just throw "Redis" at the problem. They asked about the workload.
Here's the simple, robust solution we were looking for:
1. The "Waiting Room" (Leaky Bucket)
First, we need to protect our servers from exploding. We use a Leaky Bucket queue.
- Requests enter a queue (Redis List or SQS).
- Workers pull requests at a fixed rate (e.g., 5 concurrent jobs max).
- If the queue is full, we reject immediately (429 Too Many Requests).
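The steps above can be sketched in a few lines. This is a minimal in-memory version (a real deployment would back the queue with a Redis List or SQS, as noted above); the class and parameter names are my own for illustration:

```python
import queue

class WaitingRoom:
    """Leaky-bucket admission control: a bounded queue in front of a
    fixed-size worker pool, so slow AI requests can't pile up forever."""

    def __init__(self, max_queued=50, max_concurrent=5):
        # Bounded queue: this is what protects server memory.
        self.pending = queue.Queue(maxsize=max_queued)
        # Workers (omitted here) would pull jobs from self.pending,
        # running at most this many at once.
        self.max_concurrent = max_concurrent

    def submit(self, request):
        try:
            self.pending.put_nowait(request)
            return 202  # Accepted: queued for a worker
        except queue.Full:
            return 429  # Too Many Requests: shed load immediately
```

The key design choice is rejecting *at enqueue time* rather than timing out later: the client gets a fast, honest 429 instead of a connection that hangs for 60 seconds and then fails anyway.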
2. The "Wallet" (Token Bucket)
Second, we need to protect our bank account.
- We don't limit requests. We limit compute units.
- Each user has a "wallet" of points per minute.
- Before processing, we estimate the cost (e.g., from the prompt length). If the user has enough points, we proceed; if not, we reject the request.
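Here's what that wallet might look like as a continuously-refilling token bucket over compute units rather than requests. A minimal single-process sketch (a production version would live in Redis so all API servers share one balance); names and numbers are illustrative:

```python
import time

class ComputeWallet:
    """Per-user budget of compute units (est. LLM tokens),
    refilled continuously up to a fixed capacity."""

    def __init__(self, capacity=10_000, refill_per_sec=100.0):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.balance = float(capacity)
        self.last_refill = time.monotonic()

    def _refill(self):
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.balance = min(self.capacity,
                           self.balance + elapsed * self.refill_per_sec)
        self.last_refill = now

    def try_spend(self, estimated_tokens):
        """Deduct the estimated cost up front; return False to reject."""
        self._refill()
        if estimated_tokens <= self.balance:
            self.balance -= estimated_tokens
            return True
        return False
```

Note what this buys you: a user can make one 10,000-token request or a thousand 10-token requests per window, but not both. Limiting requests alone can't express that.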
Why This Matters
This question doesn't trip people up because they don't know System Design. It trips them up because they're on autopilot.
They hear "Rate Limiter" and their brain auto-completes to "Redis Token Bucket."
In 2025, the best engineers aren't the ones who memorized the "Cracking the Coding Interview" book. They're the ones who pause, look at the specific constraints of this problem, and realize that an AI Agent is a very different beast than a CRUD app.
Takeaway: Next time you're in an interview, don't rush to the solution. Fall in love with the problem first. Ask about the latency. Ask about the cost.
That's how you end up in the 20% who pass.
