🛑 "429 Too Many Requests" isn't just an error code; it's a survival strategy for your distributed systems. Stop treating rate limiting as a simple counter. To prevent crashes, you need the right algorithm. Here are the patterns you need to know.

𝐇𝐨𝐰 𝐰𝐞 𝐜𝐨𝐮𝐧𝐭:
1️⃣ Token Bucket: Each user gets a "bucket" of tokens that refills at a constant rate, up to a fixed capacity. Great for bursty traffic. If a user has been idle, they accumulate tokens and can make a sudden burst of requests without being throttled immediately. Use Case: Social media feeds or messaging apps.
2️⃣ Leaky Bucket: Requests enter a queue and are processed at a constant, fixed rate. Acts as a traffic shaper. It smooths out spikes, protecting your database from write-heavy shockwaves. Use Case: Throttling network packets or writing to legacy systems.
3️⃣ Fixed Window: A simple counter resets at specific time boundaries (e.g., the top of the minute). Easiest to implement but suffers from the "boundary double-hit" issue: with a limit of 100 per minute, a client can send 100 requests at 12:00:59 and 100 more at 12:01:01, which is 200 requests in about two seconds. Use Case: Basic internal tools where precision isn't critical.
4️⃣ Sliding Window Log: Tracks the timestamp of every request. Solves the boundary issue completely. It's highly accurate but expensive on memory (O(N) space per client, where N is the number of requests in the window) because you store logs, not just a count. Use Case: High-precision, low-volume APIs.
5️⃣ Sliding Window Counter: The hybrid approach. Approximates the rate by weighting the count of the previous window and adding the count of the current window. Low memory footprint, high accuracy. Use Case: Large-scale systems handling millions of RPS.

𝐖𝐡𝐞𝐫𝐞 𝐰𝐞 𝐞𝐧𝐟𝐨𝐫𝐜𝐞:
6️⃣ Distributed Rate Limiting: Essential for microservices. You cannot rely on local memory; you need a centralized store (like Redis with Lua scripts) to maintain a global count across the cluster.
7️⃣ Fixed Window with Quota: Often distinct from technical throttling. This is business logic: hard caps over long periods (months/years). Use Case: Tiered billing plans (e.g., "Free Tier: 10k calls/month").
8️⃣ Adaptive Rate Limiting: The "smart" limiter. It doesn't use static numbers but monitors system health (CPU, memory, latency). If the system struggles, it tightens the limits automatically. Use Case: Auto-scaling systems and disaster recovery.

𝐖𝐡𝐨 𝐰𝐞 𝐥𝐢𝐦𝐢𝐭:
9️⃣ IP-Based Rate Limiting: The first line of defense. Limits based on the source IP to slow down bots and blunt simple DDoS attacks. Use Case: Public-facing unauthenticated APIs.
🔟 User/Tenant-Based Rate Limiting: Limits based on API Key or User ID. Ensures one heavy user doesn't degrade performance for others (the "Noisy Neighbor" problem). Use Case: SaaS platforms and multi-tenant architectures.

💡 For most production systems, Sliding Window Counter combined with Distributed Limiting is the gold standard. It offers the best balance of memory efficiency and user fairness.

#SystemDesign #SoftwareArchitecture #API #Microservices #DevOps #BackendEngineering #RateLimiting #CloudComputing
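The Sliding Window Counter from 5️⃣ can be sketched in a few lines. This is a minimal single-process illustration under stated assumptions, not a production limiter: the class name, the `allow` method, and the explicit `now` argument (seconds, passed in so the example is deterministic) are all invented for the example. A distributed version would keep the two counters in a shared store, as 6️⃣ describes.

```python
import math


class SlidingWindowCounter:
    """Approximate sliding-window limiter that stores only two counters."""

    def __init__(self, limit: int, window: float = 60.0) -> None:
        self.limit = limit          # max requests allowed per window
        self.window = window        # window length in seconds
        self.window_index = None    # index of the current fixed window
        self.current = 0            # requests counted in the current window
        self.previous = 0           # requests counted in the previous window

    def allow(self, now: float) -> bool:
        index = math.floor(now / self.window)
        if index != self.window_index:
            # Roll over. The old "current" count only matters if it belongs
            # to the window immediately before the new one.
            adjacent = (self.window_index is not None
                        and index == self.window_index + 1)
            self.previous = self.current if adjacent else 0
            self.current = 0
            self.window_index = index

        # Weight the previous window by the share of it that still overlaps
        # the sliding window ending at `now`.
        elapsed_fraction = now / self.window - index
        estimated = self.previous * (1.0 - elapsed_fraction) + self.current
        if estimated + 1 > self.limit:
            return False
        self.current += 1
        return True


# The "boundary double-hit" scenario from the Fixed Window description:
limiter = SlidingWindowCounter(limit=100, window=60.0)
burst_before = sum(limiter.allow(59.0) for _ in range(100))  # 12:00:59
burst_after = sum(limiter.allow(61.0) for _ in range(100))   # 12:01:01
print(burst_before, burst_after)
```

With a plain fixed window both bursts would pass. Here the first burst is admitted in full, while almost all of the second is rejected because the previous window still carries nearly its full weight.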
Handling API Rate Limits Without Frustration
Summary
Handling API rate limits without frustration means managing how often users or systems can call an API to prevent overload, downtime, or slow performance, while ensuring access stays fair and reliable. Rate limits are rules that restrict the number of requests allowed over a set period, helping protect backend resources and maintain a smooth user experience.
- Set clear boundaries: Choose an algorithm and enforcement method that fits your traffic patterns, so your API responds smoothly even during spikes or heavy use.
- Communicate limits: Always return proper status codes and helpful headers to inform clients when to pause requests or try again, avoiding unnecessary retries.
- Balance performance: Select the right storage or caching strategy—local memory, distributed cache, or persistent database—based on your speed, reliability, and consistency needs.
-
A candidate interviewing for a Senior Engineer @ Meta was asked to design a rate limiter. Another candidate at Google's L5 loop got hit with the same question. I've been asked this three times across different companies.

Rate-limiting questions look simple until you add one layer of complexity:
– Add distributed rate limiting? Now you're dealing with race conditions and clock skew.
– Add multiple rate limit tiers? Welcome to priority queues and quota management.
– Add per-user, per-IP, and per-API-key limits? Your Redis bill just exploded.

Here's my personal checklist of 15 things you must get right when building rate limiters:

1. Always do rate limiting on the server, not the client → Client-side limits are useless. They're easily bypassed, so always enforce limits on your backend.
2. Choose the right placement → For most web APIs, place the rate limiter at the API gateway or load balancer (the "edge") for global protection and minimal added latency.
3. Identify users correctly → Use a combination of user ID, API key, and IP address. Apply stricter limits for anonymous/IP-only clients, higher for authenticated or premium users.
4. Support multiple rule types → Allow per-user, per-IP, and per-endpoint limits. Make rules configurable, not hardcoded.
5. Pick an algorithm that fits your needs → Know the pros/cons:
– Fixed Window: Easy, but suffers from burst issues.
– Sliding Log: Accurate, but memory-heavy.
– Sliding Window Counter: Good balance, small memory footprint.
– Token Bucket: Handles bursts and steady rates, an industry standard for distributed systems.
6. Store rate limit state in a fast, shared store → Use an in-memory cache like Redis or Memcached. Every gateway instance must read and write to this store, so limits are enforced globally.
7. Make every check atomic → Use atomic operations (e.g., Redis Lua scripts or MULTI/EXEC) to avoid race conditions and double-accepting requests.
8. Shard your cache for scale → Don't rely on a single Redis instance. Use Redis Cluster or consistent hashing to scale horizontally and handle millions of users/requests.
9. Build in replication and failover → Each cache node should have replicas. If a primary fails, replicas take over. This keeps the system available and fault-tolerant.
10. Decide your "failure mode" → Fail-open (let all requests through if the cache is down) = risk of backend overload. Fail-closed (block all requests) = user-facing downtime. For critical APIs, prefer fail-closed to protect backend.
11. Return proper status codes and headers → Use HTTP 429 for "Too Many Requests." Include headers like:
– X-RateLimit-Limit
– X-RateLimit-Remaining
– X-RateLimit-Reset
– Retry-After
This helps clients know when to back off.
12. Use connection pooling for cache access → Avoid reconnecting to Redis on every check. Pool connections to minimize latency.

Continued in Comments...
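Point 11 can be made concrete with a small helper that builds the status code and headers for one rate-limit decision. It is a framework-agnostic sketch: the function name and parameters are hypothetical, and it assumes `X-RateLimit-Reset` is reported as a Unix timestamp in seconds. The `X-RateLimit-*` names are a widely used convention rather than a standard, so providers differ on their exact meaning. `Retry-After` is a standard HTTP header and is emitted here in its delta-seconds form.

```python
import math


def build_rate_limit_response(allowed: bool, limit: int, remaining: int,
                              reset_epoch: float, now: float):
    """Return (status_code, headers) for one rate-limit decision."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        # Assumption: reset time is exposed as a Unix timestamp (seconds).
        "X-RateLimit-Reset": str(math.ceil(reset_epoch)),
    }
    if allowed:
        return 200, headers
    # Tell the client how long to wait; never advertise a 0-second wait.
    headers["Retry-After"] = str(max(1, math.ceil(reset_epoch - now)))
    return 429, headers


status, headers = build_rate_limit_response(
    allowed=False, limit=100, remaining=0,
    reset_epoch=1_700_000_060, now=1_700_000_042.5)
print(status, headers)
```

Sending the same headers on successful responses lets well-behaved clients slow down before they ever see a 429.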
-
A candidate for an L5 role at Google failed their system design round because they couldn't explain tradeoffs well. The question was simple: "What store do you pick for a public API rate limiter?" The candidate answered "Redis" within five seconds. The answer was not wrong, but it was incomplete. Let me explain…

High-scale design requires you to solve the constraints before you name a database. The storage choice should be the very last thing you decide.

a) Define the performance requirements
A rate limiter is a tax on every incoming request. You have to establish a latency budget before you look at any tech stack.
– Exactness: Can you afford a 5% margin of error in the count?
– Burst tolerance: How will the system react to a 10x spike in 100ms?
– Coordination: Do multiple API nodes need to share a global counter?
If the latency budget is under 1ms, a network round trip to a remote store will usually consume most or all of it. You have to keep the state local.

b) Evaluate the storage tier trade-offs
Every choice dictates how your API behaves when traffic hits. You are deciding where the complexity lives.
– In-memory (Local): This is the fastest path. It uses the app's own RAM. Latency is negligible, but every node has its own version of the truth.
– Distributed (Redis): This allows all nodes to share a single counter. You get global consistency, but you add a network hop to every single API call.
– Durable (SQL/NoSQL): Use this for billing-critical limits that must persist across restarts. The latency cost is massive.

c) Design for failure behavior
A centralized store can become a single point of failure. If the rate limiter is down, you have to decide the fate of your API.
– Fail open: You allow all traffic. This protects the user experience but risks a database meltdown during an attack.
– Fail closed: You block all traffic. This protects the infrastructure but destroys your uptime.
The store choice should support your fallback strategy. If you cannot fail closed, you likely need a hybrid approach with local overrides.

d) Match the store to the constraint
Finalize the decision using data. Avoid choosing a tool based on personal preference.
– For high-speed APIs where global exactness is secondary, use local in-memory stores with sticky sessions.
– For public APIs requiring a strict global ceiling, use a distributed cache like Redis or Memcached.
– For billing-critical systems, use a local count that syncs to a durable store asynchronously.

Start with the constraints. The tool name is just the final piece of the puzzle. Design for the failure scenario first.
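The failure behavior in (c) can be written down as an explicit policy rather than left to whatever the client library does on error. The sketch below is one possible shape, with hypothetical names throughout: `shared_check` stands in for a call to the distributed store, `local_check` for an optional in-process limiter used as the "local override", and `fail_open` for the policy applied when neither is available.

```python
from typing import Callable, Optional


def check_limit(key: str,
                shared_check: Callable[[str], bool],
                local_check: Optional[Callable[[str], bool]] = None,
                fail_open: bool = False) -> bool:
    """Return True if the request identified by `key` may proceed."""
    try:
        return shared_check(key)
    except (ConnectionError, TimeoutError):
        # Shared store is unreachable: prefer a local approximation,
        # otherwise apply the configured fail-open / fail-closed policy.
        if local_check is not None:
            return local_check(key)
        return fail_open


def broken_store(key: str) -> bool:
    """Simulates an outage of the shared store (for illustration only)."""
    raise ConnectionError("shared store unreachable")


print(check_limit("tenant-42", broken_store))                  # fail closed
print(check_limit("tenant-42", broken_store, fail_open=True))  # fail open
print(check_limit("tenant-42", broken_store,
                  local_check=lambda key: True))               # hybrid
```

Making the policy a parameter keeps the decision visible in configuration, so billing-critical and best-effort endpoints can choose differently.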
-
Imagine you're in a system design interview at Google for an L5 role, and the interviewer asks: "If 10M users hit your API at the same time and your rate limiter allows 1000 req/sec, what happens to the other 9.99M?"

This is a classic overload-control + retry-amplification problem.

Btw, if you're preparing for system design interviews, check out our AI Tutor: https://lnkd.in/gcWfR7jW
You can:
- voice chat about your questions in real-time
- get feedback in real time and improve with these sessions
- learn concepts, practice HLD questions even if you're a complete beginner

Here is how I would break it down.

[1] Clarify what we actually need to build
This is not just "return 429 when over the limit." It is:
- protect the backend from overload
- keep latency stable for the requests we do accept
- avoid retry storms from rejected clients
- give clients a fair chance to recover
- degrade gracefully instead of turning 10M requests into 20M
So the core problem is not only rate limiting. It is admission control plus controlled recovery behavior.

[2] The other 9.99M cannot all get immediate retries
If all rejected requests get a 429 and retry immediately, the limiter becomes part of the problem. A better model is:
- accept up to the allowed rate
- reject excess traffic quickly
- return backoff hints like `Retry-After`
- force clients and SDKs to use exponential backoff + jitter
- optionally queue a small bounded overflow only if the business case justifies it
The key idea is simple: do not turn rejection into amplification.

[3] High-level flow
A reasonable design would be:
- clients hit edge load balancers / API gateway
- request first passes through a distributed rate limiter
- accepted requests move to the backend
- rejected requests get a fast 429 or graceful degradation response
- clients retry later using backoff, not instantly
- observability layer tracks rejection rate, retry rate, queue depth, and user impact
The limiter is only one part. The client behavior matters just as much.

[4] What should happen to the rejected traffic?
This depends on the API. For example:
- interactive read APIs: reject fast, retry later
- write APIs: maybe accept into a bounded queue if loss is costly
- idempotent operations: safer to retry
- non-critical traffic: drop or degrade early
- premium / internal traffic: separate priority buckets
So the answer is not "all 9.99M get blocked." The answer is "different classes of traffic may be handled differently."

[5] The tradeoffs interviewers care about
This is where the answer gets interesting:
- immediate 429 is cheap, but dangerous if clients retry badly
- queues smooth bursts, but can increase latency and memory pressure
- token bucket handles bursts better than a strict per-second counter
- fairness matters so one tenant or region does not starve everyone else
- backoff with jitter is critical to avoid synchronized retries
- if the limiter itself fails, fail-open vs fail-closed depends on the API
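The client half of this design, exponential backoff with jitter that respects `Retry-After`, fits in one small function. This is a sketch under stated assumptions: the function name, its defaults (1 s base, 60 s cap), and the choice of "full jitter" (a uniform random delay between zero and the exponential ceiling) are illustrative, and `retry_after` is assumed to have already been parsed into seconds.

```python
import random
from typing import Callable, Optional


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  retry_after: Optional[float] = None,
                  rng: Callable[[], float] = random.random) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    # The exponential ceiling doubles per attempt and is capped.
    ceiling = min(cap, base * (2 ** attempt))
    # Full jitter: pick uniformly in [0, ceiling) so clients spread out.
    jitter = rng() * ceiling
    if retry_after is not None:
        # Never retry earlier than the server asked. Jitter is added on top
        # so rejected clients do not all return at the same instant.
        return retry_after + jitter
    return jitter


for attempt in range(5):
    print(attempt, round(backoff_delay(attempt), 2))
```

Without the jitter term, every client rejected in the same second would come back in the same later second, which recreates the original spike.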
-
I spent over 18 months of my life working 50-60-hour workweeks at Postman, when I restarted my career after doing my master's at IIITH. One of my biggest lessons from dealing with APIs all day was that if you're scaling your production API to millions of users, you need rate limiters.

But how are they applied exactly, and how do they work? Let me break it down for you with the example of Stripe:

● Why Do APIs Need Rate Limiting?
– When your app gets a sudden surge of users, API traffic can spike to millions of requests a minute.
– Sometimes it's a runaway script, a spam bot, or just honest users trying to batch-process a lot of data.
– If one user (or bug) floods your servers, it can slow down or crash your service for everyone.
– Rate limiting sets boundaries so no single user, bug, or partner can bring your API down.

● How Does Stripe Use Rate Limiters?
Let's break down Stripe's 4-layer rate limiting and load shedding system.

● Request Rate Limiter
– Limits how many API requests a user can send per second (e.g., 100 req/sec).
– Prevents a single customer or buggy script from overloading the system.
– Stripe uses the token bucket algorithm: every request "spends" a token, and tokens refill at a steady pace.
– Allows for small, quick bursts in traffic (like during a flash sale) but smooths things back down.
– If you go over, you get an HTTP 429 ("Too Many Requests").

● Concurrent Requests Limiter
– Restricts how many API requests you can have in progress at one time (e.g., only 20 running at once).
– Useful for expensive endpoints that use a lot of CPU/memory.
– Encourages users to finish their current jobs before sending new ones.
– Solved real issues at Stripe, where too many simultaneous requests to heavy endpoints caused slowdowns.

● Fleet Usage Load Shedder
– Reserves part of Stripe's infrastructure for the most critical API requests.
– Keeps essential operations (like charging a card) flowing, even if less-critical actions (like listing charges) have to wait.
– If non-critical traffic uses up too much capacity, those requests are dropped (shed) to protect high-priority traffic.
– Stripe splits infrastructure: for example, 20% saved for critical, 80% for everything else.

● Worker Utilization Load Shedder
– Monitors how busy API workers (the backend servers) are.
– When things get overloaded, starts dropping less important requests in order:
+ Test mode traffic goes first,
+ Then GETs,
+ Then POSTs,
+ Critical actions are always last to be dropped.
– This protects the system during major incidents or sudden surges.
– Shedding ramps up slowly, so the system doesn't keep toggling between overload and normal (avoids "flapping").

Continued in comments...
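The token bucket behind the Request Rate Limiter above is small enough to show in full. This is a generic single-process sketch and not Stripe's implementation: the class name, the `allow` method, and the explicit `now` argument (seconds, passed in to keep the example deterministic) are assumptions made for illustration.

```python
class TokenBucket:
    """Each request spends a token; tokens refill at a steady rate."""

    def __init__(self, capacity: float, refill_rate: float,
                 now: float = 0.0) -> None:
        self.capacity = capacity        # maximum burst size, in tokens
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = capacity          # start full so a first burst passes
        self.last_refill = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        if now > self.last_refill:
            # Refill lazily, based on the time since the previous check,
            # and never beyond the bucket's capacity.
            elapsed = now - self.last_refill
            self.tokens = min(self.capacity,
                              self.tokens + elapsed * self.refill_rate)
            self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


bucket = TokenBucket(capacity=10, refill_rate=2.0)
first_burst = sum(bucket.allow(0.0) for _ in range(15))   # bucket starts full
second_burst = sum(bucket.allow(2.5) for _ in range(15))  # 2.5 s of refill
print(first_burst, second_burst)
```

The capacity controls how large a burst is tolerated, and the refill rate sets the sustained request rate, so the two can be tuned independently.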
-
Imagine your AI agent burned through $50K in API calls overnight. How could this happen? Simple, a lack of guardrails. Yes, autonomous AI systems are incredibly powerful but they can also be incredibly dangerous without proper boundaries. This is why "Design for Controlled Autonomy" is a core design principle in AWS's GenAI Lens Framework. Think about this: Would you give a junior developer root access to production on day one? No, so why would you let an AI agent operate without constraints? Here's what controlled autonomy looks like: ✓ Operational Requirements Define EXACTLY what your AI can and cannot do. Set token limits, rate limits, and scope boundaries. No exceptions. ✓ Security Controls Implement least-privilege access. Your AI should only touch what it needs to complete its task. The same applies to the tools you give it. Nothing more. ✓ Failure Conditions Build stopping conditions. Set thresholds for when the system should stop, alert, or fail gracefully. Assume failures WILL happen. ✓ Cost Boundaries Set hard caps on API calls, compute resources, and data processing. Monitor usage in real-time, not after the damage is done. ✓ Safe Parameters Define acceptable behavior ranges. If your AI starts acting outside these bounds, it should trigger immediate intervention. The goal is to implement your agent safely without limiting its potential. Autonomy without control = chaos. Control without autonomy = bottleneck. Controlled autonomy = scalable innovation. Most AI failures in production aren't model issues. They're architecture issues. Build the guardrails before you need them. Your future self (and your Leadership) will thank you. What's your approach to setting AI guardrails? Drop your strategies below 👇🏾 #AgenticAI #AIEngineering #CloudArchitecture #AWS #MachineLearning #MLOps #DevOps #ArtificialIntelligence
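The "Cost Boundaries" and "Failure Conditions" points above amount to checking a budget before every call instead of reading the bill afterwards. A minimal sketch, with all names hypothetical and the per-call cost assumed to be an estimate supplied by the caller:

```python
class BudgetExceeded(RuntimeError):
    """Raised when an agent would cross one of its hard caps."""


class BudgetGuard:
    """Hard caps on the number of API calls and on estimated spend."""

    def __init__(self, max_calls: int, max_cost_usd: float) -> None:
        self.max_calls = max_calls
        self.max_cost_usd = max_cost_usd
        self.calls = 0
        self.cost_usd = 0.0

    def charge(self, estimated_cost_usd: float) -> None:
        """Call BEFORE each API request; raises instead of overspending."""
        if self.calls + 1 > self.max_calls:
            raise BudgetExceeded("call cap reached")
        if self.cost_usd + estimated_cost_usd > self.max_cost_usd:
            raise BudgetExceeded("cost cap reached")
        self.calls += 1
        self.cost_usd += estimated_cost_usd


guard = BudgetGuard(max_calls=1000, max_cost_usd=1.00)
try:
    while True:
        guard.charge(0.40)  # assumed cost of one agent step
        # ... the agent would perform its API call here ...
except BudgetExceeded as stop:
    print(f"agent halted after {guard.calls} calls: {stop}")
```

Because the check runs before the call, the cap is never crossed, and the exception gives the surrounding system one place to alert or hand off to a human.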
-
A year ago, when I started building my agentic AI platform for Cyber GRC, I ran straight into a wall most AI practitioners know too well: prompt -> review -> refine -> re-prompt -> repeat. Casual prompting meant endless iterations with the model, burning through tokens while still getting inconsistent results. Even on a Max license, I was hitting rate limits regularly. Not because I was doing anything exotic, but because unstructured AI usage is inherently wasteful. Such an approach is also a time drain and sometimes the loop can be endless. That experience forced a fundamental shift in how I approach AI workflows. I stopped optimizing prompts and started engineering harnesses. The difference was immediate. By decomposing workflows into discrete hops, with deterministic logic handling parsing, validation, and execution while reserving AI inference for the steps that genuinely need it, I started identifying problems sooner and writing more efficient code. Token consumption dropped. Output quality went up. Consistency became the default rather than the exception. A few patterns emerged that I now consider foundational: - Resolvers that evaluate each decision point and route to different execution paths optimized for efficiency, cost, or quality, depending on context. Not every task needs the most expensive inference call. - Drill-down chains modeled after the consulting "7 Whys" approach, where each hop peels back a layer toward root cause rather than asking the model to solve everything in one pass. - Capability-focused agents where agents propose and MCP services execute, with runbooks providing the specificity these agents need to operate precisely. That last point has become the architectural principle for the platform I am building: keep agents focused on what they do best (reasoning, analysis, recommendation) and let structured services handle execution. The runbook is the bridge; the harness is the steering mechanism. This is not theory. 
This is what moved my work from fighting rate limits to producing reliable outcomes. Curious to hear where you are in your journey? #HarnessEngineering #AIEngineering #PromptEngineering #AgenticAI #Cybersecurity #AIStrategy #TokenOptimization #AIWorkflows #FutureOfWork
-
🚨 If you're hitting "rate limit exceeded" or "token limit exceeded" errors on your AI models, you are not alone! 🤚

A rate limit error occurs when API requests are sent too quickly. Every LLM provider has rate limiting built in to prevent abuse, and your application should have it too! You may be hitting RPM (requests per minute), RPD (requests per day), TPM (tokens per minute), TPD (tokens per day), or IPM (images per minute) limits.

👇👇 Here are a few common strategies to avoid rate limiting:

💻 Application layer: Limit the number of requests users can make to control the rate of requests from your application. If your API limit is 20 requests/minute, add 3-6 second delays between calls.
🤖 Server side: Retry with exponential backoff using libraries like "backoff" or "tenacity", or a custom implementation. Wait 1 sec → 2 sec → 4 sec → 8 sec with random jitter to prevent thundering herds.
⏭️ Multiple models: Have fallback providers (OpenAI → Anthropic → Google) and load balance across multiple API keys/accounts to multiply your quota. One account = 1000 RPM, three accounts = 3000 RPM.
📦 Batching requests: If you're hitting the "requests per minute" limit and not the "tokens per minute" limit, consider batching multiple tasks into each request. Providers like OpenAI, Gemini, Grok, and Claude support it.
🧠 Caching: Store responses for repeated queries and implement semantic caching for similar requests. This may not apply to every application.
📊 Monitoring: Implement logging and monitoring to track rate limit errors, retry counts, and overall API usage. Set alerts at 80% capacity, not when you're already down.

💡 Pro tip: Response headers can return important information about rate limits, such as the remaining requests, tokens, and other metadata. Failed requests often still count against your quota!

Some libraries and tools provide built-in rate limiting and fallback strategies, such as LangChain, OpenRouter, and Portkey.

Plan for limits, don't fight them 💪 Lmk what you think of these approaches! 🚀🚀
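The application-layer advice works out to simple arithmetic: a limit of 20 requests/minute means at least 60 / 20 = 3 seconds between calls. The sketch below enforces that spacing. The `Pacer` class and its parameters are hypothetical, and the clock and sleep functions are injectable so the example runs instantly with a fake clock; real code would use the defaults.

```python
import time
from typing import Callable, Optional


class Pacer:
    """Spaces out calls so they never exceed a requests-per-minute limit."""

    def __init__(self, requests_per_minute: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.min_interval = 60.0 / requests_per_minute
        self.clock = clock
        self.sleep = sleep
        self.next_allowed: Optional[float] = None

    def wait(self) -> float:
        """Block until the next call is allowed; return seconds waited."""
        now = self.clock()
        delay = 0.0
        if self.next_allowed is not None and now < self.next_allowed:
            delay = self.next_allowed - now
            self.sleep(delay)
            # Re-read the clock in case the sleep overshot.
            now = max(self.clock(), self.next_allowed)
        self.next_allowed = now + self.min_interval
        return delay


# Demo with a fake clock so nothing actually sleeps.
fake_now = [0.0]
slept = []
pacer = Pacer(20, clock=lambda: fake_now[0], sleep=slept.append)
print(pacer.min_interval)          # minimum spacing in seconds
print(pacer.wait(), pacer.wait())  # first call is free, second must wait
```

Pacing like this complements backoff: pacing keeps you under the limit in normal operation, and backoff handles the cases where you hit it anyway.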
-
Your app stopped working at 2pm. Every user getting "Too Many Requests" errors. You check logs. 429 errors everywhere. You hit Stripe's API rate limit. You didn't know you had one. You're making 100 API calls per second. Their limit is 25. Your app has been hammering their API for weeks. They rate limited you. Now nothing works. Customer checkouts failing. Support overwhelmed. Cost: $6K in lost revenue. Half a day fixing it. Here's what happened: Your app grew. More users. More API calls. Nobody was tracking API usage against provider limits. Nobody set up caching. Every request hit the API directly. You assumed infinite capacity. You were wrong. Here's what should exist: Know your API limits. Every third-party service has rate limits. Document them. Implement caching. Don't call the API for data you already have. Implement backoff. When you get rate limited, slow down automatically. Monitor API usage. Track calls per minute. Alert before you hit limits. Load test with realistic API constraints. Find limits before production does. If you don't know your third-party API limits, you will hit them in production. Not if. When. 28 years in technology. 17 years as CTO. Every API integration I've built included rate limit handling from day one. Integrating with third-party APIs and not sure about rate limits? Schedule a call at: bry.net Before you hit them and take down your app.
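"Monitor API usage" and "alert before you hit limits" can start as a few lines wrapped around every outbound call. The sketch below keeps a log of recent call timestamps and flags when usage reaches a threshold. The class name is hypothetical, the 80% threshold is an assumed default, and the limit of 25 calls per second mirrors the story above rather than any documented provider limit.

```python
from collections import deque


class UsageMonitor:
    """Tracks outbound calls in a sliding window and flags high usage."""

    def __init__(self, limit: int, window: float = 1.0,
                 alert_percent: int = 80) -> None:
        self.limit = limit                  # provider limit per window
        self.window = window                # window length in seconds
        self.alert_percent = alert_percent  # alert threshold, in percent
        self.calls = deque()                # timestamps of recent calls

    def record(self, now: float) -> bool:
        """Record one call; return True if usage is at or above threshold."""
        self.calls.append(now)
        # Drop timestamps that have aged out of the window.
        while self.calls and self.calls[0] <= now - self.window:
            self.calls.popleft()
        # Integer comparison avoids floating-point rounding at the boundary.
        return len(self.calls) * 100 >= self.alert_percent * self.limit


monitor = UsageMonitor(limit=25, window=1.0)
alerts = [monitor.record(0.01 * i) for i in range(22)]  # 22 calls in 0.21 s
print(alerts.index(True) + 1)  # which call first triggered the alert
```

Storing one timestamp per call is affordable on the client side, where the volume is your own traffic. The alert fires while there is still headroom, which is the point of monitoring against the provider's limit.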