LangChain Tutorial (Python): rate limiting API calls for intermediate developers
This tutorial shows you how to rate limit LangChain API calls in Python so your app stops hammering providers like OpenAI, Anthropic, or any HTTP-backed tool chain. You need this when you have bursts of traffic, shared quotas, or an agent that loops too aggressively and starts throwing 429s.
What You'll Need
- Python 3.10+
- langchain, langchain-openai, and tenacity
- An OpenAI API key exported as OPENAI_API_KEY
- Optional: a Redis instance if you want distributed rate limiting across multiple workers
- A basic LangChain setup with ChatOpenAI
Step-by-Step
- Install the packages and verify your environment.
Keep the dependency set small; for most cases, a local token bucket plus retry logic is enough.
pip install langchain langchain-openai tenacity
export OPENAI_API_KEY="your-api-key"
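If you want a quick sanity check before writing any limiter code, a few lines of Python will confirm that the packages import and that the key is visible to your process. This snippet is just a convenience check added here, not part of the tutorial's main script:

import importlib
import os

# Confirm the three dependencies actually import; raises ImportError otherwise.
for pkg in ("langchain", "langchain_openai", "tenacity"):
    importlib.import_module(pkg)

# Confirm the key was exported in this shell session.
assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"
print("Environment looks good.")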
- Build a simple token bucket limiter in Python.
This version limits calls per minute in-process, which is fine for a single worker or local dev box.
import time
import threading


class TokenBucketRateLimiter:
    """In-process token bucket: capacity = rate_per_minute, refilled continuously."""

    def __init__(self, rate_per_minute: int):
        self.capacity = rate_per_minute
        self.tokens = rate_per_minute
        self.refill_rate = rate_per_minute / 60.0  # tokens added per second
        self.updated_at = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self) -> None:
        """Block until a token is available, then consume it."""
        while True:
            with self.lock:
                now = time.monotonic()
                elapsed = now - self.updated_at
                self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
                self.updated_at = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                wait_time = (1 - self.tokens) / self.refill_rate
            # Sleep outside the lock so other threads are not blocked while we wait.
            time.sleep(wait_time)
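Before wiring the bucket into LangChain, it is worth confirming the pacing behaves as expected. The short check below is an addition of mine, uses no API calls, and picks a deliberately tiny rate so the waits are visible:

# Sanity check: with rate_per_minute=2 the bucket starts with 2 burst tokens,
# so the first two acquires return immediately and each later one waits ~30s.
demo = TokenBucketRateLimiter(rate_per_minute=2)
start = time.monotonic()
for i in range(4):
    demo.acquire()
    print(f"acquire {i} at t={time.monotonic() - start:.0f}s")  # roughly 0, 0, 30, 60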
- Wrap your LangChain call with the limiter and retry on transient failures.
The limiter controls request pace, and retries handle occasional provider-side throttling or network noise.
import os

from tenacity import retry, stop_after_attempt, wait_exponential_jitter
from langchain_openai import ChatOpenAI

limiter = TokenBucketRateLimiter(rate_per_minute=30)

llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0,
    api_key=os.environ["OPENAI_API_KEY"],
)

@retry(
    stop=stop_after_attempt(5),
    wait=wait_exponential_jitter(initial=1, max=20),
)
def limited_chat(prompt: str) -> str:
    limiter.acquire()  # take a token (or wait) before every attempt, including retries
    response = llm.invoke(prompt)
    return response.content

print(limited_chat("Write one sentence about rate limiting in AI apps."))
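One refinement worth considering: the decorator above retries on any exception, including ones a retry will never fix (bad credentials, malformed prompts). tenacity can filter by exception type. The sketch below continues the script above and assumes the openai v1 SDK's RateLimitError and APIConnectionError are what surface from ChatOpenAI in your setup; adjust the tuple to whatever errors you actually see:

import openai
from tenacity import retry_if_exception_type

@retry(
    retry=retry_if_exception_type((openai.RateLimitError, openai.APIConnectionError)),
    stop=stop_after_attempt(5),
    wait=wait_exponential_jitter(initial=1, max=20),
)
def limited_chat_strict(prompt: str) -> str:
    # Same pacing as limited_chat, but only throttling and network errors are retried.
    limiter.acquire()
    return llm.invoke(prompt).content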
- Use the same limiter inside a LangChain pipeline.
This is the pattern you want when your chain has multiple steps but only one external model call needs protection.
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a concise assistant."),
    ("user", "{topic}"),
])

chain = prompt | llm | StrOutputParser()

def limited_chain(topic: str) -> str:
    limiter.acquire()
    return chain.invoke({"topic": topic})

print(limited_chain("Explain why rate limiting matters for API integrations"))
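As an aside, recent langchain-core releases ship a built-in InMemoryRateLimiter that can be attached directly to the chat model, which pushes the pacing below the chain so you do not have to call acquire() yourself. The following is a rough equivalent of the 30-requests-per-minute setup above; check that your installed langchain-core version includes it before relying on it:

from langchain_core.rate_limiters import InMemoryRateLimiter

builtin_limiter = InMemoryRateLimiter(
    requests_per_second=0.5,    # ~30 requests per minute
    check_every_n_seconds=0.1,  # how often blocked callers re-check the bucket
    max_bucket_size=5,          # burst allowance
)

paced_llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0,
    api_key=os.environ["OPENAI_API_KEY"],
    rate_limiter=builtin_limiter,
)
paced_chain = prompt | paced_llm | StrOutputParser()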
- If you need concurrency, gate every outbound call through the same limiter.
This prevents a thread pool or async worker fan-out from bypassing your quota control.
from concurrent.futures import ThreadPoolExecutor

topics = [
    "What is a token bucket?",
    "Why do APIs return 429?",
    "How do retries help with throttling?",
]

with ThreadPoolExecutor(max_workers=3) as executor:
    results = list(executor.map(limited_chain, topics))

for result in results:
    print(result)
Testing It
Run the script with three or more prompts in quick succession and watch the timestamps between requests; after the initial burst capacity is used, calls should slow down instead of firing all at once. If you lower rate_per_minute to something small like 2, the pacing becomes obvious immediately.
Then intentionally send more requests than your provider allows and confirm you see fewer 429s because your app is smoothing traffic before it hits the API. If you still get throttled, reduce concurrency first, then lower the token bucket rate until the errors disappear.
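A small timing harness makes the pacing easy to see without staring at provider dashboards. This is just a testing aid added here, reusing limited_chain from the pipeline step:

import time

start = time.monotonic()
for topic in ["token buckets", "HTTP 429", "retry jitter"]:
    began = time.monotonic() - start
    answer = limited_chain(topic)
    done = time.monotonic() - start
    # With a low rate_per_minute, the gap between start times grows after the burst.
    print(f"started {began:5.1f}s, finished {done:5.1f}s: {answer[:60]!r}")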
Next Steps
- Add Redis-backed distributed limiting so multiple app instances share one quota.
- Move from sync to async with ainvoke() and an async semaphore-based limiter; a minimal sketch follows below.
- Track per-user and per-tenant quotas so one noisy client does not starve everyone else.
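For the async direction, here is a minimal sketch of the semaphore idea, assuming the same chain object from the pipeline step. It caps in-flight requests and holds each slot long enough to keep throughput near 30 requests per minute; treat it as a starting point, not a drop-in replacement for the token bucket:

import asyncio

sem = asyncio.Semaphore(3)   # at most 3 calls in flight at once
SLOT_SECONDS = 6.0           # 3 slots, one call per 6s each: at most ~30 requests/minute

async def limited_chain_async(topic: str) -> str:
    async with sem:
        result = await chain.ainvoke({"topic": topic})
        # Hold the slot so the fan-out cannot exceed the target rate.
        await asyncio.sleep(SLOT_SECONDS)
        return result

async def main() -> None:
    topics = ["async rate limiting", "token buckets", "429 handling"]
    print(await asyncio.gather(*(limited_chain_async(t) for t in topics)))

asyncio.run(main())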
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.