LangChain Tutorial (Python): rate limiting API calls for beginners

By Cyprian Aarons · Updated 2026-04-21

This tutorial shows you how to rate limit LangChain API calls in Python so your app stays under provider quotas and avoids 429 errors. You’ll build a simple wrapper around a LangChain chat model, then add per-call throttling that works in real code.

What You'll Need

  • Python 3.10+
  • A virtual environment
  • langchain
  • langchain-openai
  • openai
  • An OpenAI API key
  • Basic familiarity with LangChain chat models and .invoke()

Install the packages:

pip install langchain langchain-openai openai

Set your API key:

export OPENAI_API_KEY="your-api-key-here"

Step-by-Step

  1. Start with a plain LangChain chat model. This gives you a baseline before adding throttling, and it uses the standard invoke() method you’ll see in most production code.
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

response = llm.invoke([HumanMessage(content="Write one sentence about rate limiting.")])
print(response.content)
  2. Add a simple rate limiter using time-based spacing between calls. This is the easiest pattern for beginners: every request waits long enough so you never exceed your target requests-per-second.
import time
from threading import Lock

class SimpleRateLimiter:
    """Enforces a minimum gap between calls; safe to share across threads."""

    def __init__(self, min_interval_seconds: float):
        self.min_interval_seconds = min_interval_seconds
        self._lock = Lock()
        self._last_call_time = None  # no call made yet

    def wait(self):
        with self._lock:
            # time.monotonic() can't jump backwards the way time.time() can
            # when the system clock is adjusted, so intervals stay reliable.
            now = time.monotonic()
            if self._last_call_time is not None:
                sleep_for = self.min_interval_seconds - (now - self._last_call_time)
                if sleep_for > 0:
                    time.sleep(sleep_for)
            self._last_call_time = time.monotonic()
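Before wiring the limiter into LangChain, you can sanity-check the spacing on its own by timing two consecutive wait() calls. The snippet below repeats the limiter (using time.monotonic, which is immune to system clock adjustments) so it runs standalone:

```python
import time
from threading import Lock

class SimpleRateLimiter:
    """Enforces a minimum gap between calls; safe to share across threads."""

    def __init__(self, min_interval_seconds: float):
        self.min_interval_seconds = min_interval_seconds
        self._lock = Lock()
        self._last_call_time = None  # no call made yet

    def wait(self):
        with self._lock:
            now = time.monotonic()
            if self._last_call_time is not None:
                sleep_for = self.min_interval_seconds - (now - self._last_call_time)
                if sleep_for > 0:
                    time.sleep(sleep_for)
            self._last_call_time = time.monotonic()

limiter = SimpleRateLimiter(min_interval_seconds=0.5)

start = time.perf_counter()
limiter.wait()  # first call: nothing recorded yet, returns immediately
limiter.wait()  # second call: sleeps until 0.5s have passed since the first
elapsed = time.perf_counter() - start
print(f"Two calls took {elapsed:.2f}s")  # roughly 0.5s
```

If the second call returned instantly, the limiter would be broken; the enforced gap is what keeps you under your requests-per-second target.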
  3. Wrap the LangChain model so every call passes through the limiter first. This keeps your application code clean, and you can reuse the wrapper anywhere you call the model.
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

class RateLimitedChatModel:
    def __init__(self, model_name: str, min_interval_seconds: float):
        self.llm = ChatOpenAI(model=model_name, temperature=0)
        self.limiter = SimpleRateLimiter(min_interval_seconds)

    def invoke(self, messages):
        self.limiter.wait()
        return self.llm.invoke(messages)

rate_limited_llm = RateLimitedChatModel("gpt-4o-mini", min_interval_seconds=2.0)

for i in range(3):
    result = rate_limited_llm.invoke([
        HumanMessage(content=f"Return only the number {i}.")
    ])
    print(i, "=>", result.content)
  4. If you’re using a chain, apply the same wrapper at the model boundary. The chain doesn’t need to know anything about rate limits; it just calls the wrapped model like normal.
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are concise."),
    ("human", "{topic}")
])

class PromptedRateLimitedModel:
    def __init__(self, model_name: str, min_interval_seconds: float):
        self.llm = RateLimitedChatModel(model_name, min_interval_seconds)

    def invoke(self, inputs):
        messages = prompt.invoke(inputs).to_messages()
        return self.llm.invoke(messages)

chain_model = PromptedRateLimitedModel("gpt-4o-mini", 1.5)

output = chain_model.invoke({"topic": "Explain retry-safe API usage in one line."})
print(output.content)
  5. Add retry handling for real-world bursts. Rate limiting prevents most quota issues, but retries help when the provider still returns transient failures like 429s or 503s.
import time

def invoke_with_retry(model, messages, max_retries=3):
    for attempt in range(max_retries + 1):
        try:
            return model.invoke(messages)
        except Exception as e:
            # In production, narrow this to transient errors (e.g.
            # openai.RateLimitError) so genuine bugs still surface immediately.
            if attempt == max_retries:
                raise
            wait_seconds = 2 ** attempt  # exponential backoff: 1s, 2s, 4s
            print(f"Retrying after error: {e}. Waiting {wait_seconds}s")
            time.sleep(wait_seconds)

message_batch = [HumanMessage(content="Give me a short fact about Python.")]
result = invoke_with_retry(rate_limited_llm, message_batch)
print(result.content)
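You can check the backoff logic without burning API calls by running the retry helper against a stub that fails twice and then succeeds. FlakyModel here is a made-up test double, not a LangChain class, and the retry function is repeated with an added base_delay parameter so the test finishes quickly:

```python
import time

def invoke_with_retry(model, messages, max_retries=3, base_delay=1.0):
    for attempt in range(max_retries + 1):
        try:
            return model.invoke(messages)
        except Exception as e:
            if attempt == max_retries:
                raise
            wait_seconds = base_delay * (2 ** attempt)  # exponential backoff
            print(f"Retrying after error: {e}. Waiting {wait_seconds}s")
            time.sleep(wait_seconds)

class FlakyModel:
    """Test double: raises on the first two calls, then returns a canned reply."""
    def __init__(self):
        self.calls = 0

    def invoke(self, messages):
        self.calls += 1
        if self.calls <= 2:
            raise RuntimeError("simulated 429")
        return f"ok after {self.calls} calls"

model = FlakyModel()
result = invoke_with_retry(model, ["hello"], base_delay=0.01)
print(result)  # ok after 3 calls
```

Swap FlakyModel for your rate-limited wrapper in real code; the retry loop only cares that the object has an invoke() method.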

Testing It

Run the script and watch the timestamps or visible pauses between requests. With min_interval_seconds=2.0, each call after the first should start roughly two seconds after the previous one.

If you want to verify it more precisely, add time.perf_counter() around each invoke() call and compare elapsed times across multiple requests. You should see consistent spacing instead of back-to-back bursts.
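One way to script that check: record time.perf_counter() before each call and compare the gaps. The stand-in callable below just sleeps instead of hitting the API, so the harness runs anywhere:

```python
import time

def call_gaps(fn, n_calls):
    """Call fn n_calls times and return the elapsed time between consecutive calls."""
    stamps = []
    for _ in range(n_calls):
        stamps.append(time.perf_counter())
        fn()
    return [b - a for a, b in zip(stamps, stamps[1:])]

# Stand-in for a rate-limited invoke(); real code would call the wrapped model.
gaps = call_gaps(lambda: time.sleep(0.2), 3)
print([f"{g:.2f}s" for g in gaps])  # each gap roughly 0.2s
```

Consistent gaps at or above your minimum interval mean the limiter is working; gaps near zero mean requests are going out back-to-back.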

Also test failure behavior by temporarily lowering your provider quota or sending requests in a loop from multiple processes. The limiter should reduce how often you hit 429s, and the retry wrapper should recover from occasional transient errors.

Next Steps

  • Learn token-based throttling instead of request-based throttling for models with strict TPM limits.
  • Move this logic into middleware or a shared service if multiple workers need coordinated limits.
  • Add async support with ainvoke() if your LangChain app uses asyncio heavily.

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.
