LangChain Tutorial (Python): rate limiting API calls for beginners
This tutorial shows you how to rate-limit LangChain API calls in Python so your app stays under provider quotas and avoids 429 errors. You'll build a simple wrapper around a LangChain chat model, then add per-call throttling and retry handling you can drop into real code.
What You'll Need
- Python 3.10+
- A virtual environment
- `langchain`
- `langchain-openai`
- `openai`
- An OpenAI API key
- Basic familiarity with LangChain chat models and `.invoke()`
Install the packages:
```bash
pip install langchain langchain-openai openai
```
Set your API key:
```bash
export OPENAI_API_KEY="your-api-key-here"
```
Step-by-Step
- Start with a plain LangChain chat model. This gives you a baseline before adding throttling, and it uses the standard `invoke()` method you'll see in most production code.
```python
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

response = llm.invoke([HumanMessage(content="Write one sentence about rate limiting.")])
print(response.content)
```
- Add a simple rate limiter using time-based spacing between calls. This is the easiest pattern for beginners: every request waits just long enough that you never exceed your target requests per second.
```python
import time
from threading import Lock

class SimpleRateLimiter:
    """Enforces a minimum delay between consecutive calls; safe to share across threads."""

    def __init__(self, min_interval_seconds: float):
        self.min_interval_seconds = min_interval_seconds
        self._lock = Lock()
        self._last_call_time = 0.0

    def wait(self):
        # Sleep just long enough that calls are at least min_interval_seconds apart.
        with self._lock:
            now = time.time()
            elapsed = now - self._last_call_time
            sleep_for = self.min_interval_seconds - elapsed
            if sleep_for > 0:
                time.sleep(sleep_for)
            self._last_call_time = time.time()
```
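You can sanity-check the limiter before wiring it to a model by timing two back-to-back `wait()` calls. This is a minimal sketch with no API calls; the one-second interval is just an illustrative value.

```python
import time

# Hypothetical quick check: the second wait() should block for about one second.
limiter = SimpleRateLimiter(min_interval_seconds=1.0)

start = time.perf_counter()
limiter.wait()  # first call returns immediately
limiter.wait()  # second call sleeps until the interval has elapsed
print(f"Two calls took {time.perf_counter() - start:.2f}s")  # expect roughly 1.0s
```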
- Wrap the LangChain model so every call passes through the limiter first. This keeps your application code clean, and you can reuse the wrapper anywhere you call the model.
```python
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

class RateLimitedChatModel:
    def __init__(self, model_name: str, min_interval_seconds: float):
        self.llm = ChatOpenAI(model=model_name, temperature=0)
        self.limiter = SimpleRateLimiter(min_interval_seconds)

    def invoke(self, messages):
        self.limiter.wait()
        return self.llm.invoke(messages)

rate_limited_llm = RateLimitedChatModel("gpt-4o-mini", min_interval_seconds=2.0)

for i in range(3):
    result = rate_limited_llm.invoke([
        HumanMessage(content=f"Return only the number {i}.")
    ])
    print(i, "=>", result.content)
```
- If you're using a chain, apply the same wrapper at the model boundary. The chain doesn't need to know anything about rate limits; it just calls the wrapped model like normal. (If you prefer LCEL's pipe syntax, see the sketch after this example.)
```python
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are concise."),
    ("human", "{topic}")
])

class PromptedRateLimitedModel:
    def __init__(self, model_name: str, min_interval_seconds: float):
        self.llm = RateLimitedChatModel(model_name, min_interval_seconds)

    def invoke(self, inputs):
        messages = prompt.invoke(inputs).to_messages()
        return self.llm.invoke(messages)

chain_model = PromptedRateLimitedModel("gpt-4o-mini", 1.5)
output = chain_model.invoke({"topic": "Explain retry-safe API usage in one line."})
print(output.content)
```
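If you already compose chains with LCEL's pipe syntax, one way to keep the limiter at the model boundary is to wrap the rate-limited call in a `RunnableLambda`. This is a sketch rather than the only approach; it assumes `RunnableLambda` from `langchain_core.runnables` and reuses the `prompt` and `rate_limited_llm` objects defined above.

```python
from langchain_core.runnables import RunnableLambda

# The lambda receives the formatted prompt value and forwards its messages
# through the rate-limited wrapper instead of calling the raw model.
rate_limited_step = RunnableLambda(
    lambda prompt_value: rate_limited_llm.invoke(prompt_value.to_messages())
)

chain = prompt | rate_limited_step
print(chain.invoke({"topic": "Name one benefit of client-side rate limiting."}).content)
```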
- Add retry handling for real-world bursts. Rate limiting prevents most quota issues, but retries help when the provider still returns transient failures like 429s or 503s. (A stricter variant that retries only throttling errors follows the example below.)
```python
import time

def invoke_with_retry(model, messages, max_retries=3):
    for attempt in range(max_retries + 1):
        try:
            return model.invoke(messages)
        except Exception as e:
            if attempt == max_retries:
                raise
            wait_seconds = 2 ** attempt
            print(f"Retrying after error: {e}. Waiting {wait_seconds}s")
            time.sleep(wait_seconds)
```
```python
message_batch = [HumanMessage(content="Give me a short fact about Python.")]
result = invoke_with_retry(rate_limited_llm, message_batch)
print(result.content)
```
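Catching every `Exception` will also retry unrecoverable errors such as a bad API key or a malformed request. A stricter variant, sketched below with a hypothetical `invoke_with_selective_retry` helper, retries only the `openai` package's rate-limit and connection errors; whether those exceptions surface unchanged through `ChatOpenAI` depends on your library versions, so treat this as an assumption to verify.

```python
import time
from openai import APIConnectionError, RateLimitError  # assumed to propagate through ChatOpenAI

def invoke_with_selective_retry(model, messages, max_retries=3):
    for attempt in range(max_retries + 1):
        try:
            return model.invoke(messages)
        except (RateLimitError, APIConnectionError) as e:
            # Retry only throttling and connectivity errors; anything else raises immediately.
            if attempt == max_retries:
                raise
            wait_seconds = 2 ** attempt
            print(f"Transient error: {e}. Waiting {wait_seconds}s before retry {attempt + 1}")
            time.sleep(wait_seconds)
```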
Testing It
Run the script and watch the timestamps or the visible pauses between requests. With `min_interval_seconds=2.0`, every call after the first should wait roughly two seconds before hitting the API.
If you want to verify it more precisely, wrap each `invoke()` call with `time.perf_counter()` and compare elapsed times across multiple requests, as in the sketch below. You should see consistent spacing instead of back-to-back bursts.
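A minimal version of that check, reusing `rate_limited_llm` from the earlier step (the prompt text here is just a placeholder):

```python
import time
from langchain_core.messages import HumanMessage

start = time.perf_counter()
for i in range(3):
    rate_limited_llm.invoke([HumanMessage(content=f"Say 'ok' ({i}).")])
    print(f"Call {i} returned at +{time.perf_counter() - start:.2f}s")

# With min_interval_seconds=2.0, consecutive API calls start at least two
# seconds apart, so the whole loop should take roughly four seconds or more.
```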
Also test failure behavior by temporarily lowering your provider quota (if your provider allows it) or by sending requests in a loop from multiple processes. Keep in mind that `SimpleRateLimiter` only coordinates threads within a single process, so each process gets its own limiter; even so, it should reduce how often you hit 429s, and the retry wrapper should recover from occasional transient errors.
Next Steps
- Learn token-based throttling instead of request-based throttling for models with strict TPM (tokens-per-minute) limits.
- Move this logic into middleware or a shared service if multiple workers need coordinated limits.
- Add async support with `ainvoke()` if your LangChain app uses asyncio heavily (a minimal sketch follows this list).
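As a starting point for that last item, here is a rough async sketch. The `AsyncRateLimiter` class is an illustrative assumption that mirrors `SimpleRateLimiter` with asyncio primitives; `ainvoke()` is the standard async method on LangChain chat models.

```python
import asyncio
import time
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

class AsyncRateLimiter:
    def __init__(self, min_interval_seconds: float):
        self.min_interval_seconds = min_interval_seconds
        self._lock = asyncio.Lock()
        self._last_call_time = 0.0

    async def wait(self):
        # Same spacing logic as SimpleRateLimiter, but without blocking the event loop.
        async with self._lock:
            sleep_for = self.min_interval_seconds - (time.time() - self._last_call_time)
            if sleep_for > 0:
                await asyncio.sleep(sleep_for)
            self._last_call_time = time.time()

async def main():
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    limiter = AsyncRateLimiter(min_interval_seconds=2.0)
    for i in range(3):
        await limiter.wait()
        result = await llm.ainvoke([HumanMessage(content=f"Return only the number {i}.")])
        print(i, "=>", result.content)

asyncio.run(main())
```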
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.