How to Fix 'timeout error when scaling' in LangChain (Python)
When you see a timeout error while scaling a LangChain Python app, it usually means your chain or agent is doing too much work inside one request and a downstream call is timing out. In practice, this shows up when you move from a single local test to concurrent traffic, larger prompts, slower tools, or remote model providers with tighter request limits.
The fix is usually not “increase timeout” blindly. You need to find whether the timeout comes from the LLM call, a retriever/tool call, or your own concurrency pattern.
The Most Common Cause
The #1 cause is blocking work inside a chain step that runs longer than the configured timeout. This often happens with RunnableParallel, tool calls, or synchronous I/O inside an async pipeline.
Here’s the broken pattern:
# BROKEN: synchronous tool call + no timeout control
import time

from langchain_openai import ChatOpenAI
from langchain_core.runnables import RunnableLambda

def slow_lookup(query: str) -> str:
    # Simulates a slow DB/API call
    time.sleep(12)
    return f"result for {query}"

llm = ChatOpenAI(model="gpt-4o-mini", timeout=10)
chain = RunnableLambda(lambda x: slow_lookup(x["query"])) | llm
result = chain.invoke({"query": "policy status"})
And the fixed pattern:
# FIXED: move slow I/O into async code and set explicit timeouts
import asyncio

from langchain_openai import ChatOpenAI
from langchain_core.runnables import RunnableLambda

async def slow_lookup(query: str) -> str:
    await asyncio.sleep(2)  # simulated async DB/API call
    return f"result for {query}"

llm = ChatOpenAI(model="gpt-4o-mini", timeout=30)

async def build_input(x: dict) -> str:
    # Return a plain string: ChatOpenAI accepts a string prompt directly
    return await slow_lookup(x["query"])

chain = RunnableLambda(build_input) | llm
result = asyncio.run(chain.ainvoke({"query": "policy status"}))
Why this matters:
- Calling synchronous `invoke()` inside async code blocks the event loop.
- A `timeout=10` on `ChatOpenAI` does not help if your own preprocessing takes 12 seconds.
- Under load, these delays stack up and look like scaling failures.
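If the slow call is a third-party client you cannot rewrite as async, a common middle ground is to push it onto a worker thread so the event loop stays free. A minimal sketch (the `slow_lookup` name mirrors the example above; the short sleep is just a stand-in for real I/O):

```python
import asyncio
import time

def slow_lookup(query: str) -> str:
    # Blocking stand-in for a slow DB/API client you can't make async
    time.sleep(0.2)
    return f"result for {query}"

async def build_input(query: str) -> str:
    # to_thread runs the blocking call in a worker thread,
    # so the event loop can serve other requests meanwhile
    return await asyncio.to_thread(slow_lookup, query)

async def main() -> list[str]:
    # Three lookups overlap instead of blocking each other
    return await asyncio.gather(*[build_input(q) for q in ("a", "b", "c")])

results = asyncio.run(main())
```

The same coroutine can then be wrapped in a `RunnableLambda` as in the fixed pattern above.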
Other Possible Causes
1) Too much parallelism hitting provider rate limits
If you fan out requests with batch() or RunnableParallel, you can trigger upstream throttling that surfaces as timeouts.
# Risky under load
results = chain.batch(inputs, config={"max_concurrency": 50})
Fix it by lowering concurrency:
results = chain.batch(inputs, config={"max_concurrency": 5})
If you’re using OpenAI or Azure OpenAI, also watch for HTTP 429s wrapped as retries that end in:
- `openai.APITimeoutError`
- `httpx.ReadTimeout`
- `tenacity.RetryError` (raised when a retry wrapper exhausts its attempts)
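If you need finer control than `max_concurrency`, you can cap in-flight requests yourself with a semaphore. A minimal sketch, where `call_provider` is a hypothetical stand-in for one `chain.ainvoke()` call:

```python
import asyncio

MAX_IN_FLIGHT = 5  # mirrors max_concurrency=5 above

async def call_provider(i: int) -> int:
    # Stand-in for one chain.ainvoke() call to the model provider
    await asyncio.sleep(0.01)
    return i

async def bounded_batch(n: int) -> list[int]:
    sem = asyncio.Semaphore(MAX_IN_FLIGHT)

    async def one(i: int) -> int:
        async with sem:  # at most MAX_IN_FLIGHT calls run at once
            return await call_provider(i)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*[one(i) for i in range(n)])

results = asyncio.run(bounded_batch(20))
```

This keeps bursty traffic from slamming the provider all at once, which is usually what turns 429 throttling into timeouts.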
2) Retriever or vector store latency
A slow retriever can make the whole chain look like the LLM timed out.
# Example: retriever call is the bottleneck
docs = retriever.get_relevant_documents(query)  # or retriever.invoke(query) on newer versions
Common fixes:
- Reduce top-k results
- Add metadata filters
- Cache embeddings or retrieval results
- Move to async retrieval with `aget_relevant_documents()`
Config example:
retriever.search_kwargs = {"k": 3}
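Caching repeated retrievals is often the cheapest win. A minimal sketch using `functools.lru_cache` with a stub retrieval function (the real call would hit your vector store; note the cache needs a hashable return value, hence the tuple):

```python
from functools import lru_cache

calls = 0  # counts how often the "real" retriever runs

@lru_cache(maxsize=256)
def cached_retrieve(query: str) -> tuple[str, ...]:
    # Stand-in for the actual vector store lookup;
    # identical queries are answered from the cache
    global calls
    calls += 1
    return (f"doc about {query}",)

cached_retrieve("policy status")
cached_retrieve("policy status")  # served from cache, no second lookup
```

For production, a shared cache (e.g. Redis) with a TTL is the usual upgrade, since `lru_cache` is per-process and never expires entries.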
3) Tool execution without its own timeout
LangChain agents often call external APIs through tools. If the tool hangs, the agent waits until the parent request dies.
import requests
from langchain_core.tools import tool

@tool
def fetch_claim_status(claim_id: str) -> str:
    # No timeout: if the API hangs, the agent hangs with it
    return requests.get(f"https://api.example.com/claims/{claim_id}").text
Fix it by setting a real network timeout:
@tool
def fetch_claim_status(claim_id: str) -> str:
    resp = requests.get(
        f"https://api.example.com/claims/{claim_id}",
        timeout=(3.0, 10.0),  # (connect timeout, read timeout) in seconds
    )
    return resp.text
4) Context window bloat from oversized prompts
If your prompt grows with every retrieved document or chat turn, model latency increases fast.
prompt = "\n\n".join([doc.page_content for doc in docs]) # too much text
Trim aggressively:
prompt = "\n\n".join([doc.page_content[:1000] for doc in docs[:3]])
This is common with:
- `ConversationalRetrievalChain`
- long chat histories in memory classes like `ConversationBufferMemory`
- large document stuffing chains
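The per-document slice above helps, but a total budget is safer: one oversized document can still blow up the prompt. A minimal sketch of a budget-aware packer (the function name and limits are illustrative, not a LangChain API):

```python
def pack_context(docs: list[str], per_doc: int = 1000, max_docs: int = 3,
                 budget: int = 2500) -> str:
    """Trim each doc, cap the count, and stop once a total budget is hit."""
    pieces: list[str] = []
    used = 0
    for text in docs[:max_docs]:
        chunk = text[:per_doc]
        if used + len(chunk) > budget:
            chunk = chunk[: budget - used]  # take only what fits
        if not chunk:
            break
        pieces.append(chunk)
        used += len(chunk)
    return "\n\n".join(pieces)

# Three 2000-char docs: first two trimmed to 1000, third cut to fit the budget
context = pack_context(["a" * 2000, "b" * 2000, "c" * 2000])
```

Counting tokens instead of characters (e.g. with your provider's tokenizer) is more accurate, but the shape of the logic is the same.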
How to Debug It
1) Find which step is slow
- Add timestamps around retrieval, tool calls, and LLM invocation.
- If only one stage spikes, that’s your culprit.
2) Turn on LangChain tracing
- Use LangSmith or verbose logging.
- Look for where execution stalls before the final exception.
import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "timeout-debug"
3) Check the exact exception type
- Different failures point to different layers:
- `openai.APITimeoutError` → provider/network issue
- `httpx.ReadTimeout` → HTTP client timeout
- `asyncio.TimeoutError` → your own coroutine timeout wrapper
4) Reduce concurrency to isolate scaling issues
- Run one request at a time.
- Then try `max_concurrency=2`, then `5`, then higher.
- If failures appear only above a threshold, it’s load-related.
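The first debug step can be a ten-line helper. A minimal sketch of per-stage timing with a context manager (the sleeps are stand-ins for your real retrieval and LLM calls):

```python
import time
from contextlib import contextmanager

timings: dict[str, float] = {}

@contextmanager
def timed(stage: str):
    # Wrap each stage (retrieval, tool call, LLM call) to see which one spikes
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start

with timed("retrieval"):
    time.sleep(0.05)  # stand-in for the retriever call
with timed("llm"):
    time.sleep(0.01)  # stand-in for the model call

slowest = max(timings, key=timings.get)
```

Logging `timings` per request (or shipping it to your tracing backend) makes the spiking stage obvious under real load.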
Prevention
- Set explicit timeouts at every layer:
  - LLM client timeout
  - HTTP request timeout in tools
  - database/query timeout in retrievers
- Keep chains small and predictable:
  - cap retrieved documents
  - trim chat history
  - avoid giant prompt stuffing patterns
- Load test before production:
  - run concurrent batch tests against real providers and real tools
  - watch p95 latency, not just average latency
If you want one rule to remember: don’t treat LangChain as magic glue. Every external dependency in your chain needs its own timeout, concurrency limit, and failure mode handled explicitly.
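On top of per-layer timeouts, an outer deadline around the whole chain call makes a stalled request fail fast no matter which layer hangs. A minimal sketch with `asyncio.wait_for`, where `run_chain` is a hypothetical stand-in for `chain.ainvoke()` and the tiny deadline is only for illustration:

```python
import asyncio

async def run_chain(query: str) -> str:
    # Stand-in for chain.ainvoke({"query": query});
    # the sleep simulates a stalled downstream call
    await asyncio.sleep(5)
    return "done"

async def handle_request(query: str, deadline: float = 0.1) -> str:
    try:
        # One outer deadline, independent of per-layer timeouts
        return await asyncio.wait_for(run_chain(query), timeout=deadline)
    except asyncio.TimeoutError:
        return "timed out: check retriever/tool/LLM timings"

result = asyncio.run(handle_request("policy status"))
```

In production the except branch would return a proper error response and emit metrics, but the point stands: the request dies on your schedule, not the slowest dependency's.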
Keep learning
- The complete AI Agents Roadmap: my full 8-step breakdown
- Free: The AI Agent Starter Kit (PDF checklist + starter code)
- Work with me: I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.