How to Fix 'timeout error in production' in LlamaIndex (Python)
When you see timeout error in production with LlamaIndex, it usually means one of your upstream calls is taking longer than the default timeout window. In practice, this shows up when LLM calls, embedding calls, vector DB queries, or document ingestion are running under production latency and network conditions.
The key point: this is rarely a “LlamaIndex bug.” It’s usually a timeout configuration problem, a slow dependency, or a request pattern that works locally but fails under load.
The Most Common Cause
The #1 cause is using the default client timeout for an operation that takes too long in production.
This happens a lot with OpenAI, AzureOpenAI, Anthropic, or any remote service wrapped by LlamaIndex. The request succeeds locally because your test corpus is small and the network is clean. In production, the same call hits a larger payload or slower endpoint and raises something like:
- openai.APITimeoutError: Request timed out
- httpx.ReadTimeout: The read operation timed out
- TimeoutError: timed out
Broken vs fixed pattern
| Broken | Fixed |
|---|---|
| Uses default timeout | Sets an explicit timeout |
| No retry strategy | Adds retries/backoff |
| Long-running ingestion in a request path | Moves ingestion to background/job flow |
```python
# BROKEN
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms.openai import OpenAI

docs = SimpleDirectoryReader("./data").load_data()
llm = OpenAI(model="gpt-4o")  # default timeout can be too short in prod

index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine(llm=llm)
response = query_engine.query("Summarize the policy changes")
print(response)
```
```python
# FIXED
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms.openai import OpenAI

docs = SimpleDirectoryReader("./data").load_data()
llm = OpenAI(
    model="gpt-4o",
    timeout=120.0,   # give the request room to complete
    max_retries=3,   # handle transient failures better
)

index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine(llm=llm)
response = query_engine.query("Summarize the policy changes")
print(response)
```
If you’re seeing httpx.ReadTimeout or openai.APITimeoutError, this is the first thing to fix.
Other Possible Causes
1) Embedding calls are timing out during indexing
Large batches of documents can make embedding requests slow enough to fail.
```python
# CONFIG FIX
from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding(
    model="text-embedding-3-large",
    timeout=120.0,
    max_retries=3,
)
```
If you’re indexing thousands of chunks at once, reduce batch size or move indexing off the web request path.
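If batch size is the problem, recent versions of the OpenAI embedding wrapper expose an embed_batch_size parameter. A sketch (the default batch size has varied across llama_index versions, so treat the numbers as tuning starting points, not prescriptions):

```python
from llama_index.embeddings.openai import OpenAIEmbedding

# Smaller batches mean smaller, faster embedding requests that are
# less likely to hit a read timeout on large corpora.
embed_model = OpenAIEmbedding(
    model="text-embedding-3-large",
    embed_batch_size=32,  # tune down if individual requests time out
    timeout=120.0,
    max_retries=3,
)
```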
2) Your vector database is slow or underprovisioned
A slow Pinecone, Weaviate, Qdrant, or Postgres vector store can trigger timeouts during retrieval.
```python
# EXAMPLE: make retrieval less expensive
query_engine = index.as_query_engine(
    similarity_top_k=3  # lower than 10 or 20 if latency is high
)
```
Also check your vector DB indexes, connection pool limits, and whether you’re querying across too many namespaces/collections.
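You can also set the store-side timeout explicitly on the native client and hand that client to LlamaIndex. A sketch assuming Qdrant (the URL and collection name are placeholders; other stores expose similar knobs on their own clients):

```python
from qdrant_client import QdrantClient
from llama_index.vector_stores.qdrant import QdrantVectorStore

# Configure the timeout on the native client, then pass it to LlamaIndex
# so retrieval calls fail fast instead of hanging on a slow store.
client = QdrantClient(url="http://localhost:6333", timeout=30)
vector_store = QdrantVectorStore(client=client, collection_name="docs")
```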
3) You are doing synchronous work inside an API request
This is common in FastAPI/Flask apps where document loading, indexing, and querying all happen inline.
```python
# BAD: heavy work inside request handler
@app.post("/ask")
def ask():
    docs = SimpleDirectoryReader("./data").load_data()
    index = VectorStoreIndex.from_documents(docs)
    return index.as_query_engine().query("What changed?")
Move ingestion/indexing to a background worker and only query prebuilt indexes from the API layer.
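The pattern, framework-agnostically: the handler only schedules ingestion on a worker and returns immediately, so the HTTP request never waits on indexing. A minimal stdlib sketch, with a stand-in function where the LlamaIndex loading and indexing calls would go:

```python
from concurrent.futures import ThreadPoolExecutor
import time

# Worker pool that does the heavy ingestion outside the request path.
executor = ThreadPoolExecutor(max_workers=2)

def ingest_documents(path: str) -> str:
    # Stand-in for SimpleDirectoryReader(...).load_data() followed by
    # VectorStoreIndex.from_documents(...), which is the slow part.
    time.sleep(0.2)
    return f"index built from {path}"

def handle_request(path: str):
    # The "handler" only schedules the job and returns immediately.
    future = executor.submit(ingest_documents, path)
    return {"status": "ingestion scheduled"}, future

start = time.time()
response, job = handle_request("./data")
elapsed = time.time() - start

print(response["status"])
print(job.result())  # the worker finishes in the background
```

In a real FastAPI or Flask app the same split applies: the endpoint enqueues the job (background task, Celery, etc.) and only prebuilt indexes are queried inline.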
4) Context windows are too large
If you send too much text into the LLM, generation slows down and timeouts become more likely.
```python
# Reduce payload size
query_engine = index.as_query_engine(
    similarity_top_k=2,
    response_mode="compact"
)
```
Large prompts also increase token usage and cost. In production, that usually becomes both a latency and budget problem.
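Chunking also bounds how much text each retrieved node contributes to the prompt. A sketch using LlamaIndex's SentenceSplitter (the chunk sizes here are illustrative starting points, not measured recommendations):

```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

# Smaller chunks keep each retrieved node, and therefore the final
# prompt, small even when similarity_top_k stays the same.
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)

docs = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(docs, transformations=[splitter])
```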
How to Debug It
1. Identify which call is timing out
   - Check logs for the exact exception: openai.APITimeoutError, httpx.ReadTimeout, or TimeoutError.
   - If it happens during VectorStoreIndex.from_documents(), it's probably embeddings.
   - If it happens during .query(), it's likely retrieval or generation.
2. Add timing around each stage

```python
import time

start = time.time()
docs = SimpleDirectoryReader("./data").load_data()
print("load:", time.time() - start)

start = time.time()
index = VectorStoreIndex.from_documents(docs)
print("index:", time.time() - start)

start = time.time()
response = index.as_query_engine().query("What changed?")
print("query:", time.time() - start)
```

3. Increase timeout one layer at a time
   - First increase the LLM timeout, then the embedding timeout, then the vector DB client timeout.
   - If one change fixes it, you found the bottleneck.
4. Test with smaller input
   - One document instead of ten thousand.
   - Top-k of 2 instead of 10.
   - Short prompt instead of full context.
If the error disappears with smaller payloads, your issue is load-related, not code-related.
Prevention
- Set explicit timeouts on every external client: LLMs, embeddings, HTTP clients, and vector stores.
- Keep indexing out of request handlers. Precompute indexes in jobs or workers.
- Log stage timings so you can tell whether failures happen in loading, embedding, retrieval, or generation.
- Use smaller chunks and lower similarity_top_k unless you have measured latency headroom.
If you want one rule to keep: never let production inherit default timeout settings from local dev. That’s how httpx.ReadTimeout becomes your pager at 2 AM.
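One way to enforce that rule in a single place is LlamaIndex's global Settings object, so every index and query engine you build inherits explicit production values instead of library defaults. A sketch:

```python
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# Components that aren't given an explicit llm/embed_model fall back
# to these, so nothing silently inherits a default timeout.
Settings.llm = OpenAI(model="gpt-4o", timeout=120.0, max_retries=3)
Settings.embed_model = OpenAIEmbedding(
    model="text-embedding-3-large",
    timeout=120.0,
    max_retries=3,
)
```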
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.