How to Fix 'connection timeout in production' in LlamaIndex (Python)
If you’re seeing connection timeout in production with LlamaIndex, it usually means your app tried to call an external service — OpenAI, a vector DB, an embedding endpoint, or your own API — and the request didn’t complete before the network stack gave up. In practice, this shows up under load, behind a proxy, or when your LlamaIndex client is using defaults that are fine locally but too aggressive for production.
The key detail: this is usually not a “LlamaIndex bug.” It’s almost always a timeout, networking, or client configuration problem around the LlamaIndex component you’re using.
The Most Common Cause
The #1 cause is using the default LLM/embedding client settings in code that runs in production with slower network paths, cold starts, or larger payloads.
A common pattern is creating OpenAI / OpenAIEmbedding / PineconeVectorStore clients without explicit timeout and retry settings. Under real traffic, that turns into errors like:
- httpx.ReadTimeout
- httpx.ConnectTimeout
- openai.APITimeoutError
- TimeoutError: Request timed out
- ValueError: Could not complete request to OpenAI API
Broken vs fixed
| Broken pattern | Fixed pattern |
|---|---|
| Uses defaults, no timeout tuning | Sets explicit timeouts and retries |
| Recreates clients per request | Reuses long-lived clients |
| No observability around failures | Logs the exact failing component |
# BROKEN
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
llm = OpenAI(model="gpt-4o-mini")
embed_model = OpenAIEmbedding(model="text-embedding-3-small")
# This may work locally but fail in production under latency.
# FIXED
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
llm = OpenAI(
model="gpt-4o-mini",
timeout=60.0,
max_retries=3,
)
embed_model = OpenAIEmbedding(
model="text-embedding-3-small",
timeout=60.0,
max_retries=3,
)
If you’re using the lower-level OpenAI SDK directly through LlamaIndex integrations, make sure the underlying HTTP client also has sane connect/read timeouts. Production traffic needs headroom.
Other Possible Causes
1) You are recreating clients inside the request path
This is common in FastAPI, Flask, Celery tasks, and serverless functions. Each request builds a new LLM client and opens new connections.
# BAD: new client on every request
def answer_question(question: str):
from llama_index.llms.openai import OpenAI
llm = OpenAI(model="gpt-4o-mini")
return llm.complete(question)
# GOOD: create once and reuse
from llama_index.llms.openai import OpenAI
llm = OpenAI(model="gpt-4o-mini", timeout=60.0)
def answer_question(question: str):
return llm.complete(question)
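If you want reuse without a module-level global, a cached factory works too. This is a stdlib-only sketch; the `make_llm` name and the use of `functools.lru_cache` are my additions, not LlamaIndex API:

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def make_llm():
    # Imported lazily so the module loads without the dependency;
    # the client itself is still built exactly once.
    from llama_index.llms.openai import OpenAI
    return OpenAI(model="gpt-4o-mini", timeout=60.0, max_retries=3)

def answer_question(question: str):
    # Every call shares the same client and its connection pool.
    return make_llm().complete(question)
```

The cached factory is convenient in Celery tasks and serverless handlers, where module import order makes eager global construction awkward.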
2) Your vector DB is timing out during retrieval
If the issue happens during VectorStoreIndex.from_vector_store() or query-time retrieval, the bottleneck may be Pinecone, Qdrant, Weaviate, Postgres/pgvector, or Milvus.
# Example: retrieval path causing timeouts
query_engine = index.as_query_engine(similarity_top_k=20)
response = query_engine.query("Summarize policy exclusions")
Fixes:
- Lower `similarity_top_k`
- Add indexes on the vector store side
- Increase the read timeout on the vector DB client
- Reduce metadata payload size
3) Proxy / firewall / NAT gateway issues
In enterprise environments, outbound traffic can be delayed or blocked by proxy layers. Python may surface this as a connection timeout even though your code is correct.
export HTTPS_PROXY=http://proxy.company.local:8080
export HTTP_PROXY=http://proxy.company.local:8080
If your environment requires proxy config and you don’t set it correctly, requests to LLM APIs can hang until they time out.
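You can verify what proxy configuration Python actually resolved at startup; the stdlib's `urllib.request.getproxies()` reads the standard environment variables:

```python
import urllib.request

def log_proxy_config():
    """Print the proxy settings Python resolved from the environment.

    An empty result, in an environment that requires a proxy,
    explains outbound requests that hang until they time out.
    """
    proxies = urllib.request.getproxies()
    if not proxies:
        print("No proxy configured; outbound traffic goes direct.")
    for scheme, url in proxies.items():
        print(f"{scheme} traffic will use proxy: {url}")
    return proxies
```

Running this once at service startup turns a silent misconfiguration into a log line you can grep for.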
4) Large payloads are making requests slow
Huge context windows, oversized documents, or giant tool outputs increase latency. That often shows up when using ContextChatEngine, RouterQueryEngine, or when chunking is too coarse.
# Too much context sent at once
index = VectorStoreIndex.from_documents(documents) # documents are huge and poorly chunked
query_engine = index.as_query_engine()
Better:
- Reduce chunk size
- Increase chunk overlap only if needed
- Summarize before retrieval if documents are massive
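To see why chunk size and overlap matter for payload size, here is the splitting logic in miniature. It is a character-based stand-in for what LlamaIndex's node parsers do with tokens; the function name is illustrative:

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64):
    """Split text into overlapping windows.

    Smaller chunks mean each retrieval hit drags less payload over the
    network; overlap preserves context across chunk boundaries.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

With `similarity_top_k=20`, every extra 100 characters per chunk is 2,000 extra characters on every query, which is exactly the kind of creeping payload growth that pushes requests past their timeout.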
How to Debug It
- Identify which LlamaIndex call fails
  - Is it embedding creation? Retrieval? LLM completion? The vector store query?
  - Wrap each stage separately so you know where the timeout happens.
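A lightweight way to wrap each stage is a timing context manager. This stdlib sketch records which stage was running when a timeout escaped:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed_stage(name: str):
    """Time a pipeline stage and report which one failed."""
    start = time.perf_counter()
    try:
        yield
    except Exception as exc:
        elapsed = time.perf_counter() - start
        print(f"{name} failed after {elapsed:.2f}s: {type(exc).__name__}")
        raise
    else:
        elapsed = time.perf_counter() - start
        print(f"{name} completed in {elapsed:.2f}s")

# Usage (illustrative; method names depend on your pipeline):
# with timed_stage("embedding"):
#     vectors = embed_model.get_text_embedding_batch(texts)
# with timed_stage("retrieval"):
#     nodes = retriever.retrieve(question)
```

The elapsed time is as diagnostic as the exception class: a failure at exactly 60.00s is a timeout you configured, not a slow upstream.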
- Log the exact exception class
  - Look for `httpx.ConnectTimeout`, `httpx.ReadTimeout`, `openai.APITimeoutError`, or provider-specific exceptions.
  - The class tells you whether this is connection setup vs. a slow response vs. an upstream service failure.
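That triage can be expressed as a small lookup on the exception class name. The mapping below is a hypothetical helper of my own, not part of any SDK:

```python
# Hypothetical triage table: exception class name -> likely diagnosis.
DIAGNOSIS = {
    "ConnectTimeout": "connection setup failed: check DNS, proxy, firewall",
    "ReadTimeout": "server accepted the request but responded too slowly",
    "APITimeoutError": "the provider SDK gave up waiting: raise its timeout",
}

def diagnose(exc: BaseException) -> str:
    """Map a caught exception to a first-pass diagnosis string."""
    return DIAGNOSIS.get(type(exc).__name__, "unrecognized; log the full traceback")
```

Matching on class names keeps the helper import-free, so it works whether the exception came from httpx, the OpenAI SDK, or a vector DB client.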
- Measure latency outside LlamaIndex
  - Hit the same endpoint with `curl` or a minimal Python script.
  - If raw requests are slow too, this is infrastructure or provider latency, not indexing logic.
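The minimal Python script can be stdlib-only, so it bypasses LlamaIndex and httpx entirely; the URL in the comment is a placeholder to swap for whatever host your pipeline actually calls:

```python
import time
import urllib.request

def measure_latency(url: str, timeout: float = 10.0) -> float:
    """Time a single raw request, bypassing LlamaIndex entirely."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()
    return time.perf_counter() - start

# Usage (illustrative; point it at the host your pipeline calls):
# print(f"{measure_latency('https://example.com'):.2f}s")
```

If this number is already multi-second from inside your production network, no amount of LlamaIndex tuning will fix it; the problem is the network path or the provider.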
- Reduce concurrency and payload size
  - Drop batch sizes.
  - Lower retrieval top-k.
  - Temporarily test with one short document and one simple prompt. If that works, your production workload is too heavy for the current timeouts.
Prevention
- Set explicit timeouts and retries on every external client used by LlamaIndex.
- Reuse long-lived clients; don’t construct LLM/vector DB clients per request.
- Keep chunks smaller and retrieval narrower so queries don’t drag huge payloads through the network.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.