# How to Fix 'connection timeout' in LlamaIndex (Python)
## What the Error Means
connection timeout in LlamaIndex usually means your Python process tried to reach an upstream service, but the request never completed before the network client gave up. In practice, this shows up when LlamaIndex is calling an LLM, embedding API, vector database, or document source over HTTP.
You’ll typically see it during `query_engine.query(...)`, `VectorStoreIndex.from_documents(...)`, embedding generation, or any remote retriever call. The stack trace often includes `httpx.ConnectTimeout`, `httpx.ReadTimeout`, or a wrapper like `openai.APITimeoutError`.
## The Most Common Cause
The #1 cause is using the wrong endpoint or a service that is unreachable from your runtime. In LlamaIndex, this often happens when `Settings.llm` or `Settings.embed_model` points to a remote provider with a bad base URL, a missing proxy config, or a dead host.
Here’s the broken pattern followed by the fixed pattern:

**Broken:**

```python
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI

# Wrong: bad base_url or unreachable host
Settings.llm = OpenAI(
    model="gpt-4o-mini",
    api_key="sk-xxx",
    api_base="https://api.openai.local/v1",  # not reachable
)

response = Settings.llm.complete("Hello")
print(response)
```

**Fixed:**

```python
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI

# Right: valid endpoint
Settings.llm = OpenAI(
    model="gpt-4o-mini",
    api_key="sk-xxx",
    api_base="https://api.openai.com/v1",
)

response = Settings.llm.complete("Hello")
print(response)
```
If you’re using Azure OpenAI, Ollama, or a self-hosted gateway, make sure the host is reachable from the machine running Python. A local notebook can hit `localhost`, but Docker, Kubernetes, and CI runners cannot.
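A quick sanity check from the runtime itself: resolve the hostname before anything else. The hostname below is an example; substitute your actual endpoint.

```python
import socket

# A DNS failure here explains a connect timeout immediately
print(socket.gethostbyname("api.openai.com"))
```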
## Other Possible Causes
### 1) Embedding calls are timing out
LlamaIndex may be fine, but embeddings are slow or blocked. This is common with large batches or cold starts.
```python
from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding

Settings.embed_model = OpenAIEmbedding(
    model="text-embedding-3-small",
    api_key="sk-xxx",
    timeout=10,
)
```
If your document set is large, reduce the batch size or increase the timeout, as sketched below.
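A minimal sketch of that adjustment, assuming the same `OpenAIEmbedding` setup as above; `embed_batch_size` controls how many texts go out per request, and the values here are illustrative:

```python
from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding

# Smaller batches keep each request short; a larger timeout tolerates slow responses
Settings.embed_model = OpenAIEmbedding(
    model="text-embedding-3-small",
    api_key="sk-xxx",
    embed_batch_size=32,  # illustrative value
    timeout=30,
)
```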
### 2) Your vector store is unreachable
If you use Pinecone, Qdrant, Weaviate, Milvus, or Postgres-backed retrieval and the server is down or misconfigured, queries can fail with timeouts during retrieval.
```python
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient

client = QdrantClient(url="http://qdrant:6333")  # wrong if container DNS doesn't resolve
vector_store = QdrantVectorStore(client=client, collection_name="docs")
```
Fix the hostname and verify the port is open from the same runtime.
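A minimal way to verify that, assuming the Qdrant host and port from the snippet above:

```python
import socket

# Connect test from the same runtime that runs LlamaIndex
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.settimeout(5)
try:
    sock.connect(("qdrant", 6333))
    print("vector store port is reachable")
except OSError as exc:
    print(f"cannot reach vector store: {exc}")
finally:
    sock.close()
```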
### 3) You’re hitting default timeout limits
Some clients default to short timeouts. Long prompts, slow models, and large context windows can exceed them.
```python
from llama_index.llms.openai import OpenAI

llm = OpenAI(
    model="gpt-4o-mini",
    api_key="sk-xxx",
    timeout=60,
)
```
Use a higher timeout for production workloads that call external APIs over unreliable networks.
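If the endpoint is slow rather than dead, retries help too. The `max_retries` parameter below exists on the OpenAI integration, but verify it against your installed version:

```python
from llama_index.llms.openai import OpenAI

# A generous timeout plus a few retries for flaky networks
llm = OpenAI(
    model="gpt-4o-mini",
    api_key="sk-xxx",
    timeout=60,
    max_retries=3,
)
```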
### 4) Proxy or firewall rules are blocking outbound traffic
This happens a lot in enterprise environments. Your code works locally but times out in VPCs, corporate laptops, or locked-down containers.
```bash
export HTTPS_PROXY=http://proxy.company.local:8080
export HTTP_PROXY=http://proxy.company.local:8080
```
If your org requires egress allowlisting, confirm the destination domain is approved.
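To confirm traffic is actually going through the proxy, a verbose `curl` from the same shell helps (the exact output varies by curl version):

```bash
# -v shows the CONNECT handshake with the proxy; --max-time bounds the attempt
curl -v --max-time 10 https://api.openai.com/v1/models
```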
## How to Debug It
1. **Find which call times out.** Look at the stack trace and identify whether it fails in:
   - `OpenAI.complete(...)`
   - `OpenAIEmbedding.get_text_embedding(...)`
   - vector store query methods

   The last LlamaIndex class in the trace tells you where to focus.
2. **Reproduce outside LlamaIndex.** Call the same endpoint with `curl` or a minimal Python script. If raw HTTP also times out, this is not a LlamaIndex bug.
```python
import httpx

# A 401 here still proves the endpoint is reachable; a timeout proves it is not
r = httpx.get("https://api.openai.com/v1/models", timeout=10)
print(r.status_code)
```
3. **Turn on verbose logging and measure timing.** Log request timing around index creation and query execution to see whether embeddings, retrieval, or generation is slow.
```python
import time

# Assumes query_engine was created earlier, e.g. via index.as_query_engine()
start = time.time()
response = query_engine.query("What does this document say?")
print(f"Elapsed: {time.time() - start:.2f}s")
print(response)
```
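For more detail than wall-clock timing, standard library logging surfaces the underlying HTTP and retrieval activity; this is the usual pattern from the LlamaIndex docs:

```python
import logging
import sys

# DEBUG level shows request/response activity from LlamaIndex and its HTTP clients
logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
```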
4. **Test each dependency separately.** Validate:
   - the LLM endpoint
   - the embedding endpoint
   - the vector DB connection
   - document source access

   One failing dependency can surface as a generic timeout in your app layer; a combined check is sketched below.
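A hedged sketch of such per-dependency checks. The URLs are placeholders for your own services (Qdrant, for instance, exposes a `/healthz` endpoint, but confirm the path for your store):

```python
import httpx

# Placeholder endpoints; substitute the services your pipeline actually uses
checks = {
    "llm": "https://api.openai.com/v1/models",
    "vector_db": "http://localhost:6333/healthz",
}

for name, url in checks.items():
    try:
        r = httpx.get(url, timeout=5)
        print(f"{name}: HTTP {r.status_code}")
    except httpx.HTTPError as exc:
        print(f"{name}: FAILED ({exc})")
```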
## Prevention

- Set explicit timeouts on every remote client used by LlamaIndex.
- Use health checks for LLMs, embedding services, and vector stores before starting ingestion jobs.
- Keep endpoints configurable via environment variables so dev, staging, and prod don’t share hardcoded URLs.
- For large workloads, batch embeddings and add retries with backoff instead of relying on one long request (a sketch follows below).
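A minimal sketch of retry-with-backoff; the `with_backoff` helper below is hypothetical, and libraries like `tenacity` do the same more robustly:

```python
import time

import httpx

def with_backoff(fn, attempts=4, base_delay=1.0):
    """Retry fn on timeouts with exponential backoff: 1s, 2s, 4s, ..."""
    for attempt in range(attempts):
        try:
            return fn()
        except httpx.TimeoutException:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

# Example (hypothetical): embed one text with retries
# embedding = with_backoff(lambda: Settings.embed_model.get_text_embedding("hello"))
```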
## Keep Learning

- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.