# How to Fix 'connection timeout' in LlamaIndex (Python)

By Cyprian Aarons · Updated 2026-04-21

## What the error means

connection timeout in LlamaIndex usually means your Python process tried to reach an upstream service, but the request never completed before the network client gave up. In practice, this shows up when LlamaIndex is calling an LLM, embedding API, vector database, or document source over HTTP.

You’ll typically see it during query_engine.query(...), index.from_documents(...), embedding generation, or any remote retriever call. The stack trace often includes httpx.ConnectTimeout, httpx.ReadTimeout, or a wrapper like openai.APITimeoutError.

## The Most Common Cause

The #1 cause is using the wrong endpoint or a service that is unreachable from your runtime. In LlamaIndex, this often happens when `Settings.llm` or `Settings.embed_model` points to a remote provider with a bad base URL, missing proxy config, or a dead host.

Compare the broken pattern with the fixed one:

**Broken**

```python
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI

# Wrong: bad base_url or unreachable host
Settings.llm = OpenAI(
    model="gpt-4o-mini",
    api_key="sk-xxx",
    api_base="https://api.openai.local/v1",  # not reachable
)

response = Settings.llm.complete("Hello")
print(response)
```

**Fixed**

```python
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI

# Right: valid endpoint
Settings.llm = OpenAI(
    model="gpt-4o-mini",
    api_key="sk-xxx",
    api_base="https://api.openai.com/v1",
)

response = Settings.llm.complete("Hello")
print(response)
```


If you’re using Azure OpenAI, Ollama, or a self-hosted gateway, make sure the host is reachable from the machine running Python. A local notebook can hit `localhost`, but inside Docker, Kubernetes, or a CI runner, `localhost` refers to the container itself, not your host machine.
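A quick way to check reachability from the exact runtime is a plain TCP connect. A minimal sketch using only the standard library (the host and port below are placeholders; substitute the LLM gateway or vector-store host your process must reach):

```python
import socket

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers timeouts, refusals, and DNS failures
        return False

print(can_connect("api.openai.com", 443))
```

Run this from the same container or VM that runs LlamaIndex; a `False` here means no amount of client configuration will help until the network path is fixed.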

## Other Possible Causes

### 1) Embedding calls are timing out

LlamaIndex may be fine, but embeddings are slow or blocked. This is common with large batches or cold starts.

```python
from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding

Settings.embed_model = OpenAIEmbedding(
    model="text-embedding-3-small",
    api_key="sk-xxx",
    timeout=10,
)
```

If your document set is large, reduce the batch size or increase the timeout.
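Reducing batch size can be as simple as chunking the texts yourself before embedding. A stdlib-only sketch of the chunking step (the batch size of 32 is an arbitrary example):

```python
def batched(items: list, size: int):
    """Yield fixed-size chunks so each embedding request stays small."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

texts = [f"doc chunk {i}" for i in range(70)]
for batch in batched(texts, 32):
    # Each batch would go to embed_model.get_text_embedding_batch(batch)
    print(len(batch))  # prints 32, 32, 6
```

`OpenAIEmbedding` also accepts an `embed_batch_size` argument if you prefer to let LlamaIndex handle the batching for you.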

### 2) Your vector store is unreachable

If you use Pinecone, Qdrant, Weaviate, Milvus, or Postgres-backed retrieval and the server is down or misconfigured, queries can fail with timeouts during retrieval.

```python
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient

client = QdrantClient(url="http://qdrant:6333")  # wrong if container DNS doesn't resolve
vector_store = QdrantVectorStore(client=client, collection_name="docs")
```

Fix the hostname and verify the port is open from the same runtime.

### 3) You’re hitting default timeout limits

Some clients default to short timeouts. Long prompts, slow models, and large context windows can exceed them.

```python
from llama_index.llms.openai import OpenAI

llm = OpenAI(
    model="gpt-4o-mini",
    api_key="sk-xxx",
    timeout=60,
)
```

Use a higher timeout for production workloads that call external APIs over unreliable networks.

### 4) Proxy or firewall rules are blocking outbound traffic

This happens a lot in enterprise environments. Your code works locally but times out in VPCs, corporate laptops, or locked-down containers.

```shell
export HTTPS_PROXY=http://proxy.company.local:8080
export HTTP_PROXY=http://proxy.company.local:8080
```

If your org requires egress allowlisting, confirm the destination domain is approved.
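Exported variables don't always reach the process: shells, systemd units, and containers each carry their own environment. A quick check from inside Python (httpx and requests pick these variables up automatically when `trust_env` is enabled, which is the default):

```python
import os

# Print proxy-related variables as seen by this Python process.
for var in ("HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY"):
    print(f"{var}={os.environ.get(var, '<unset>')}")
```

If these print `<unset>` where you expected a proxy URL, fix how the process is launched before touching LlamaIndex config.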

## How to Debug It

1. **Find which call times out.** Look at the stack trace and identify whether it fails in:
   - `OpenAI.complete(...)`
   - `OpenAIEmbedding.get_text_embedding(...)`
   - vector store query methods

   The last LlamaIndex class in the trace tells you where to focus.

2. **Reproduce outside LlamaIndex.** Call the same endpoint with curl or a minimal Python script. If raw HTTP also times out, this is not a LlamaIndex bug.

   ```python
   import httpx

   r = httpx.get("https://api.openai.com/v1/models", timeout=10)
   print(r.status_code)
   ```

3. **Turn on verbose logging.** Log request timing around index creation and query execution, and measure whether embeddings, retrieval, or generation is slow.

   ```python
   import time

   start = time.time()
   response = query_engine.query("What does this document say?")
   print(f"Elapsed: {time.time() - start:.2f}s")
   print(response)
   ```

4. **Test each dependency separately.** Validate:
   - the LLM endpoint
   - the embedding endpoint
   - the vector DB connection
   - document source access

   One failing dependency can surface as a generic timeout in your app layer.
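A lightweight way to attribute time to each stage is a small timing context manager. This is a sketch; the `query_engine` call shown in the comment stands for whatever engine your app builds:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(stage: str):
    """Print how long a stage took, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{stage}: {time.perf_counter() - start:.2f}s")

# In your app, wrap each dependency separately, e.g.:
#   with timed("retrieval+generation"):
#       response = query_engine.query("...")
with timed("demo sleep"):
    time.sleep(0.05)
```

Wrapping embedding, retrieval, and generation in separate `timed(...)` blocks tells you immediately which dependency is eating the budget.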

## Prevention

  • Set explicit timeouts on every remote client used by LlamaIndex.
  • Use health checks for LLMs, embedding services, and vector stores before starting ingestion jobs.
  • Keep endpoints configurable via environment variables so dev, staging, and prod don’t share hardcoded URLs.
  • For large workloads, batch embeddings and add retries with backoff instead of relying on one long request.
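The last point can be sketched with a small stdlib-only retry helper; the attempt count and backoff values are arbitrary examples, and the flaky function is a stub standing in for a real remote call:

```python
import random
import time

def with_retries(fn, attempts=4, base_delay=1.0):
    """Call fn, retrying on TimeoutError with exponential backoff + jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            # Delays of base, 2x, 4x, ... plus jitter to avoid thundering herds
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))

# Demo with a stub that times out twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"

print(with_retries(flaky, base_delay=0.01))  # ok
```

In real code you would catch the timeout class your client actually raises (e.g. `httpx.TimeoutException` or `openai.APITimeoutError`) instead of the built-in `TimeoutError`.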

By Cyprian Aarons, AI Consultant at Topiax.