How to Fix 'deployment crash' in LlamaIndex (Python)

By Cyprian Aarons · Updated 2026-04-21
Tags: deployment-crash, llamaindex, python

When you see a 'deployment crash' error in a LlamaIndex app, it usually means the model endpoint accepted your request but failed while serving it. In practice, this shows up when you call an LLM or embedding deployment with the wrong model name, missing credentials, bad transport config, or a payload the backend can’t handle.

This error often appears during Settings.llm initialization, during VectorStoreIndex.from_documents(...), or on the first query against a provider such as Azure OpenAI, an OpenAI-compatible gateway, or a local inference server.

The Most Common Cause

The #1 cause is a mismatch between the deployment name and the actual model/provider configuration.

With LlamaIndex, people often wire up AzureOpenAI or an OpenAI-compatible client and assume the model= value doubles as the deployment name. On Azure, the two are separate: deployment_name (aliased internally to engine) must match the deployment you created in the Azure portal, model should name the underlying model, and azure_endpoint, api_version, and credentials must also line up.

Broken vs fixed

Broken pattern                                    Fixed pattern
Passes a deployment name that does not exist      Uses the exact Azure deployment name
Omits required Azure config                       Sets endpoint, key, and API version explicitly
# BROKEN
from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.core import Settings

llm = AzureOpenAI(
    model="gpt-4o",  # wrong if your Azure deployment is named "prod-gpt4o"
    deployment_name="gpt-4o",
    api_key="...",
    azure_endpoint="https://my-resource.openai.azure.com/",
)

Settings.llm = llm
# FIXED
from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.core import Settings

llm = AzureOpenAI(
    model="prod-gpt4o",  # must match the Azure deployment name
    deployment_name="prod-gpt4o",
    api_key="...",
    azure_endpoint="https://my-resource.openai.azure.com/",
    api_version="2024-02-15-preview",
)

Settings.llm = llm
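
With the client configured, a quick smoke test confirms the deployment actually responds before you build an index on top of it (the prompt text here is arbitrary):

# One-line smoke test: fails immediately if the deployment name, key, or endpoint is wrong
print(llm.complete("ping"))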

If this is your issue, you’ll usually see something like:

  • BadRequestError: Error code: 400 - {'error': {'message': 'deployment crash', ...}}
  • openai.BadRequestError: The API deployment for this resource does not exist
  • litellm.BadRequestError: LLM Provider NOT provided (typical when a proxy layer is misconfigured)

Other Possible Causes

1) Wrong environment variables or missing secrets

LlamaIndex reads provider config from env vars in many setups. If your app works locally but crashes in CI or Docker, check the runtime environment first.

# Example: missing key in container
import os
print(os.getenv("AZURE_OPENAI_API_KEY"))
print(os.getenv("AZURE_OPENAI_ENDPOINT"))

If either prints None, your deployment will fail before inference starts.
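
A stricter variant fails fast at startup instead of printing, here assuming the standard Azure variable names:

import os

# Raise immediately at startup instead of crashing mid-request
for var in ("AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT"):
    if not os.getenv(var):
        raise RuntimeError(f"Missing required env var: {var}")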

2) Mixing sync and async clients incorrectly

A common failure mode is calling async query APIs from sync code or reusing a client across event loops.

# BROKEN
response = await index.as_query_engine().query("What is this?")
# `await` inside a plain def is a SyntaxError, and .query() is synchronous anyway

# FIXED
query_engine = index.as_query_engine()
response = query_engine.query("What is this?")

If you’re using async, keep the whole path async:

import asyncio

async def main():
    query_engine = index.as_query_engine()
    response = await query_engine.aquery("What is this?")

asyncio.run(main())

3) Context window too large for the selected model

If you stuff too many documents into one prompt, some providers fail with opaque backend errors that surface as a crash.

from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(similarity_top_k=20)

Reduce retrieval size and chunking pressure:

query_engine = index.as_query_engine(similarity_top_k=5)

Also make chunks smaller when ingesting:

from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=512, chunk_overlap=64)
# apply the splitter at ingestion time so each node stays small
index = VectorStoreIndex.from_documents(documents, transformations=[splitter])

4) Bad proxy / gateway configuration

If you route LlamaIndex through LiteLLM, OpenRouter, vLLM, Ollama, or an internal gateway, one bad base URL or auth header can produce a generic deployment failure.

# Example of an OpenAI-compatible client pointed at the wrong base URL
from llama_index.llms.openai import OpenAI

llm = OpenAI(
    model="gpt-4o-mini",
    api_base="http://localhost:8000/v1",  # wrong if nothing serves there
    api_key="sk-not-used",
)

Make sure the server actually exposes an OpenAI-compatible /v1/chat/completions endpoint.
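
One way to verify is to probe the base URL before wiring it into LlamaIndex. A sketch using httpx against the hypothetical local server above; most OpenAI-compatible servers also expose /v1/models, which makes a cheap liveness check:

import httpx

# Probe the gateway first (adjust the base URL to match your setup)
try:
    resp = httpx.get("http://localhost:8000/v1/models", timeout=5.0)
    print(resp.status_code, resp.text[:200])
except httpx.ConnectError:
    print("Nothing is serving at that base URL")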

How to Debug It

  1. Print the exact exception

    • Don’t stop at “deployment crash”.
    • Capture the full stack trace (see the sketch after this list) and look for provider-specific classes like:
      • openai.BadRequestError
      • litellm.BadRequestError
      • httpx.ConnectError
      • aiohttp.ClientResponseError
  2. Verify provider config outside LlamaIndex

    • Test the same endpoint with a raw SDK call.
    • If raw OpenAI/Azure/OpenRouter fails too, this is not a LlamaIndex bug.
  3. Check the effective runtime settings

    • Log these values at startup:
      • model/deployment name
      • API base/endpoint
      • API version
      • auth headers / env vars
    • In production containers, assume env drift until proven otherwise.
  4. Reduce to one document and one query

    • Remove retrievers, agents, tools, and custom prompts.
    • Start with:
      from llama_index.core import VectorStoreIndex
      
      index = VectorStoreIndex.from_documents([documents[0]])
      print(index.as_query_engine().query("Summarize this"))
      
    • If that works, scale back up until it breaks.
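
For steps 1 and 3, a minimal pattern looks like this (a sketch assuming an index is already built and Azure-style env var names):

import os
import traceback

# Step 3: log the effective runtime settings at startup
print("endpoint:", os.getenv("AZURE_OPENAI_ENDPOINT"))
print("api key set:", bool(os.getenv("AZURE_OPENAI_API_KEY")))

# Step 1: capture the full provider-specific stack trace, not just "deployment crash"
try:
    print(index.as_query_engine().query("Summarize this"))
except Exception:
    traceback.print_exc()
    raise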

Prevention

  • Pin your provider config in one place.

    • Don’t scatter model, api_base, and keys across modules.
    • Build one settings module and import it everywhere (see the sketch after this list).
  • Add startup validation.

    • Fail fast if required env vars are missing.
    • Check that your deployment name exists before serving traffic.
  • Keep ingestion conservative.

    • Use sane chunk sizes.
    • Start with low similarity_top_k.
    • Avoid dumping huge PDFs into a single prompt path.
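
As a sketch of the first two items, a single settings module can validate the environment and pin the provider config in one place (all names here are illustrative, not from the original setup):

# settings.py: the one place that configures the LLM
import os

from llama_index.core import Settings
from llama_index.llms.azure_openai import AzureOpenAI

REQUIRED = ("AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT")
missing = [var for var in REQUIRED if not os.getenv(var)]
if missing:
    raise RuntimeError(f"Missing required env vars: {missing}")

Settings.llm = AzureOpenAI(
    model="gpt-4o",  # underlying model name
    deployment_name="prod-gpt4o",  # illustrative: use your real deployment name
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_version="2024-02-15-preview",
)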

If you’re still seeing deployment crash after checking these items, treat it as a provider integration issue first and a LlamaIndex issue second. The fastest fix is usually in your endpoint name, auth config, or prompt size—not in the retrieval code itself.

