LlamaIndex Tutorial (Python): deploying to AWS Lambda for advanced developers
This tutorial shows how to package a LlamaIndex-powered Python function and deploy it to AWS Lambda with a clean handler, minimal cold-start overhead, and environment-based configuration. You need this when you want an LLM-backed retrieval endpoint that scales on demand without running a long-lived server.
What You'll Need
- Python 3.11
- AWS account with permissions for:
  - Lambda
  - IAM
  - CloudWatch Logs
- AWS CLI configured locally
- A working OpenAI API key
- These Python packages:
  - llama-index
  - llama-index-llms-openai
  - llama-index-embeddings-openai
  - boto3 (if you want to script deployment later)
- A small local dataset, or a plan to fetch docs from S3 / DynamoDB / an internal API
- A deployment method:
  - ZIP upload, or
  - AWS SAM, or
  - Serverless Framework
Step-by-Step
- Create a Lambda-friendly project layout and keep initialization outside the handler. The main rule is simple: build the index once per container lifecycle, not on every invocation.
# app.py
import os

from llama_index.core import Settings, VectorStoreIndex, Document
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# Module-level setup runs once per container, so warm invocations skip it.
Settings.llm = OpenAI(model="gpt-4o-mini", api_key=os.environ["OPENAI_API_KEY"])
Settings.embed_model = OpenAIEmbedding(
    model="text-embedding-3-small",
    api_key=os.environ["OPENAI_API_KEY"],
)

DOCUMENTS = [
    Document(text="Lambda is a serverless compute service."),
    Document(text="LlamaIndex helps build retrieval and agent workflows."),
]

INDEX = VectorStoreIndex.from_documents(DOCUMENTS)
QUERY_ENGINE = INDEX.as_query_engine(similarity_top_k=2)
- Add a handler that accepts API Gateway-style input and returns JSON. Keep the response shape boring and explicit so debugging in CloudWatch is straightforward.
# app.py continued
import json

def lambda_handler(event, context):
    # API Gateway proxy integration delivers the request body as a JSON string.
    body = event.get("body") or "{}"
    payload = json.loads(body) if isinstance(body, str) else body
    question = payload.get("question", "What is LlamaIndex?")

    response = QUERY_ENGINE.query(question)

    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(
            {
                "question": question,
                "answer": str(response),
            }
        ),
    }
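Before packaging anything, you can smoke-test the handler locally with a fake proxy event. This is a minimal sketch; local_test.py is a name chosen here for illustration, and it assumes OPENAI_API_KEY is exported in your shell.

# local_test.py -- run with: python local_test.py
import json

from app import lambda_handler

# Mimic the API Gateway proxy envelope: the body arrives as a JSON string.
event = {"body": json.dumps({"question": "What is Lambda?"})}
print(lambda_handler(event, None))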
- Package dependencies into the Lambda deployment artifact. For advanced setups, use a build directory so you can reproduce the same artifact locally and in CI.
mkdir -p build
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install \
    llama-index \
    llama-index-llms-openai \
    llama-index-embeddings-openai \
    -t build
cp app.py build/
cd build
zip -r ../lambda-package.zip .
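One caveat: if you build on macOS or Windows, any dependency that ships native wheels will not match the Lambda Linux runtime. One workaround, sketched below, is pip's cross-platform resolution flags; building inside Docker (for example via AWS SAM) achieves the same result.

pip install \
    --platform manylinux2014_x86_64 \
    --only-binary=:all: \
    --python-version 3.11 \
    llama-index \
    llama-index-llms-openai \
    llama-index-embeddings-openai \
    -t build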
- Create the Lambda function with environment variables for secrets. Use Python 3.11 and give the function enough memory to avoid painful cold starts from dependency imports.
aws lambda create-function \
    --function-name llamaindex-query \
    --runtime python3.11 \
    --handler app.lambda_handler \
    --zip-file fileb://lambda-package.zip \
    --role arn:aws:iam::123456789012:role/lambda-execution-role \
    --timeout 30 \
    --memory-size 1024 \
    --environment Variables="{OPENAI_API_KEY=sk-your-key}"
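For later deploys of the same function, update the code in place instead of recreating it:

aws lambda update-function-code \
    --function-name llamaindex-query \
    --zip-file fileb://lambda-package.zip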
- Invoke the function directly before wiring API Gateway. This catches packaging mistakes fast, especially missing wheels or bad import paths.
# AWS CLI v2 treats --payload as base64 by default; this flag accepts raw JSON.
aws lambda invoke \
    --function-name llamaindex-query \
    --cli-binary-format raw-in-base64-out \
    --payload '{"body":"{\"question\":\"What does this Lambda do?\"}"}' \
    response.json

cat response.json
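If everything is wired correctly, response.json holds the proxy-style envelope built by the handler. The answer text is elided here because it varies:

{"statusCode": 200, "headers": {"Content-Type": "application/json"}, "body": "{\"question\": \"What does this Lambda do?\", \"answer\": \"...\"}"}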
- If you want better performance on real workloads, move document loading out of the code bundle and into S3 or another managed store. That keeps deployments small and lets you refresh content without rebuilding the whole function.
# app_s3.py
import os
import json

import boto3
from llama_index.core import Settings, VectorStoreIndex, Document
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# Same model configuration as app.py.
Settings.llm = OpenAI(model="gpt-4o-mini", api_key=os.environ["OPENAI_API_KEY"])
Settings.embed_model = OpenAIEmbedding(
    model="text-embedding-3-small",
    api_key=os.environ["OPENAI_API_KEY"],
)

# Runs once per container: pull the corpus from S3 and build the index in memory.
s3 = boto3.client("s3")
bucket = os.environ["DOC_BUCKET"]
key = os.environ["DOC_KEY"]

obj = s3.get_object(Bucket=bucket, Key=key)
text = obj["Body"].read().decode("utf-8")

INDEX = VectorStoreIndex.from_documents([Document(text=text)])
QUERY_ENGINE = INDEX.as_query_engine()

def lambda_handler(event, context):
    payload = json.loads(event.get("body") or "{}")
    result = QUERY_ENGINE.query(payload.get("question", "Summarize the document"))
    return {"statusCode": 200, "body": json.dumps({"answer": str(result)})}
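DOC_BUCKET and DOC_KEY are names chosen for this example. Point them at your corpus with a configuration update (note that --environment replaces the entire variable map, so repeat OPENAI_API_KEY), and make sure the execution role has s3:GetObject on the bucket:

aws lambda update-function-configuration \
    --function-name llamaindex-query \
    --environment Variables="{OPENAI_API_KEY=sk-your-key,DOC_BUCKET=my-docs-bucket,DOC_KEY=corpus.txt}"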
Testing It
Start by checking CloudWatch Logs for import errors, missing environment variables, or timeouts during startup. If the function returns a blank body or a generic error, your first suspect should be JSON parsing of the event body in the handler.
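A quick way to watch those logs from a terminal (AWS CLI v2):

aws logs tail /aws/lambda/llamaindex-query --follow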
Then test both direct invocation and API Gateway integration separately. Direct invocation validates your Python code; API Gateway validates request mapping and response formatting.
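Once API Gateway is in place (see Next Steps), an endpoint test looks like the sketch below; the URL is a placeholder for your own API ID, region, and route:

curl -s -X POST \
    -H "Content-Type: application/json" \
    -d '{"question":"What does this Lambda do?"}' \
    https://abc123.execute-api.us-east-1.amazonaws.com/prod/query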
Watch cold-start duration closely if you add more dependencies or larger indexes. For production use, keep the deployment package lean and push large corpora into S3, DynamoDB, OpenSearch, or a hosted vector store.
If responses look correct but slow down under load, increase memory before doing anything else. On Lambda, more memory also means more CPU, which usually helps LlamaIndex startup and inference latency.
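Memory is adjustable without repackaging anything:

aws lambda update-function-configuration \
    --function-name llamaindex-query \
    --memory-size 2048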
Next Steps
- Add API Gateway + Lambda proxy integration so your function becomes an HTTP endpoint (a quick-create sketch follows this list).
- Move from in-memory documents to a persistent vector store like Pinecone, OpenSearch Serverless, or PostgreSQL pgvector.
- Add structured tracing with AWS X-Ray or OpenTelemetry so you can measure retrieval latency separately from model latency.
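For the first item, API Gateway's HTTP API quick create can wire a Lambda proxy integration in two calls. A sketch, assuming the placeholder account ID from the create-function step and us-east-1 as the region:

# Create an HTTP API with a default Lambda proxy integration.
aws apigatewayv2 create-api \
    --name llamaindex-api \
    --protocol-type HTTP \
    --target arn:aws:lambda:us-east-1:123456789012:function:llamaindex-query

# Grant API Gateway permission to invoke the function.
aws lambda add-permission \
    --function-name llamaindex-query \
    --statement-id apigateway-invoke \
    --action lambda:InvokeFunction \
    --principal apigateway.amazonaws.com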
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.