How to Integrate LangGraph for healthcare with Kubernetes for RAG

By Cyprian Aarons · Updated 2026-04-21

Combining LangGraph for healthcare with Kubernetes gives you a clean way to run regulated RAG workflows with stateful orchestration and predictable infrastructure. LangGraph handles the multi-step agent logic, while Kubernetes gives you deployment, scaling, and isolation for PHI-sensitive workloads.

This is useful when your assistant needs to retrieve clinical policy, summarize patient records, route cases to specialist tools, and keep audit-friendly execution traces. You get a system that can reason over documents without turning your app into a pile of ad hoc API calls.

Prerequisites

  • Python 3.10+
  • A Kubernetes cluster:
    • local: kind, minikube, or k3d
    • cloud: EKS, GKE, or AKS
  • kubectl configured and pointing at your cluster
  • A container registry for pushing images
  • LangGraph installed:
    • pip install langgraph langchain-openai
  • Kubernetes Python client installed:
    • pip install kubernetes
  • Access to your healthcare RAG data source:
    • vector store, document store, or FHIR-backed retrieval layer
  • Environment variables set:
    • OPENAI_API_KEY
    • KUBECONFIG if needed

Integration Steps

  1. Build the LangGraph workflow for healthcare RAG

Start with a graph that retrieves context, generates an answer, and stores traceable state. For healthcare, keep the state explicit so you can inspect what was retrieved and why.

from typing import TypedDict, List
from langgraph.graph import StateGraph, END
from langchain_openai import ChatOpenAI

class RAGState(TypedDict):
    question: str
    docs: List[str]
    answer: str

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def retrieve(state: RAGState) -> RAGState:
    # Replace with vector DB / FHIR search / document service call
    docs = [
        "Clinical guideline: hypertension follow-up within 4 weeks.",
        "Medication note: avoid NSAIDs in CKD stage 3."
    ]
    return {**state, "docs": docs}

def generate(state: RAGState) -> RAGState:
    prompt = f"""
    Question: {state['question']}
    Context: {state['docs']}
    Answer with concise clinical guidance.
    """
    response = llm.invoke(prompt)
    return {**state, "answer": response.content}

graph = StateGraph(RAGState)
graph.add_node("retrieve", retrieve)
graph.add_node("generate", generate)
graph.set_entry_point("retrieve")
graph.add_edge("retrieve", "generate")
graph.add_edge("generate", END)

app = graph.compile()
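Conceptually, `app.invoke` threads a single state dict through each node in order, with every node returning an updated copy. A minimal plain-Python sketch of that flow (stand-in nodes, no LangGraph or LLM required):

```python
# Plain-Python sketch of how the compiled graph threads state through nodes.
# Each node takes the state dict and returns an updated copy of it.

def retrieve(state):
    # Stand-in for a vector DB / FHIR lookup.
    docs = ["Clinical guideline: hypertension follow-up within 4 weeks."]
    return {**state, "docs": docs}

def generate(state):
    # Stand-in for the LLM call: summarize what was retrieved.
    answer = f"Based on {len(state['docs'])} document(s): see retrieved context."
    return {**state, "answer": answer}

def run_graph(question):
    state = {"question": question, "docs": [], "answer": ""}
    for node in (retrieve, generate):  # entry point -> retrieve -> generate -> END
        state = node(state)
    return state

result = run_graph("What follow-up is recommended for hypertension?")
print(result["answer"])
```

Because the state is an explicit dict at every step, you can log or persist it between nodes, which is what makes the real graph audit-friendly.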

  2. Package the graph as a service runnable in Kubernetes

Expose the graph through a small API so pods can process requests consistently. This keeps LangGraph execution inside the cluster where you control networking and secrets.

from fastapi import FastAPI
from pydantic import BaseModel

app_api = FastAPI()

class Query(BaseModel):
    question: str

@app_api.post("/rag")
def rag(query: Query):
    # Sync handler: FastAPI runs it in a threadpool, so the blocking
    # graph invoke call doesn't stall the event loop.
    result = app.invoke({"question": query.question, "docs": [], "answer": ""})
    return {"answer": result["answer"], "docs": result["docs"]}

Run this behind Uvicorn in your container image:

# main.py
# uvicorn main:app_api --host 0.0.0.0 --port 8080
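A minimal image sketch to go with that command (file names and base image are assumptions; pin dependency versions for production):

```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY main.py .
EXPOSE 8080
CMD ["uvicorn", "main:app_api", "--host", "0.0.0.0", "--port", "8080"]
```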

  3. Deploy the service to Kubernetes

Use the Kubernetes Python client when you need to automate rollout checks or create namespace-scoped resources from code. For most teams, manifests are enough for deployment and the SDK is useful for operational tooling.

from kubernetes import client, config

# load_kube_config() reads your local kubeconfig; when this code runs
# inside a pod, use config.load_incluster_config() instead.
config.load_kube_config()

namespace = "healthcare-rag"
deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="langgraph-rag"),
    spec=client.V1DeploymentSpec(
        replicas=2,
        selector=client.V1LabelSelector(
            match_labels={"app": "langgraph-rag"}
        ),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "langgraph-rag"}),
            spec=client.V1PodSpec(containers=[
                client.V1Container(
                    name="api",
                    image="your-registry/langgraph-rag:latest",
                    ports=[client.V1ContainerPort(container_port=8080)],
                )
            ])
        )
    )
)

apps = client.AppsV1Api()
apps.create_namespaced_deployment(namespace=namespace, body=deployment)

If you prefer manifests, keep them minimal and versioned alongside the graph code.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: langgraph-rag
spec:
  replicas: 2
  selector:
    matchLabels:
      app: langgraph-rag
  template:
    metadata:
      labels:
        app: langgraph-rag
    spec:
      containers:
        - name: api
          image: your-registry/langgraph-rag:latest
          ports:
            - containerPort: 8080
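The Deployment alone isn't addressable by other workloads in the cluster. A ClusterIP Service sketch that matches the labels above (names are assumptions):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: langgraph-rag
spec:
  selector:
    app: langgraph-rag
  ports:
    - port: 80
      targetPort: 8080
```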

  4. Connect retrieval services and secrets inside the cluster

Your graph should not hardcode database credentials or vector store endpoints. Mount them as Kubernetes Secrets and read them from environment variables in your Python code.

import os

VECTOR_DB_URL = os.environ["VECTOR_DB_URL"]
FHIR_BASE_URL = os.environ.get("FHIR_BASE_URL")

def retrieve(state):
    # Example placeholder for real retrieval call using env-configured services.
    docs = [
        f"Retrieved from {VECTOR_DB_URL}: discharge summary fragment",
        f"Retrieved from {FHIR_BASE_URL}: medication list fragment"
    ]
    return {**state, "docs": docs}

Create the secret in Kubernetes:

kubectl create secret generic rag-secrets \
  --from-literal=VECTOR_DB_URL='http://vectordb.default.svc.cluster.local' \
  --from-literal=FHIR_BASE_URL='http://fhir.default.svc.cluster.local'

Then reference it in your pod spec:

envFrom:
  - secretRef:
      name: rag-secrets
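With `envFrom`, a missing secret key surfaces only when the code first reads the variable. A small helper (hypothetical name, a sketch) that fails fast at startup with a clear error instead:

```python
import os

def require_env(name: str) -> str:
    """Read a required environment variable, failing fast with a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

# At service startup, validate everything the graph depends on, e.g.:
# VECTOR_DB_URL = require_env("VECTOR_DB_URL")
```

Failing at startup keeps a misconfigured pod from passing readiness and then erroring mid-request.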

  5. Add health checks and scale based on workload

For production RAG systems, treat health endpoints as first-class citizens. Kubernetes uses the liveness probe to restart unhealthy pods and the readiness probe to stop routing traffic to pods that can't serve requests, before either starts returning broken clinical answers.

@app_api.get("/healthz")
async def healthz():
    return {"status": "ok"}

And in Kubernetes:

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
resources:
  requests:
    cpu: "250m"
    memory: "512Mi"

If traffic spikes during clinic hours, attach an HPA later. Start with stable requests/limits first so the graph runtime doesn’t get starved under load.
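When you do attach an HPA, a minimal sketch targeting the Deployment above (replica counts and the CPU threshold are assumptions to tune against your own traffic):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: langgraph-rag
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: langgraph-rag
  minReplicas: 2
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

CPU-based scaling only works if the `resources.requests` block above is set, since utilization is computed against requests.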

Testing the Integration

Use a simple request against the service once it’s running in-cluster or port-forwarded locally.

import requests

resp = requests.post(
    "http://localhost:8080/rag",
    json={"question": "What follow-up is recommended for hypertension?"}
)

print(resp.status_code)
print(resp.json())

Expected output:

200
{
  'answer': '...',
  'docs': [
    'Clinical guideline: hypertension follow-up within 4 weeks.',
    'Medication note: avoid NSAIDs in CKD stage 3.'
  ]
}

If you want to verify pod health from Python too:

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
pods = v1.list_namespaced_pod(namespace="healthcare-rag", label_selector="app=langgraph-rag")

for pod in pods.items:
    print(pod.metadata.name, pod.status.phase)

Real-World Use Cases

  • Clinical policy assistant that retrieves internal guidelines and returns grounded answers with traceable graph state.
  • Prior authorization workflow that routes cases through document retrieval, criteria checking, and escalation steps inside Kubernetes.
  • Discharge summary copilot that pulls chart context from a vector store and generates patient-safe summaries under cluster-level controls.

By Cyprian Aarons, AI Consultant at Topiax.
