# How to Fix "callback not firing in production" in LangGraph (Python)
## What this error usually means

If your LangGraph callback works locally but never fires in production, the graph is usually executing, but the callback is attached to the wrong object, wrong lifecycle, or wrong runtime. In practice, this shows up when you move from a notebook or local `invoke()` test to an API server, worker, or async runtime.

The most common symptom: no exception, no callback output, and your tracing/logging hook stays silent even though the graph returns a result.
## The Most Common Cause

The #1 cause is attaching callbacks to the wrong layer of LangChain/LangGraph execution.

In production, people often pass callbacks into the graph node function, or forget that `CompiledStateGraph.invoke()` needs config-level callbacks. The graph runs, but your handler never receives `on_chain_start`, `on_chain_end`, or `on_tool_end`.
### Broken vs fixed pattern
| Broken pattern | Fixed pattern |
|---|---|
| Callback passed into the task function directly | Callback passed through config={"callbacks": [...]} |
| Works in ad hoc tests only | Works consistently in invoke() / ainvoke() |
| No `BaseCallbackHandler` events fired | Events fire as expected |
```python
from langchain.callbacks.base import BaseCallbackHandler
from langgraph.graph import StateGraph, END
from typing import TypedDict


class MyHandler(BaseCallbackHandler):
    def on_chain_start(self, serialized, inputs, **kwargs):
        print("chain started")

    def on_chain_end(self, outputs, **kwargs):
        print("chain ended")


class State(TypedDict):
    text: str


def node(state: State):
    return {"text": state["text"].upper()}


graph = StateGraph(State)
graph.add_node("upper", node)
graph.set_entry_point("upper")
graph.add_edge("upper", END)
app = graph.compile()

# BROKEN: MyHandler is defined but never passed anywhere useful
result = app.invoke({"text": "hello"})
```
```python
from langchain.callbacks.base import BaseCallbackHandler
from langgraph.graph import StateGraph, END
from typing import TypedDict


class MyHandler(BaseCallbackHandler):
    def on_chain_start(self, serialized, inputs, **kwargs):
        print("chain started")

    def on_chain_end(self, outputs, **kwargs):
        print("chain ended")


class State(TypedDict):
    text: str


def node(state: State):
    return {"text": state["text"].upper()}


graph = StateGraph(State)
graph.add_node("upper", node)
graph.set_entry_point("upper")
graph.add_edge("upper", END)
app = graph.compile()

# FIXED: callbacks passed in config
result = app.invoke(
    {"text": "hello"},
    config={"callbacks": [MyHandler()]},
)
```
If you are using LangSmith or OpenTelemetry wrappers, the same rule applies: pass them through runtime config or your app’s execution context. Don’t assume a callback attached at object construction will survive compilation and deployment.
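One way to make that rule hard to forget is to build the runtime config in a single helper and reuse it at every call site. A minimal sketch; `make_run_config` and `DiagnosticHandler` are hypothetical names, and the handler here is a plain stand-in class only so the snippet is self-contained (a real one should extend `BaseCallbackHandler`):

```python
class DiagnosticHandler:
    # Stand-in for illustration; in real code, extend BaseCallbackHandler
    def on_chain_start(self, serialized, inputs, **kwargs):
        print("chain started")


def make_run_config(request_id: str) -> dict:
    # One place that decides which callbacks and metadata every run gets,
    # so API handlers, workers, and tests all wire callbacks identically.
    return {
        "callbacks": [DiagnosticHandler()],
        "metadata": {"request_id": request_id},
    }


# Usage at the final call site:
# result = app.invoke({"text": "hello"}, config=make_run_config("req_123"))
```

Centralizing this also gives you a single spot to add tracing integrations later without touching every endpoint.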
## Other Possible Causes

### 1) You are using async code but calling sync entrypoints

If your nodes are async and your server is async too, calling `.invoke()` can create confusing behavior. Use `.ainvoke()` and await it.
```python
# Wrong
result = app.invoke({"text": "hello"})

# Right
result = await app.ainvoke({"text": "hello"}, config={"callbacks": [MyHandler()]})
```
This matters especially in FastAPI endpoints and background workers.
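The underlying failure mode is easy to reproduce without LangGraph at all: a coroutine that is never awaited never runs, so nothing inside it, callback dispatch included, ever fires. A toy sketch using plain `asyncio` (the `ainvoke` below is a stand-in function, not the LangGraph method):

```python
import asyncio


async def ainvoke(state, config=None):
    # Toy stand-in: record lifecycle events into any "callback" sinks,
    # then transform the state, mimicking a graph run.
    sinks = (config or {}).get("callbacks", [])
    for sink in sinks:
        sink.append("on_chain_start")
    result = {"text": state["text"].upper()}
    for sink in sinks:
        sink.append("on_chain_end")
    return result


fired: list = []

# Calling the coroutine function without awaiting it runs NOTHING:
coro = ainvoke({"text": "hello"}, config={"callbacks": [fired]})
# fired is still [] here; the body never executed
coro.close()  # suppress the "coroutine was never awaited" warning

# Awaiting it (here via asyncio.run) actually executes the body:
result = asyncio.run(ainvoke({"text": "hello"}, config={"callbacks": [fired]}))
```

If your framework silently drops an un-awaited coroutine, you get exactly the production symptom from this article: a "run" that produces no events and no errors.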
### 2) Your callback class is not a real LangChain handler

A plain Python class with methods named `on_chain_start` is not enough if it does not inherit from `BaseCallbackHandler`. LangChain's dispatch looks for the handler interface.
```python
# Wrong
class MyHandler:
    def on_chain_start(self, serialized, inputs, **kwargs):
        print("start")


# Right
from langchain.callbacks.base import BaseCallbackHandler


class MyHandler(BaseCallbackHandler):
    def on_chain_start(self, serialized, inputs, **kwargs):
        print("start")
```
If you want tool-level events too, implement the specific hooks you need:

- `on_chain_start`
- `on_chain_end`
- `on_tool_start`
- `on_tool_end`
- `on_llm_start`
- `on_llm_end`
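Why the duck-typed class fails silently can be modeled without LangChain: a dispatcher that filters handlers by base class simply skips objects that merely look right. This is an illustrative model, not LangChain's actual dispatch code:

```python
class BaseCallbackHandler:
    # Stand-in for the real base class
    def on_chain_start(self, serialized, inputs, **kwargs):
        pass


class RealHandler(BaseCallbackHandler):
    def __init__(self):
        self.events = []

    def on_chain_start(self, serialized, inputs, **kwargs):
        self.events.append("start")


class FakeHandler:
    # Same method names, but no inheritance
    def __init__(self):
        self.events = []

    def on_chain_start(self, serialized, inputs, **kwargs):
        self.events.append("start")


def dispatch(handlers, event, *args, **kwargs):
    for h in handlers:
        if isinstance(h, BaseCallbackHandler):  # fakes are silently skipped
            getattr(h, event)(*args, **kwargs)


real, fake = RealHandler(), FakeHandler()
dispatch([real, fake], "on_chain_start", {}, {"text": "hello"})
```

The fake handler raises no error anywhere, which is exactly why this cause is so hard to spot in production logs.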
### 3) You are swallowing exceptions inside the node

A callback may not fire if your code catches exceptions too early and returns fallback output before LangGraph can emit normal lifecycle events. This often happens with broad `except Exception:` blocks.
```python
def node(state):
    try:
        return {"text": risky_call(state["text"])}
    except Exception:
        return {"text": "fallback"}  # hides real failure path
```
Fix it by logging and re-raising during diagnosis:
```python
def node(state):
    try:
        return {"text": risky_call(state["text"])}
    except Exception as e:
        print(f"node failed: {e}")
        raise
```
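If you have many nodes, you can standardize that log-and-reraise pattern with a small decorator instead of repeating the try/except everywhere. A sketch; `logged_node` is a hypothetical helper, not a LangGraph API:

```python
import functools
import logging

logger = logging.getLogger(__name__)


def logged_node(fn):
    # Wrap a node function so failures are logged once (with traceback)
    # and re-raised, instead of being silently turned into fallback output.
    @functools.wraps(fn)
    def wrapper(state):
        try:
            return fn(state)
        except Exception:
            logger.exception("node %s failed", fn.__name__)
            raise
    return wrapper


@logged_node
def node(state):
    return {"text": state["text"].upper()}


@logged_node
def failing(state):
    raise ValueError("boom")
```

Because the exception still propagates, LangGraph's error-path lifecycle events stay intact while you keep a log trail.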
### 4) Your production worker strips callback context

Some queues and worker setups serialize only input payloads and drop execution metadata. If you enqueue raw state but not config/context, the worker runs without callbacks.
```python
# Bad: only payload gets queued
job = {"text": "hello"}

# Better: carry execution config too
# (note: handler instances are not JSON-serializable; if the job crosses a
# process boundary, enqueue a handler name and rebuild it in the worker)
job = {
    "input": {"text": "hello"},
    "config": {
        "callbacks": [MyHandler()],
        "metadata": {"request_id": "req_123"},
    },
}
```
This is common with Celery, RQ, SQS consumers, and custom job runners.
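Since handler instances generally cannot be serialized into a queue payload, a workable pattern is to enqueue handler *names* and rebuild the objects inside the worker. A self-contained sketch; the registry and `DiagnosticHandler` are hypothetical, and the handler is a plain stand-in (a real one would extend `BaseCallbackHandler`):

```python
import json


class DiagnosticHandler:
    # Stand-in for illustration; real handlers extend BaseCallbackHandler
    pass


# Worker-side registry: maps serializable names to handler factories
HANDLER_REGISTRY = {"diagnostic": DiagnosticHandler}


def enqueue(payload: dict) -> str:
    # Only JSON-safe data crosses the queue: payload, handler names, metadata
    job = {
        "input": payload,
        "config": {
            "callback_names": ["diagnostic"],
            "metadata": {"request_id": "req_123"},
        },
    }
    return json.dumps(job)


def handle(raw_job: str):
    # Worker side: deserialize the job and rebuild live handler objects
    job = json.loads(raw_job)
    config = {
        "callbacks": [
            HANDLER_REGISTRY[name]()
            for name in job["config"]["callback_names"]
        ],
        "metadata": job["config"]["metadata"],
    }
    return job["input"], config


# The worker would then call: app.invoke(inputs, config=config)
```

This keeps the queue payload portable across Celery, RQ, and SQS while still restoring full callback wiring before the graph runs.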
## How to Debug It

- Confirm whether the graph executes at all.
  - Add a plain `print()` inside the first node.
  - If that prints but callbacks do not fire, this is a handler/config issue.
  - If neither prints, your graph entrypoint or worker routing is broken.
- Check whether you are using the right entrypoint.
  - For sync code use `.invoke()`.
  - For async code use `.ainvoke()`.
  - For streaming use `.stream()` or `.astream()` depending on runtime.
  - A mismatch here often explains "works locally" failures.
- Verify handler inheritance and config placement.
  - Confirm your class extends `BaseCallbackHandler`.
  - Confirm callbacks are passed in `config={"callbacks": [...]}`.
  - Confirm you are passing config to the actual call site that executes the compiled graph.
- Turn on verbose logging for one run.
  - Print request IDs and state keys before invoking.
  - If using LangSmith or tracing middleware, verify environment variables in production.
  - Check for suppressed exceptions in middleware or task runners.
## Prevention

- Always pass callbacks through runtime config at the final `.invoke()` / `.ainvoke()` call site.
- Use one integration test that runs the compiled graph in the same runtime as production: FastAPI, Celery worker, Lambda runtime, whatever you deploy.
- Keep a minimal diagnostic handler around that prints `on_chain_start` and `on_chain_end` so you can confirm event propagation quickly.
If you want this to stop being a recurring incident class in your team:

- standardize callback wiring in one helper function,
- avoid broad exception swallowing inside nodes,
- and test both sync and async execution paths before shipping.
## Keep learning

- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.