# How to Fix "tool calling failure when scaling" in LangGraph (Python)
When LangGraph throws a "tool calling failure when scaling" error, it usually means your agent worked in a small test run, then broke once you added parallelism, more nodes, or longer conversations. In practice, this often shows up as tool calls not being routed back into the graph state correctly, or messages being mutated in a way that only fails under load.
The key point: this is rarely a “LangGraph is broken” issue. It’s usually a state shape, message handling, or concurrency bug that scaling exposed.
## The Most Common Cause
The #1 cause is incorrect handling of `ToolMessage` / `AIMessage.tool_calls` across graph steps, especially when people manually append messages or return partial state from nodes.

A common broken pattern is to call the model, detect tool calls, and then fail to preserve the full message history and tool-response contract that LangGraph expects.
| Broken pattern | Fixed pattern |
|---|---|
| Manually appending raw dicts or overwriting `messages` | Returning proper `BaseMessage` objects and using `add_messages` |
| Dropping the assistant message that contains `tool_calls` | Keeping the assistant message in state until the tool node responds |
| Returning only the latest message instead of full state updates | Returning incremental updates through LangGraph reducers |
### Broken code

```python
from typing import TypedDict

from langgraph.graph import StateGraph, END
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

class State(TypedDict):
    messages: list

def agent_node(state: State):
    # BROKEN: overwrites messages and may lose tool call context
    response = llm.invoke(state["messages"])
    return {"messages": [response.content]}  # wrong type, wrong shape

def tool_node(state: State):
    # BROKEN: assumes a tool call exists without preserving the assistant msg
    return {"messages": [{"role": "tool", "content": "done"}]}

graph = StateGraph(State)
graph.add_node("agent", agent_node)
graph.add_node("tool", tool_node)
graph.set_entry_point("agent")
graph.add_edge("agent", END)  # BROKEN: the tool node is never wired in

app = graph.compile()
```
This kind of code often works in trivial tests and then fails with errors like:
- `langgraph.errors.InvalidUpdateError: Expected dict, got str`
- `langchain_core.messages.base.BaseMessage expected`
- `Tool call not found in AIMessage`
- `tool calling failure when scaling`
### Fixed code

```python
from typing import Annotated, TypedDict

from langgraph.graph import StateGraph, END
from langgraph.graph.message import add_messages
from langchain_openai import ChatOpenAI
from langchain_core.messages import ToolMessage

# in a real agent, bind your tools first (llm.bind_tools(tools))
# so the model can emit tool_calls at all
llm = ChatOpenAI(model="gpt-4o-mini")

class State(TypedDict):
    messages: Annotated[list, add_messages]

def agent_node(state: State):
    response = llm.invoke(state["messages"])
    # keep the AIMessage intact so tool_calls remain available
    return {"messages": [response]}

def tool_node(state: State):
    last_ai_msg = state["messages"][-1]
    tool_call = last_ai_msg.tool_calls[0]
    result = f"Executed {tool_call['name']} with args {tool_call['args']}"
    return {
        "messages": [
            ToolMessage(
                content=result,
                tool_call_id=tool_call["id"],
            )
        ]
    }

def route_after_agent(state: State):
    # route to the tool node only when the last AI message requested a tool
    last = state["messages"][-1]
    return "tool" if getattr(last, "tool_calls", None) else END

graph = StateGraph(State)
graph.add_node("agent", agent_node)
graph.add_node("tool", tool_node)
graph.set_entry_point("agent")
graph.add_conditional_edges("agent", route_after_agent, {"tool": "tool", END: END})
graph.add_edge("tool", "agent")  # loop back so the agent sees the tool result

app = graph.compile()
```
The important changes:

- Use `Annotated[list, add_messages]` so LangGraph merges messages correctly.
- Return actual message objects like `AIMessage` and `ToolMessage`.
- Preserve the assistant message that contains `tool_calls`.
- Match each `ToolMessage.tool_call_id` to the original call ID.
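To sanity-check the wiring, a quick smoke test helps. This is a minimal sketch: it assumes the compiled `app` from the fixed code above and an API key in your environment, and the prompt text is just a placeholder.

```python
from langchain_core.messages import HumanMessage

# the reducer should accumulate real message objects, never raw strings
result = app.invoke({"messages": [HumanMessage(content="What can you do?")]})
for msg in result["messages"]:
    print(type(msg).__name__, str(getattr(msg, "content", ""))[:80])
```

If any line prints `str` or `dict` instead of a message class, a node is still returning the wrong shape.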
## Other Possible Causes
### 1) Tool schema mismatch
If your tool signature doesn’t match what the model emits, LangChain can’t parse arguments cleanly.
```python
from langchain_core.tools import tool

# Broken: the model often sends string IDs
@tool
def lookup_policy(policy_id: int):
    """Look up a policy by ID."""
    ...

# Better: accept strings and parse inside the tool if needed
@tool
def lookup_policy(policy_id: str):
    """Look up a policy by ID."""
    ...
```
If you’re using Pydantic schemas:
```python
from pydantic import BaseModel
from langchain_core.tools import tool

class LookupPolicyInput(BaseModel):
    policy_id: str

@tool(args_schema=LookupPolicyInput)
def lookup_policy(policy_id: str):
    """Look up a policy by ID."""
    ...
```
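If you can't change the field type, a `before` validator can coerce whatever the model emits. This is a sketch using Pydantic v2's `field_validator`; the blanket `str()` coercion is an assumption, so adjust it to your ID format.

```python
from pydantic import BaseModel, field_validator

class LookupPolicyInput(BaseModel):
    policy_id: str

    @field_validator("policy_id", mode="before")
    @classmethod
    def coerce_id(cls, value):
        # models sometimes emit ints for ID-like fields; accept both
        return str(value)
```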
### 2) Non-deterministic shared state under concurrency
Scaling often means multiple runs or branches hit shared mutable objects.
```python
# Broken: shared global list mutated by every run
shared_messages = []

def node(state):
    shared_messages.extend(state["messages"])
```
Use per-run state only:
```python
def node(state):
    # work on a per-run copy; never mutate anything outside the returned state
    local_messages = list(state["messages"])
    ...
```
If you’re storing checkpoints, make sure your checkpointer is thread-safe and keyed by unique thread/session IDs.
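For example, with the in-memory checkpointer that ships with LangGraph (a minimal sketch; the `thread_id` value is a placeholder, and `graph` is the `StateGraph` from the fixed code):

```python
from langgraph.checkpoint.memory import MemorySaver
from langchain_core.messages import HumanMessage

checkpointer = MemorySaver()
app = graph.compile(checkpointer=checkpointer)

# every session gets its own thread_id, so checkpoints never collide
config = {"configurable": {"thread_id": "session-abc-123"}}
app.invoke({"messages": [HumanMessage(content="hi")]}, config=config)
```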
### 3) Returning invalid node outputs
LangGraph nodes must return a dict matching the graph state contract. (The functions you pass to `add_conditional_edges` are the exception: those return a route name as a string.) Returning strings or nested junk from a node causes failures that look unrelated at first.
```python
# Broken: a node that returns a bare string
def router(state):
    return "tools"

# Fixed: nodes return partial state updates ("next" must be a key in State)
def router(state):
    return {"next": "tools"}
```
You'll often see errors like:

- `langgraph.errors.InvalidUpdateError`
- `Expected dict at path ...`
- `Invalid concurrent update`
### 4) Wrong edge routing after a tool call
If your conditional edge doesn’t route back to the agent after tools run, the graph can stall or recurse incorrectly.
```python
# Broken routing: the run ends before the agent sees the tool result
graph.add_conditional_edges("agent", should_use_tool, {"tools": "tools"})
graph.add_edge("tools", END)  # ends too early

# Better: give the router an END option and loop tools back to the agent
graph.add_conditional_edges("agent", should_use_tool, {"tools": "tools", END: END})
graph.add_edge("tools", "agent")
```
That loop back matters. The assistant needs to see the tool result before it generates the final answer.
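A minimal sketch of the `should_use_tool` router assumed in the snippet above (the node name "tools" comes from that snippet; adapt both to your graph):

```python
from langgraph.graph import END

def should_use_tool(state):
    # send the run to the tool node only when the last AI message requested one
    last = state["messages"][-1]
    return "tools" if getattr(last, "tool_calls", None) else END
```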
## How to Debug It
- **Inspect the last AI message** (see the helper sketch after this list):
  - Print `state["messages"][-1]`.
  - Confirm it is an `AIMessage`, not a string or dict.
  - Check whether `.tool_calls` exists and has valid IDs.
- **Verify your reducer:**
  - If you're managing messages manually, switch to `messages: Annotated[list, add_messages]`.
  - Without this, concurrent updates can overwrite each other.
- **Log every node output:**
  - Each node should return a dict.
  - Log keys and types: `print(type(output), output.keys())`.
  - Look for accidental returns like `"done"` or `[message]`.
- **Run with one thread first:**
  - Disable parallel branches and test a single session.
  - If it works serially but fails under load, suspect shared state or checkpoint collisions.
  - Make sure every run has a unique session/thread ID.
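Here is a small inspection helper for the first step. It's a sketch: `debug_last_message` is a hypothetical name, and it only assumes the message classes from `langchain_core`.

```python
from langchain_core.messages import AIMessage

def debug_last_message(state):
    # confirm the last entry is a real AIMessage and show its tool calls
    last = state["messages"][-1]
    print(type(last).__name__, repr(getattr(last, "content", None))[:80])
    if isinstance(last, AIMessage):
        for tc in last.tool_calls or []:
            print("tool call:", tc["name"], tc["id"], tc["args"])
```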
## Prevention
- **Use LangGraph's message reducer from day one:**
  - `messages: Annotated[list, add_messages]`
  - Don't hand-roll message merging unless you really need to.
- **Keep tools strict:**
  - Use typed args schemas.
  - Prefer strings for external IDs unless you control formatting end-to-end.
- **Treat graph nodes as pure functions** (a sketch follows this list):
  - Input state in.
  - Valid partial state out.
  - No mutation of globals, no hidden side effects.
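A minimal illustration of that last point. `count_node` is a hypothetical node added for this example; it reuses the `State` schema from the fixed code above.

```python
from langchain_core.messages import AIMessage

def count_node(state: State):
    # pure: the output is derived only from the input state
    note = f"History currently holds {len(state['messages'])} messages."
    return {"messages": [AIMessage(content=note)]}  # no globals touched
```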
If you're seeing this error only after scaling up workers or traffic, start by checking message integrity. In LangGraph, most "scaling" failures are really state-contract failures that concurrency made visible.
## Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.