Error Recovery


SHIPIT Agent handles failures gracefully at every level: LLM provider errors are retried, while tool execution failures and hallucinated tool names produce recoverable error messages instead of crashing the agent run.

How error recovery works

When a tool fails after exhausting retries, the runtime produces an error ToolResult message and sends it back to the LLM. The LLM sees the error and can decide to try a different tool, adjust its approach, or report the issue to the user.

text
LLM: "Call web_search with query='latest news'"
    ↓
web_search raises ConnectionError
    ↓
retry 1 → still fails
    ↓
Runtime creates error message:
"Error running tool 'web_search': connection refused"
    ↓
LLM sees error, decides to try open_url instead
    ↓
Agent continues running

Hallucinated tool names follow the same pattern: every tool call gets a paired result message, whether success or error, which keeps the conversation balanced for all providers (especially Bedrock).
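The recovery flow described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not SHIPIT Agent's actual runtime: the ToolResult dataclass, the run_tool_with_recovery function, and the max_retries and retry_on parameters are hypothetical stand-ins invented for this example.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class ToolResult:
    """Hypothetical stand-in for the result message returned to the LLM."""
    tool_name: str
    content: str
    is_error: bool = False


def run_tool_with_recovery(
    name: str,
    tools: dict[str, Callable[..., str]],
    args: dict,
    max_retries: int = 1,
    retry_on: tuple = (ConnectionError, TimeoutError, OSError),
) -> ToolResult:
    """Always return a ToolResult so every tool call has a paired result."""
    tool = tools.get(name)
    if tool is None:
        # Hallucinated tool name: report it as a message instead of raising.
        return ToolResult(name, f"Error: unknown tool '{name}'", is_error=True)

    attempts = max_retries + 1  # the first call plus the allowed retries
    for attempt in range(1, attempts + 1):
        try:
            return ToolResult(name, tool(**args))
        except retry_on as exc:
            # Transient error: retry until the attempts are used up.
            if attempt == attempts:
                return ToolResult(
                    name, f"Error running tool '{name}': {exc}", is_error=True
                )
        except Exception as exc:
            # Non-retryable error: fail at once, but still as a message.
            return ToolResult(
                name, f"Error running tool '{name}': {exc}", is_error=True
            )
    raise AssertionError("unreachable when max_retries >= 0")


def flaky_search(query: str) -> str:
    """Simulated tool that always fails with a network error."""
    raise ConnectionError("connection refused")


result = run_tool_with_recovery(
    "web_search", {"web_search": flaky_search}, {"query": "latest news"}
)
print(result.content)
```

Because the function never raises for tool errors, the caller can append the returned message to the conversation unconditionally, which is what keeps tool calls and results paired.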

Retry policy

python
from shipit_agent import Agent, RetryPolicy

agent = Agent(
    llm=llm,
    retry_policy=RetryPolicy(
        max_llm_retries=2,           # retry LLM calls up to 2 times
        max_tool_retries=1,          # retry tool calls up to 1 time
        retry_on_exceptions=(        # only retry these exception types
            ConnectionError,
            TimeoutError,
            OSError,
        ),
    ),
)

Default exceptions

The default retry_on_exceptions is (ConnectionError, TimeoutError, OSError) — network and I/O errors that are typically transient. This is intentionally narrow:

| Exception type | Retried by default | Why |
| --- | --- | --- |
| ConnectionError | Yes | Network hiccup, retry likely succeeds |
| TimeoutError | Yes | Server slow, retry may succeed |
| OSError | Yes | I/O issue, often transient |
| RuntimeError | No | Usually a bug, retrying won't help |
| ValueError | No | Bad data, same input = same error |
| TypeError | No | Code bug, fix the code |
| KeyError | No | Missing data, not transient |
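One detail is worth knowing when tuning this tuple: in Python 3, ConnectionError and TimeoutError are themselves subclasses of OSError, so a tuple containing OSError also matches other subclasses such as FileNotFoundError. The check below uses only built-in exceptions. The is_retryable helper is hypothetical and not part of shipit_agent; it assumes the policy matches exceptions the way an except clause or isinstance does.

```python
DEFAULT_RETRY_ON = (ConnectionError, TimeoutError, OSError)


def is_retryable(exc, retry_on=DEFAULT_RETRY_ON):
    """Hypothetical helper: match an exception against a tuple of types."""
    return isinstance(exc, retry_on)


print(issubclass(ConnectionError, OSError))            # True
print(issubclass(TimeoutError, OSError))               # True
print(is_retryable(FileNotFoundError("missing.txt")))  # True
print(is_retryable(ValueError("bad input")))           # False
```

If retrying on missing files or permission errors is not wanted, a narrower tuple such as (ConnectionError, TimeoutError) may be a better fit.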

To retry on additional exceptions:

python
RetryPolicy(
    retry_on_exceptions=(ConnectionError, TimeoutError, OSError, RuntimeError),
)

Events emitted during failures

| Event | When | Key payload |
| --- | --- | --- |
| tool_retry | Tool failed, retrying | attempt, error, iteration |
| tool_failed | Tool failed permanently (or hallucinated name) | error, iteration |
| llm_retry | LLM call failed, retrying | attempt, error |

Before vs. after

| Scenario | Before (v1.0.0) | After (v1.0.2) |
| --- | --- | --- |
| Tool raises after retries | Agent crashes, caller gets exception | Error message sent to LLM, agent continues |
| Hallucinated tool name | Error message sent to LLM | Error message sent to LLM (unchanged) |
| LLM provider error | Retried, then crashes | Retried, then crashes (unchanged) |

Breaking change from 1.0.0

If you were catching tool exceptions from agent.run(), note that tool failures no longer propagate as exceptions. The agent will continue running and include the error in its response. Check result.events for tool_failed events if you need to detect failures programmatically.
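A programmatic check might look like the sketch below. The shape of the event objects is not specified here, so this example assumes each event exposes a type string and a payload dict. The Event and AgentResult classes are stand-ins defined only so the snippet runs on its own, and tool_failures is a hypothetical helper; adapt the attribute names to what result.events actually contains.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """Stand-in event; the real attribute names may differ."""
    type: str
    payload: dict = field(default_factory=dict)


@dataclass
class AgentResult:
    """Stand-in for the object returned by agent.run()."""
    output: str
    events: list = field(default_factory=list)


def tool_failures(result):
    """Return the payloads of all tool_failed events in a run."""
    return [e.payload for e in result.events if e.type == "tool_failed"]


# Simulated run: one retry followed by a permanent tool failure.
result = AgentResult(
    output="I could not reach the search service.",
    events=[
        Event("tool_retry", {"attempt": 1, "error": "connection refused", "iteration": 1}),
        Event("tool_failed", {"error": "connection refused", "iteration": 1}),
    ],
)

failures = tool_failures(result)
if failures:
    print(f"{len(failures)} tool failure(s): {failures[0]['error']}")
```

Callers that previously relied on an exception could raise their own error inside the final branch to preserve the old control flow.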