Overview
A clean, powerful Python agent library with tools, MCP, streaming events, reasoning capture, and runtime policies.
Reasoning visibility
Stream live thinking blocks, tool calls, and outputs without a custom runtime.
Provider flexibility
Swap LLM providers with one config change while keeping your tools intact.
Production guardrails
Built-in retry policies, error recovery, hooks, and parallel execution.
MCP and tools
Mix Python tools, remote MCP servers, and connector-style integrations.
Start building powerful agents
The Shipit framework provides everything you need to build, deploy, and scale production-ready AI agents.
v1.0.8 — Five flagship features that genuinely beat LangChain
- Structured output with auto-retry —
same-conversation validation retry (LangChain's
OutputFixingParserruns a separate LLM call). Streaming partial JSON. Pydantic + JSON Schema, both work. - Verifier network — a second cheap LLM vetoes hallucinated tool calls and detects stalling. Process supervision in one constructor argument.
- Episodic memory consolidation — distill conversations into durable facts; forgetting curve + core memory promotion. ChatGPT-style memory, principled, self-hostable.
- Time-travel replay — load any saved trace, fork from any event, edit the prompt, resume on a fresh agent. Replay.io for AI agents.
- ComputerUseAgent — drive a browser by showing screenshots to a vision-capable LLM. Anthropic native + plain-text fallback for any vision LLM. Self-hosted Devin/Operator.
1508 tests passing (+318 new), zero regressions. All five
features available on both Agent and DeepAgent from a single
from shipit_agent import ….
v1.0.7 — Agents for every role
12 new tools (GitHub, GitLab, SQL, Vision, PDF, Figma, Salesforce, Stripe, Sheets, Zendesk, LinkedIn read-only) + 7 persona specialists
- LangSmith / OpenTelemetry trace exporters.
v1.0.6 — Bulletproof 24h Autopilot
Long-running, budget-gated, checkpointed runs that survive crashes. Fan-out parallelism, reflection critic, artifact collection.
SHIPIT Agent is a standalone Python agent library focused on a clean, production-grade runtime:
- Bring your own LLM — or use any of nine built-in provider adapters (OpenAI, Anthropic, Bedrock, Gemini, Vertex, Groq, Together, Ollama, LiteLLM)
- Tools, MCP, connectors — Python tools, remote MCP servers, 18 SaaS connectors (Gmail, Drive, Slack, GitHub, GitLab, Linear, Jira, Notion, Confluence, Salesforce, Stripe, Figma, Sheets, Zendesk, LinkedIn, PDF, SQL, custom HTTP APIs)
- Five flagship power features (v1.0.8) — structured output with auto-retry, verifier network, memory consolidation, time-travel replay, computer-use agent
- Skills marketplace — packaged or custom skills steer agent behavior and reusable workflows
- Streaming events — including reasoning / thinking blocks as they happen
- Tracing — file-based, OpenTelemetry, or LangSmith out of the box
- Cost tracking + budgets — every LLM call accounted; hard caps per run
- Autopilot — 24-hour budget-gated runs that survive crashes via JSON checkpoints
- Multi-agent — Supervisor/Worker pattern, AgentTeam orchestration, sub-agent delegation
- DeepAgent — power-user wrapper with verify / reflect / goal / RAG layers
Built for developers building real production agents — agents that beat ChatGPT and Operator on observability, cost control, self-hosting, and breadth of integration. Library-level API the whole way down.
Install
pip install shipit-agentWith optional extras:
pip install 'shipit-agent[openai]' # OpenAI SDK
pip install 'shipit-agent[anthropic]' # Anthropic SDK (native thinking blocks)
pip install 'shipit-agent[litellm]' # LiteLLM (Bedrock, Gemini, Groq, Together, …)
pip install 'shipit-agent[playwright]' # In-process browser for open_url and web_search
pip install 'shipit-agent[all]' # Everything30-second example
from shipit_agent import Agent
from shipit_agent.llms import OpenAIChatLLM
agent = Agent.with_builtins(llm=OpenAIChatLLM(model="gpt-4o-mini"))
for event in agent.stream("Search the web for today's Bitcoin price in USD."):
print(event.type, event.message)Works with any provider — swap
OpenAIChatLLMforAnthropicChatLLM,BedrockChatLLM,VertexAIChatLLM,GeminiChatLLM,GroqChatLLM,TogetherChatLLM,OllamaChatLLM, orLiteLLMChatLLM, or usebuild_llm_from_env()to pick from.env. The agent code never changes. See LLM providers — use any model.
Emits events like:
run_started Agent run started
step_started LLM completion started
reasoning_started 🧠 Model reasoning started
reasoning_completed 🧠 Model reasoning completed
tool_called Tool called: web_search
tool_completed Tool completed: web_search
run_completed Agent run completedPower features (v1.0.8)
Five flagship capabilities, all reachable from from shipit_agent import …,
all available on both Agent AND DeepAgent.
Structured output with auto-retry
Pass a Pydantic model or JSON Schema to agent.run(output_schema=…).
Get a typed result.parsed back. On parse failure, the runtime retries
inside the same conversation — no separate fixing LLM. Streams partial
JSON as tokens arrive.
Verifier network — process supervision
A second cheap LLM vetoes hallucinated tool calls before they fire, and detects when the agent is stalling. Both checks fail open (verifier failures never block the agent). One constructor arg.
Episodic memory consolidation
Distill conversations into 3-8 durable facts. Forgetting curve so old facts decay. Frequently-retrieved facts promote to a "core memory" set visible in the system prompt. ChatGPT-style memory, principled.
Time-travel replay
Load any saved trace, fork from any event, edit the prompt, resume on a fresh agent. Side-by-side diff of two runs. Replay.io for AI agents — open, library-level, no SaaS required.
Browser automation (ComputerUseAgent)
Drive a real browser by showing screenshots to a vision-capable LLM.
Anthropic native + plain-text fallback for any vision LLM. Use it
standalone or plug it into your main Agent as a browser_use tool.
+337 unit tests · 1527 passing · 0 regressions — every feature is reachable from a single
from shipit_agent import …and works on bothAgentandDeepAgent.
Why SHIPIT Agent
Live reasoning events
Extended thinking blocks from o1/o3/gpt-5/Claude/gpt-oss are automatically extracted and streamed as reasoning_started / reasoning_completed events. Your UI can show a live "Thinking" panel for free.
Truly incremental streaming
agent.stream() runs the agent on a background thread and yields events through a queue as they happen. Works in Jupyter, VS Code, WebSocket, SSE, and terminals.
Bulletproof Bedrock tool pairing
Every toolUse gets a paired toolResult. Planner output is injected as user context, not orphan tool-results. Hallucinated tool names get synthetic error results. Multi-iteration Bedrock loops just work.
Semantic tool discovery
tool_search lets the agent ask "which tool should I use for X?" and get a ranked shortlist. No more 28-tool context bloat, no more tool hallucinations.
Zero-friction provider switching
Edit one line in .env — SHIPIT_LLM_PROVIDER=openai — and build_llm_from_env() does the rest. Seven providers supported out of the box.
Playwright-powered open_url
In-process Chromium fetches JS-rendered pages with a realistic UA, handles anti-bot 503s, and falls back to stdlib urllib if Playwright isn't installed. No external scraper services.
Parallel tool execution
When the LLM returns multiple tool calls, run them concurrently with parallel_tool_execution=True. Results stay in order. Typically 2-3x faster for multi-tool turns.
Hooks & middleware
AgentHooks with @on_before_llm, @on_after_llm, @on_before_tool, @on_after_tool for cost tracking, rate limiting, content filtering, and guardrails. No subclassing.
Async runtime
AsyncAgentRuntime with async run() and async stream() for FastAPI, Starlette, and modern async Python. Same features as the sync runtime.
Graceful error recovery
Tool failures produce error messages instead of crashing the run. The LLM sees the error and can try a different approach. Safer retry defaults prevent retrying on bugs.
Next steps
- Install and run the quick start — get an agent running in five minutes
- Explore streaming events — understand the 14 event types and what they carry
- Reasoning and thinking steps — render a live "Thinking" panel in your UI
- Create a custom tool — build a new tool from scratch
- Use skills — packaged skills, custom skills, Agent, and DeepAgent workflows
- MCP integration — attach remote MCP servers to extend capabilities
- Parallel tool execution — speed up multi-tool turns
- Hooks & middleware — add cost tracking, logging, and guardrails
- Async runtime — use with FastAPI and async Python
- Context window management — track tokens and manage context limits
- Error recovery — graceful failure handling and retry policies
Try it now — runnable examples
The repo ships with 7 numbered, copy-pasteable examples covering every major feature. Pick one and run it in 30 seconds.
| # | What | Run |
|---|---|---|
| 1 | Hello, agent. The shortest possible runnable example | python examples/01_hello_agent.py |
| 2 | Live streaming with colored reasoning events | python examples/02_streaming_with_reasoning.py |
| 3 | Same agent, 5 different LLM providers back-to-back | python examples/03_provider_swap.py |
| 4 | End-to-end research workflow with web search + URL fetching | python examples/04_research_agent.py "your question" |
| 5 | Custom tools — function-style and class-style | python examples/05_custom_tool.py |
| 6 | Persistent chat session with file-backed memory | python examples/06_chat_session.py |
| 7 | Semantic tool discovery with tool_search | python examples/07_tool_search.py |
See the full examples README →
Provider compatibility matrix
| Provider | Reasoning blocks | Tool calling | Streaming | Bedrock pairing | Built-in tools |
|---|---|---|---|---|---|
OpenAI (o1, o3, o4, gpt-5) | ✅ Native | ✅ | ✅ | n/a | ✅ |
OpenAI (gpt-4o, gpt-4o-mini) | ❌ | ✅ | ✅ | n/a | ✅ |
Anthropic (claude-opus-4, claude-3.7) | ✅ Native (with thinking_budget_tokens) | ✅ | ✅ | n/a | ✅ |
AWS Bedrock (gpt-oss-120b) | ✅ Via LiteLLM | ✅ | ✅ | ✅ Bulletproof | ✅ |
AWS Bedrock (anthropic.claude-*) | ✅ Via LiteLLM | ✅ | ✅ | ✅ Bulletproof | ✅ |
Google Gemini (gemini-1.5-pro) | ❌ | ✅ | ✅ | n/a | ✅ |
| Google Vertex AI | ❌ | ✅ | ✅ | n/a | ✅ |
Groq (llama-3.3-70b) | ❌ | ✅ | ✅ | n/a | ✅ |
| Together AI | ❌ | ✅ | ✅ | n/a | ✅ |
| Ollama (local) | ❌ | ✅ | ✅ | n/a | ✅ |
| DeepSeek R1 (via LiteLLM proxy) | ✅ Native | ✅ | ✅ | n/a | ✅ |
| LiteLLM Proxy (self-hosted gateway) | ✅ Pass-through | ✅ | ✅ | n/a | ✅ |
Tip: if you want a "Thinking" panel UI without paying for o1/Claude, AWS Bedrock's
openai.gpt-oss-120b-1:0is the cheapest reasoning-capable model in the matrix and ships withAgent.with_builtins(llm=BedrockChatLLM())out of the box.
What you get vs. what you don't
| ✅ shipit-agent does | ❌ shipit-agent does NOT do |
|---|---|
| Run agents with tools, MCP, memory, sessions | Train models or fine-tune |
| Stream events incrementally as they happen | Provide a hosted control plane |
| Extract reasoning blocks from any provider | Replace LangChain / LangGraph / CrewAI wholesale |
| Guarantee Bedrock tool-pairing correctness | Manage your cloud infrastructure |
| Support 9 LLM providers via one API | Lock you into a specific vendor |
| Ship with 28+ built-in tools | Force you to use any of them |
| Stay out of your way (small, focused runtime) | Hide the agent loop behind abstractions |
This is a library, not a framework. The runtime is small enough to read in one sitting (shipit_agent/runtime.py is under 400 lines). Bring your own LLM, tools, and storage; the runtime composes them and gets out of the way.