Changelog
v1.0.12 — 2026-06-07
Claude API power — plus cross-provider prompt-cache accounting. v1.0.12 adds the Anthropic API's highest-leverage server features as first-class passthroughs — server-side tools, document citations, the Batch API, and interleaved thinking + server-side context editing — on top of v1.0.11's control plane. It also makes prompt caching honestly cross-provider: OpenAI's automatic cache reads are now surfaced for cost tracking. Each feature is honest about provider support; no public API was removed.
Server-side tools (Anthropic-hosted)
web_search(),code_execution(),computer_use(),bash(),text_editor()fromshipit_agent.llms— declare them in thetools=list you pass toAnthropicChatLLM.complete(...)(mixed freely with client-side tools). They run inside Anthropic's own sandbox — zero local infrastructure, no client-side tool loop.- Beta headers handled automatically —
code_executionandcomputer_useattach their betas and route to the beta endpoint;web_searchis GA and stays on the GA endpoint. - Surfaces in metadata —
LLMResponse.metadata["server_tool_use"]and["server_tool_results"], only when present. - Provider note — these are Anthropic API shapes (also reachable for Anthropic models via Bedrock / LiteLLM); other providers use their own native server tools. See Agent → Server-side tools.
Citations & the Batch API
- Citation document helpers —
text_document/pdf_document/url_pdf_document/content_documentfromshipit_agent.llms, withcitations.enabledon by default. Claude grounds its answer in the document and the cited spans are parsed intometadata["citations"]— verifiable RAG. - Batch API runtime —
BatchRequest,BatchResult, andBatchRuntime.run(...)(inshipit_agent.batch) wrap Anthropic's Messages Batches API for bulk, latency-tolerant runs at roughly 50% of standard per-token price.submit/status/results/cancelare exposed too. - Provider note — Anthropic citations and Anthropic batches today; OpenAI also has a Batch API and generalising the runtime is on the roadmap. See Agent → Citations & Batch API.
Interleaved thinking & context editing
AnthropicChatLLM(interleaved_thinking=True, thinking_budget_tokens=…)— the model thinks between tool calls. Theinterleaved-thinking-2025-05-14beta attaches only when both are set;metadata["thinking_blocks"]carries the signed thinking blocks for round-tripping.context_management=— forwarded as Anthropic'scontext_managementrequest param (with its beta header) so the API clears stale tool results server-side.- Provider note — extended / interleaved thinking is Anthropic; OpenAI reasoning models and Gemini thinking are the equivalents elsewhere (reasoning content is captured for all of them). See Agent → Interleaved thinking & context editing.
Cross-provider prompt caching
- OpenAI cached-token surfacing — OpenAI does automatic prompt caching;
shipit now reads
usage.prompt_tokens_details.cached_tokensintousage["cache_read_input_tokens"]— the same key theCostTrackerreads for Anthropic — so OpenAI cache reads bill at the cheaper rate.LiteLLMChatLLMforwards both shapes. Anthropic / Bedrock / Vertex keep explicitcache_controlbreakpoints (default on for Claude). Caching is cross-provider, not Anthropic-only. See Agent → Prompt caching.
v1.0.11 — 2026-06-07
A control plane for tool calls — plus prompt caching and a memory tool. v1.0.11 brings the Claude Code permission layer to the library: declarative allow/deny/ask rules, read-only plan mode, human-in-the-loop callbacks, and blocking/rewriting hooks. It also turns on Anthropic prompt caching by default for Claude-family models and adds an Anthropic-style memory tool. Folded in is the 1.0.10 bug-fix & hardening work. No public API was removed.
Control plane — permissions, plan mode & hooks
Agent(permission_mode=...)—"default","acceptEdits","plan", or"bypass", mirroring Claude Code.planmakes a run read-only;acceptEditsauto-approves file edits.PermissionEngine(allow=[...], deny=[...], ask=[...])—fnmatchglobs on tool name with a predictable precedence: deny > mode > allow > ask > callback > default. Pass it aspermissions=(also accepts a bare mode string or a kwargs dict).agent.plan(prompt)— one-call read-only planning: the agent may use read-only tools and writes a step-by-step plan instead of acting.permission_callback(name, args) -> PermissionResult | None— programmatic human-in-the-loop approval, consulted onaskrules and as a catch-all.- Blocking & rewriting hooks —
@hooks.on_before_toolmay return{"decision": "deny"}to block a call or aPermissionResultwithupdated_argumentsto rewrite it;@hooks.on_user_promptcan redact the incoming prompt. Hooks remain observe-only when they returnNone. - Denied calls are visible — a blocked tool emits a
tool_deniedevent and feeds the model awas NOT runtool message so it can recover. - New top-level exports:
PermissionEngine,PermissionResult,PermissionDecision. See Agent → Permissions, plan mode & blocking hooks.
Prompt caching
- On by default for Claude-family models.
AnthropicChatLLM(prompt_caching=True)andLiteLLMChatLLM(prompt_caching=True)cache the stable prefix (system prompt + tools) on Anthropic, Bedrock, and Vertex viacache_controlbreakpoints. - Cost-aware —
usage["cache_read_input_tokens"]andusage["cache_creation_input_tokens"]flow intoCostTracker; cache reads bill at roughly 10% of input. See Agent → Prompt caching.
Claude-style memory tool
ClaudeMemoryTool— Anthropic'smemory_20250818tool shape: a single command-driven tool (view/create/str_replace/insert/delete/rename) over a sandboxed memory directory (.shipit_workspace/memoriesby default) for cross-session learning. Attach viaAgent(tools=[ClaudeMemoryTool(...)]). See Tools → Claude-style memory tool.
Hardening (folded in from 1.0.10)
text_delta_callbackregression (v1.0.9) fixed — the runtime now inspects the adapter signature and only passes the callback to adapters that accept it, so every custom adapter works unchanged.- Multi-turn sessions no longer stack duplicate system prompts when reusing
a
session_store+session_id. - Tool security — Bash rejects command/process substitution and redirection;
open_urlis http(s)-only and blocks SSRF targets; the SQL read-only guard scans the whole statement and rejects stacked statements; OAuth validates the CSRF state nonce;edit_filerefuses non-UTF-8 files;FileCredentialStorechmods0600and writes atomically. - Reliability — MCP transports close on error; parallel tools get isolated
state; the iteration-cap turn is now accounted for;
CostTrackerflagshas_unknown_pricinginstead of silently billing$0. - 1742 tests passing (+180 new). 0 regressions.
v1.0.9 — 2026-05-14
Inline text streaming + multimodal media references. Two features that make shipit feel live in chat UIs: token-by-token text streaming for real-time typing, and first-class image/file references in prompts.
Inline text streaming
LLM.complete(text_delta_callback=…)— stream assistant text token-by-token as it's generated, instead of waiting for the full response. The callback fires for each incremental text chunk.AgentRuntimeemitstext_deltaevents — drive SSE or WebSocket consumers directly from the event stream for ChatGPT-style live typing in the browser.- Non-streaming behavior preserved — streaming is opt-in per call; omit the
callback and
complete()behaves exactly as before. - Implemented end-to-end for LiteLLM, and no-op-compatible for the other adapters (they return the full text in a single delta), so nothing breaks if a backend can't stream.
Multimodal media references
MediaReference— reference an image or file inside a prompt without inlining bytes; the runtime resolves it at send time.MediaStore— pluggable storage for media, withInMemoryMediaStorefor tests and short-lived runs andFileMediaStorefor on-disk persistence.extract_media_refs+build_multimodal_message— pull media references out of a prompt and assemble the provider-native multimodal message payload.- See Agent → Multimodal chat.
v1.0.8 — 2026-05-09
Structured output overhaul + verifier network. Two flagship features that genuinely beat LangChain on the surfaces it tries hardest at — and ship more broadly applicable wins than v1.0.7's connector explosion.
Structured output — same-conversation validation retry
Agent.run(prompt, output_schema=MyModel, max_validation_retries=2)— pass a Pydantic model or JSON Schema dict; get back a typedresult.parsed.- Auto-retry on validation failure inside the same conversation — when the
first parse fails, the runtime appends the bad assistant turn + a corrective
user turn ("that response could not be parsed: …") and retries. No separate
"fixing LLM" call (LangChain's
OutputFixingParserrequires one). - Streaming partial JSON parser —
parse_partial_json('{"a": "hel')returns{"a": "hel"}.StructuredOutput.stream(prompt)yields progressively richer dicts as tokens arrive, then a final validated typed object. StructuredOutput— standalone wrapper for one-shot extraction without the agent loop; same retry path, exposed publicly.- New top-level exports:
StructuredOutput,StructuredOutputResult,parse_partial_json. Newresult.parsedandresult.outputcorrected text semantics onAgent.run. - See Agent → Structured output.
Verifier network — process supervision
Agent(verifier=VerifierNetwork(llm=cheap_llm))— a second cheap LLM vetoes hallucinated tool calls before they fire and rates progress between iterations. Both checks fail open (verifier failures never block the agent).- Pre-tool veto — wraps every tool. Verifier returns
allow | veto | rewrite; vetoed calls become synthetic error tool-results so the agent re-plans without the bad action having actually run. - Progress check — after each iteration, scores progress 0-1. When the
score stays below
progress_thresholdforprogress_windowconsecutive iterations,maybe_nudge()returns a "you're stalling" message you can inject as a user turn. - Confidence-gated — verdicts below
veto_min_confidenceget downgraded toALLOW(avoid over-blocking on uncertain calls). - Hard caps —
max_pretool_calls_per_run,max_progress_calls_per_runso the verifier itself can't run away on cost. - Telemetry —
verifier.statsexposes per-run counters: vetoes, rewrites, nudges, score history. - LangGraph's
ToolNodehas no per-call gating. LangChain'sRunnableWithMessageHistoryhas no progress detector. Process supervision in shipit is one constructor argument. - See Agent → Verifier network.
Episodic memory consolidation
MemoryConsolidator(llm=cheap_llm).consolidate(memory=..., recent_messages=...)— LLM distills the last conversation into 3-8 durable facts and writes them toSemanticMemory. Categories (preference,project,goal,person,other) are tracked for filtering.- Forgetting curve —
consolidator.decay(knowledge, half_life_days=14)applies exponential decay to fact strength and prunes facts belowforgetting_threshold. Pure local arithmetic; no LLM call. - Core memory —
consolidator.core_memory(knowledge, top_k=5)returns the top-K most-load-bearing facts ranked bystrength + 0.1·log1p(retrievals). Inject into the system prompt every turn for ChatGPT-style "remembers things across sessions". - Retrieval bumping —
consolidator.record_retrieval(knowledge, [fact_texts])increments retrieval counts. Frequently-retrieved facts rise to core memory automatically. - New top-level exports:
MemoryConsolidator,DistilledFact,ConsolidationResult. - ChatGPT's Memories feature is
add_fact(text)with no decay, no retrieval-based promotion. Ours is principled and self-hostable. - See Agent → Episodic memory consolidation.
Time-travel replay
TraceReplayer.from_store(store, trace_id)— load any saved trace and walk events programmatically. Three constructors:from_record,from_store,from_file.replayer.fork(at_event=N, edit_user_message='...')— capture the conversation state at any event, optionally with a tweaked user prompt. Returns aReplayCheckpoint.checkpoint.continue_from(agent=fresh_agent)— resume the run on a freshAgent, withagent.historypre-filled. Forwards arbitraryAgent.runkwargs (e.g.output_schema=).diff_traces(left, right)— side-by-side comparison. Reports matched events, divergence point, type mismatches, and only-in-left / only-in-right tails..to_lines()for human-readable rendering.- New top-level exports:
TraceReplayer,ReplayCheckpoint,ReplayResult,ForkPoint,TraceDiff,diff_traces. - LangSmith's Playground is SaaS-only. Inngest's branching is SaaS-only.
Ours is library-level, open-source, and works against your existing
FileTraceStore. - See Agent → Time-travel replay.
ComputerUseAgent — browser automation
ComputerUseAgent(llm=, browser=, goal=)— screenshot → reason → act loop. Show a screenshot to a vision-capable LLM, parse a structured action back, execute, repeat until DONE.PlaywrightBrowserSession.launch(headless=True)— production driver. Context-manager support;pip install playwright && playwright install chromiumto enable.MockBrowserSession— deterministic test double that records every call. Unit-test computer-use logic without spawning a browser.- Two action emit shapes — Anthropic's native
computer-usetool (structuredtool_useblock) AND plain-text fallback (ACTION: click 100,200) for any vision LLM. parse_action(raw)— pure parser, no IO. Handles both shapes plus prose-wrapped responses.- Recovery — when an action raises, the agent surfaces the error back to the model as a user message. Production-ready resilience without extra code.
- New top-level exports:
ComputerUseAgent,BrowserSession,MockBrowserSession,PlaywrightBrowserSession,ComputerAction,ComputerUseResult,ActionKind,ActionRecord,parse_action. - Devin / Multi-On / OpenAI Operator are SaaS products. Ours is a library — self-host, plug into your own loop, fork the implementation.
- See Agent → ComputerUseAgent.
Tests + docs
- +318 unit tests (1190 → 1508), zero regressions, all old tests still pass.
- Five new notebooks —
54_structured_output_with_retry.ipynb,55_verifier_network.ipynb,56_episodic_memory_consolidation.ipynb,57_time_travel_replay.ipynb,58_computer_use_agent.ipynb. - Five new docs pages with full API reference, configuration deep dives, cost analysis, real-life examples, and beat-LangChain / Operator / ChatGPT comparison tables.
v1.0.7 — 2026-04-24
Agents for every role. Twelve new tools, nine new persona specialists, seven persona walk-through notebooks. shipit-agent is no longer only a developer-agent framework — it ships agents for developers, designers, sales reps, PMs, data analysts, finance, customer support, and recruiters. 1190 unit tests, 286 new in this release, zero regressions.
See RELEASE_NOTES_1.0.7.md for the full breakdown.
v1.0.6 — 2026-04-21
Autopilot — the long-running runtime. Plus 7 new role specialists, 3 new tools, and 8 new notebooks. Autopilot turns any agent into a budget-gated, checkpointed, streaming worker that runs until every success criterion is met. Fan-out dispatches N children in parallel. A reflection critic short-circuits the loop once a confident reviewer confirms the goal. Artifacts capture code blocks, markdown docs, and tool outputs as structured deliverables. A scheduler daemon drains a persistent JSON queue for 24-hour operation. 8 new Bedrock-Llama notebooks. 805 total tests. All passing.
Autopilot — long-running runtime
Autopilot(llm, goal, budget, …)— composesGoalAgentwith budget gates, atomic checkpoints, heartbeats, and a live event stream.BudgetPolicy(max_seconds, max_tool_calls, max_tokens, max_dollars, max_iterations)— every axis independently honored; set any toNone/0to disable.- Goal-satisfaction termination, not step count. Stops the moment every criterion is verified OR any budget trips.
- Atomic JSON checkpoints per iteration —
~/.shipit_agent/checkpoints/<run_id>.json. Crash →autopilot.resume(run_id)picks up at the next iteration. autopilot.stream(run_id)— iterator of{kind, ...}events:autopilot.run_started,autopilot.iteration,autopilot.heartbeat,autopilot.criteria_satisfied,autopilot.budget_exceeded,autopilot.result.default_heartbeat_stderr— drop-in sink. Custom callables (Slack / Datadog / webhook) just as easy.
Reflection critic
Critic(llm=..., confidence_threshold=0.75)— scores every iteration's output against the goal's criteria and feeds suggestions into the next iteration's prompt.critic=Trueon Autopilot to use your run's LLM as a self-check; pass aCritic(llm=reviewer_llm)for a dedicated stronger reviewer.- Confidence-gated termination — flag-flips only count when the critic meets the confidence gate. Low-confidence "yes" still logs feedback but does not halt.
- JSON-tolerant parsing — handles fenced
```json, extra prose, padding/trimming of criteria, and garbage input without raising. - New event kind
autopilot.criticon every iteration's stream.
Artifacts — structured deliverables
ArtifactCollector— collectsArtifact(kind, name, content, language, iteration)during the run.- Auto-extraction from every iteration's output — fenced code blocks (
kind="code", with language) and top-level markdown docs (kind="markdown"). - Tool-metadata ingestion — tools that declare
{"artifact": True, "kind": ..., "name": ..., "content": ...}in their result metadata are captured explicitly. - Optional disk persistence — one JSON file per artifact, handy for CI build outputs.
- New event kind
autopilot.artifact; finalAutopilotResult.artifactscarries the full list.
Parallel fan-out
autopilot.fanout(items, objective_template, criteria_template, max_parallel, child_budget_frac)— ThreadPoolExecutor-backed N-way parallelism.- Per-child budget scaling — each child inherits
parent_budget * child_budget_frac(default 20%). Keeps aggregate spend bounded on 50-item batches. autopilot.fanout_stream()— live per-child events (autopilot.fanout_child) for dashboard rendering.FanoutResult(children, aggregated_output, wall_seconds, failed)— rolled-up status (completed|partial|failed), ordered children, default markdown digest or customaggregator.
Scheduler daemon
SchedulerDaemon(llm_factory, queue_path, tick_seconds)— persistent JSON goal queue at~/.shipit_agent/autopilot-queue.json.enqueue(),list_queue(),remove(),run_once(),run_forever(). Stateless daemon; crash-safe.- Heartbeat events on idle so you can wire Slack / Datadog telemetry.
- CLI:
shipit autopilot,shipit daemon,shipit queue {add,list,remove}. Systemd / launchd / Docker recipes in the docs.
7 new role specialists
- Engineering —
generalist-developer,debugger - Design —
design-reviewer - Product —
product-manager - Go-to-market —
sales-outreach,customer-success,marketing-writer - Auto-applied to
agents.jsonon import — 40 → 47 specialists total.
3 new power tools
computer_use— drive the local desktop (screenshots, click, type, drag, key chords). Platform backends for macOS (cliclick/osascript), Linux (xdotool/scrot), Windows (PowerShell). Graceful install hints when a dep is missing.hubspot_ops— HubSpot CRM v3 REST wrapper. Search / get / create contacts, companies, deals; attach notes. Auth viaHUBSPOT_TOKENenv.research_brief— one-call research primitive. Web search + top-page skim + numbered citations. No API key (DuckDuckGo HTML). Optionaldeep=Truefetches each source page for richer summaries.
Notebooks — 37 through 44
- 37 — Autopilot quickstart.
- 38 — Live streaming (
autopilot.stream(),render_stream, custom heartbeats). - 39 — Persistence, resume, scheduler daemon.
- 40 — Developer / Debugger / Researcher specialists.
- 41 — Design / PM / Sales / CS / Marketing specialists.
- 42 —
computer_use/hubspot_ops/research_brieftools. - 43 — Fan-out · Critic · Artifacts.
- 44 — The Complete Tour — every feature end-to-end in one notebook.
All notebooks use build_llm_from_env("bedrock") — default is Bedrock Llama 4 Scout, matches the 01–36 series.
Other changes
AutopilotResultgrewartifacts: list[dict]andcritic_verdict: dict. Existing fields unchanged.Autopilotacceptscritic=True | Critic(...)andartifacts=True | ArtifactCollector(...).- Fan-out helpers (
_scale_budget,_slug,_rollup_status) are pure functions — re-use them in your own dispatchers. - 39 new tests (
test_autopilot_artifacts.py,test_autopilot_critic.py,test_autopilot_fanout.py).
Second half — CostRouter, non-blocking ask_user, vision, sandbox, specialists-as-developers
The second half of v1.0.6 adds four more primitives that compose with Autopilot and an overhaul of the specialist tool presets so every role can actually execute code.
CostRouter — tiered LLM routing
shipit_agent.routing.CostRouter— drop-in LLM adapter that classifies each turn aseasy/medium/hardand routes to the cheapest adequate model.- Heuristic classifier (
classify_difficulty) with no extra LLM call — hard-keyword list + length thresholds + code-fence detection, tuned from real agent traces. Passdifficulty_fn=...to swap in your own oracle. Tier(llm, price_per_1k, name)— wrap any shipit_agent LLM; price only drives the report, never the routing decision.SpendReport— tier counts, estimated spend, "would-have-been" spend on hardest tier, andsavings_pct. Populated live as the runtime callscomplete()/stream().force_tier=DifficultyTier.HARDfor audits; fallback to MEDIUM when a classifier raises. Runs never die on classification errors.- Typical savings on 24h runs: 50–70%.
Non-blocking ask_user_async
- New tool
ask_user_asyncpauses an Autopilot run cleanly — does not block the loop. - File-based side channel at
~/.shipit_agent/askuser/<run_id>.json. Atomic rename on every write; crash-safe. Autopilotintegration — on every iteration, if a question is pending on the channel, the run halts with new statusawaiting_user.resume()returns immediately while the channel is still pending; once answered, the loop continues.shipit answer <run_id> "..."— CLI to reply.--index Ntargets a specific question; runningshipit answer <run_id>with no text lists pending + answered history.- Multiple outstanding questions are supported;
write_answertargets the latest by default. SHIPIT_ASKUSER_DIRenv redirects the channel (useful for tests and containerized runs).- Safe against path traversal —
run_idis slugged before becoming a filename.
Vision feedback on computer_use
- Every
screenshotaction now embeds the PNG's base64 bytes +media_typein result metadata, so a vision-capable LLM can actually reason over the captured image instead of just reading a file path. - 4 MB cap — larger PNGs set
vision=False+ avision_skip_reason; no context-window blow-ups. - Opt-out via
vision=Falsekwarg on the tool call. - Read errors surface in
vision_skip_reasonrather than raising.
Docker sandbox on code_execution
sandbox=Trueonrun_coderuns the snippet inside an ephemeral container with--network none, a--read-onlyroot filesystem, and a writable 64 MB/tmptmpfs.network=Trueopts back into bridge networking (rarely needed — isolation is the point).image=...overrides the per-language default image.- Default images:
python:3.11-slim,node:22-alpine,ruby:3.3-alpine,alpine:3.20for shells, plus typescript / php / perl / lua / r. Override viaSANDBOX_IMAGES. - Workspace mounted read-only at
/work; snippet can read but not modify host files. workspace_rootkwarg points the tool at any user-chosen directory — per-call override of the shared default. Works in both sandbox and non-sandbox modes.- Graceful fallback when Docker isn't installed — returns
metadata={"ok": False, ...}+ a clear install hint. Runs never crash.
Specialists that run + test code
- All seven role specialists (
generalist-developer,debugger,design-reviewer,product-manager,sales-outreach,customer-success,marketing-writer) now ship withrun_code+ask_user_asyncin their tool list. - Developer + debugger also keep
bash+run_tests; designer gainscomputer_use; PM + sales + marketing gainresearch_brief; CS keepshubspot_ops+gmail+slack. - Every
run_codecall acceptsworkspace_rootso the user points the specialist at their project.
Notebook 45 + 3 new doc pages
- Notebook 45 — 34 cells —
45_cost_router_async_ask_vision_sandbox.ipynb. Covers routing, async ask, vision, sandbox, workspace override, composed live streaming (critic + artifacts + router together), JSONL stream, parallel fan-out stream, and a specialist-as-developer example. routing/cost-router.md— full CostRouter guide, custom classifier recipe,force_tieroverride,SpendReportschema.autopilot/ask-user-async.md— side-channel anatomy, CLI + programmatic answer path, prompt-design rules.tools/code-execution-sandbox.md— per-language image table,workspace_rootuse case, Docker-missing fallback.tools/computer-use.md— new "Vision feedback" section with themetadata["vision"]contract and opt-out.
Tests
- 58 new tests (27
test_cost_router.py, 14test_askuser_async.py, 5test_computer_use_vision.py, 12test_code_execution_sandbox.py). - Grand total: 863 tests, 0 failures.
New / changed public surface
| Symbol | Where |
|---|---|
CostRouter, Tier, SpendReport, DifficultyTier, classify_difficulty | shipit_agent.routing |
ask_question, write_answer, pending_questions, all_entries, channel_file, channel_dir, clear | shipit_agent.askuser_channel |
AskUserAsyncTool | shipit_agent.tools.ask_user_async |
AutopilotResult.status == "awaiting_user" | shipit_agent.autopilot.result |
CodeExecutionTool.run(..., sandbox=True, network=True, image="...", workspace_root="...") | shipit_agent.tools.code_execution |
build_sandbox_command, SANDBOX_IMAGES, SANDBOX_CMDS | shipit_agent.tools.code_execution.sandbox |
ComputerUseTool.run(..., vision=False) + metadata["vision"/"image_base64"/"media_type"] | shipit_agent.tools.computer_use |
shipit answer <run_id> [text] [--index N] | CLI subcommand |
v1.0.5 — 2026-04-18
Prebuilt agents, multi-agent crews, notifications, and cost tracking. 40 ready-to-use agent personas across 8 categories. DAG-based ShipCrew orchestration with sequential, parallel, and hierarchical modes. Slack, Discord, and Telegram notification hub. Real-time cost tracking with budget enforcement. 4 new notebooks, 4 new doc pages, 153 new tests. 706 total tests. All passing.
Prebuilt Agents — 40 Ready-to-Use Personas
shipit_agent.agentsmodule —AgentDefinitiondataclass +AgentRegistryfor loading, searching, and composing agent personas.- 40 agents across 8 categories: Architecture (5), Code Quality (6), Security (5), DevOps (5), Testing (5), Planning (4), Research (5), Content (5).
AgentRegistry.default()— loads in one line. Search, browse by category, merge with project-local agents..shipit/agents/override — drop JSON files in your project; they override built-in agents with the same ID.
ShipCrew — Multi-Agent Crew Orchestration
ShipCrew,ShipAgent,ShipTask— DAG-based multi-agent crews with task dependencies.- Three execution modes:
sequential,parallel(ThreadPoolExecutor),hierarchical(LLM-driven assignment + review). - Template variable resolution —
{output_key}in descriptions auto-resolves from upstream outputs. ShipAgent.from_registry()— load crew agents from the prebuilt registry.- Streaming —
crew.stream()yields events for crew start, task start/complete/fail, crew complete. - Validation — cycle detection (Kahn's algorithm), missing agent checks, unknown dependency warnings.
Notification Hub — Slack, Discord & Telegram
SlackNotifier— Block Kit webhooks with color-coded severity. Zero external deps.DiscordNotifier— rich embeds with inline metadata fields.TelegramNotifier— Bot API with MarkdownV2 and auto-escaped special characters.NotificationManager— multi-channel dispatch withmin_severityandeventsfiltering.manager.as_hooks()— auto-notify on agent lifecycle events.
Cost Tracking & Budgets
CostTracker— real-time per-call cost tracking with 20+ model pricing table.Budget(max_dollars=5.00)— enforcement withBudgetExceededErrorandon_cost_alertcallback.tracker.as_hooks()— automatic cost tracking from every LLM call.- Model aliases —
"opus","sonnet","haiku"resolve to full model IDs.
Notebooks, Docs & Tests
- 4 notebooks: Prebuilt Agents (25 cells), ShipCrew (25 cells), Notifications (27 cells), Cost Tracking (31 cells).
- 4 doc pages:
guides/prebuilt-agents.md,deep-agents/ship-crew.md,guides/notifications.md,guides/cost-tracking.md. - 153 new tests (553 → 706 total). 29 new source files.
v1.0.4 — 2026-04-12
Skills, tools, and runtime power-up. All 32 tool prompts rewritten with decision trees and anti-patterns. Full skill-to-tool linking for all 37 packaged skills. Automatic iteration boost for skill-driven workflows. Expanded bash allowlist (50+ commands). Streaming, chat, and project-building examples across 3 notebooks. Comprehensive docstrings across every key module. 32 skill tests. All passing.
Skills — Full Tool Linking
- 37 skill tool bundles (up from 10) — every packaged skill now declares the built-in tools it needs. When a skill is selected, the agent auto-attaches the right tools.
- Shared tool groups (
_FILE_CORE,_CODE_CORE,_WEB_CORE) reduce duplication across bundles. validate_tool_bundles()— new helper that checks every tool name inSKILL_TOOL_BUNDLESagainst the real builtin map.
Agent — Iteration Boost & Efficiency
_effective_max_iterations()— auto-boosts 4 → 8 when skills inject extra tools so skill-driven workflows can complete without cutting off early.- Single skill computation —
run()andstream()now compute skills once and reuse (previously 3x per call).
Tool Prompts — All 32 Upgraded
Every tool's prompt.py rewritten with decision trees, anti-patterns, workflow guidance, and cross-tool coordination.
Bash Allowlist Expansion
- 50+ safe commands added:
mkdir,touch,cp,mv,echo,grep,curl,docker,kubectl,terraform,aws,go,cargo,npx,tsc,eslint,black,isort,tree,awk,cut,diff, and more.
Documentation
- Comprehensive docstrings on
agent.py,builtins.py,skills/loader.py,skills/registry.py,skills/tool_bundles.py,deep_agent/factory.py. - 6 tool doc pages updated with enhanced prompts.
- Skills guide expanded with 7 real-world examples, streaming sections, chat sessions, and event type reference.
- Notebook 27 rewritten (38 cells): streaming, chat streaming, project build, web scraping, DeepAgent chat.
- Notebook 29 (new): DeepAgent + skills + memory + verify + reflect + sub-agents + streaming.
- Notebook 30 (new): real-world full project build across 6 steps with 5 different skills.
Tests
- 15 new tests (17 → 32 total): iteration boost, bundle validation, chat sessions, streaming, chat streaming, memory + skills, DeepAgent chat/stream.
v1.0.3 — 2026-04-11
Major feature release. Super RAG subsystem, DeepAgent factory (verify / reflect / goal / sub-agents), live multi-agent chat REPL (shipit chat), Agent memory cookbook, plus deep docs + notebook coverage. 521 unit tests. 19 Bedrock end-to-end smoke tests. All passing.
Super RAG
shipit_agent.ragsubsystem — pluggable chunker + embedder + vector store + keyword store + hybrid pipeline (vector + BM25 + RRF + recency bias + rerank + context expansion).rag=on every agent type — auto-wiresrag_search/rag_fetch_chunk/rag_list_sourcestools, augments the system prompt with citation instructions, and attachesresult.rag_sourceswith stable[N]citation indices.- Adapters —
DrkCacheVectorStore(pgvector over psycopg2) + lazy Chroma / Qdrant / pgvector. - Thread-local per-run source tracker so concurrent runs never leak citations.
DeepAgent
shipit_agent.deep.DeepAgent— power-user factory bundling seven deep tools:plan_task,decompose_problem,workspace_files,sub_agent,synthesize_evidence,decision_matrix,verify_output. Guide- One-flag power features:
verify=True,reflect=True,goal=Goal(...),rag=RAG(...),memory=AgentMemory(...). agents=sub-agent delegation — plug any mix of agent types as named delegates via a built-indelegate_to_agenttool.create_deep_agent()functional helper — auto-wraps plain Python callables as tools.- Nested event streaming — sub-agent events surface inside
tool_completed.metadata['events'].
Live chat REPL
shipit chat— modern multi-agent terminal REPL. Switch agent types live, index files mid-session, save/load conversations, togglereflect/verify, inspect tools and sources. Guide- Rich slash commands:
/agent,/agents,/tools,/sources,/index,/rag,/goal,/reflect,/verify,/history,/save,/load,/reset,/info, … - Pluggable LLM provider via
--provider; persistent sessions via--session-dir.
Streaming
DeepAgent.stream()covers every execution mode (direct, verified, reflective, goal-driven, sub-agent delegation).PersistentAgent.stream()added with per-step checkpointing.rag_sourcesevent type added — emitted after every RAG-backed run.
Memory
- Dedicated Agent → Memory cookbook explaining the two memory systems (
memory_store=for the LLM'smemorytool vsAgentMemoryfor application-curated profiles). Guide - DeepAgent auto-hydration —
memory=AgentMemory(...)seeds the inner agent'shistoryfrom the conversation summary. - Notebook 26 — runnable end-to-end tour.
Docs
- New Agent section (6 pages): Overview, Examples, Streaming, With RAG, With Tools, Memory, Sessions.
- New Super RAG section (6 pages): Overview, Standalone, Files & Chunks, With Agent, With Deep Agents, Adapters, API.
- New DeepAgent page. Reference
- Parameters Reference — every constructor parameter for every agent type and key class. Reference
- Updated Architecture + Model Adapters reference pages.
- Updated quickstart with Agent / Deep Agent / RAG sections.
- Updated FAQ with "Agent types — which one should I use?".
- 5 new notebooks (22–26): RAG basics, RAG + Agent, RAG + Deep Agents, DeepAgent chat, Agent memory.
- Full-width docs layout + collapsible TOC with floating toggle, persistence via localStorage.
Build
shipit-chatscript entry point.- Granular extras:
rag,rag-openai,rag-cohere,rag-chroma,rag-qdrant,rag-pgvector,rag-drk-cache,rag-pdf,rag-docx,rag-rerank-cohere,rag-rerank-cross-encoder, plusbedrock,google,groq,together,ollama. Theallextra bundles everything.
Fixed
- Tool schema format bug —
RAGSearchTool,RAGFetchChunkTool,RAGListSourcesTool,WebhookPayloadToolnow use the wrapped{"type": "function", "function": {...}}shape. Previously they were returning flat dicts and Bedrock's Converse API was rejecting them with empty-name validation errors. New regression test scans every tool for Bedrock compatibility. memory=AgentMemorytype coercion —DeepAgentandGoalAgentno longer auto-assignAgentMemory.knowledge(aSemanticMemory) intomemory_store=(which expects aMemoryStore).memory=now only seedshistory; users passmemory_store=explicitly for the runtime'smemorytool.Agent.with_builtins(tools=[...])keyword collision — the method now accepts and merges usertools=with the builtin catalogue (last-write-wins on name collision).AgentDelegationToolstreaming — uses inner agent'sstream()and packs events intotool_completed.metadata['events'].
Test coverage
- 521 unit tests (up from 285) — green.
- 19 end-to-end Bedrock smoke tests in
scripts/smoke_bedrock_e2e.pycover every public surface end-to-end against real Bedrock.
v1.0.2 — 2026-04-10
Major feature release. Deep agents, structured output, pipelines, agent teams, advanced memory, output parsers, and runtime power features. 285 tests. 12 examples. 8 notebooks. 13 new doc pages.
Deep Agents
- GoalAgent — Autonomous goal decomposition with success criteria, streaming, and
.with_builtins(). Guide - ReflectiveAgent — Self-evaluation with quality scores and revision loop. Guide
- Supervisor / Worker — Hierarchical delegation with quality review. Guide
- AdaptiveAgent — Runtime tool creation from Python code. Guide
- PersistentAgent — Checkpoint and resume across sessions. Guide
- Channel / AgentMessage — Typed agent-to-agent communication. Guide
- AgentBenchmark — Systematic agent testing framework. Guide
- Deep Agents API Reference — Full constructor, method, and return type docs. Reference
Structured Output & Parsers
output_schemaon Agent.run() — Pydantic models + JSON schemas. Guide- JSONParser, PydanticParser, RegexParser, MarkdownParser. Guide
Composition
- Pipeline — Sequential, parallel, conditional, function steps, streaming. Guide
- AgentTeam — LLM-routed multi-agent coordination with streaming. Guide
Advanced Memory
- ConversationMemory — buffer/window/summary/token strategies. Guide
- SemanticMemory — Embedding-based vector search. Guide
- EntityMemory — Track people, projects, concepts. Guide
- AgentMemory — Unified interface with
.default(). Guide
Runtime Power Features
- Parallel tool execution. Guide
- Graceful tool failure. Guide
- Context window management. Guide
- Hooks & middleware. Guide
- Mid-run re-planning. Guide
- Async runtime. Guide
- Transient error auto-retry (429/500/503).
Changed
- Selective memory storage (breaking) — Only
persist=Truetool results stored. - Safer retry defaults —
(ConnectionError, TimeoutError, OSError)instead of(Exception,).
v1.0.1 — 2026-04-09
Maintenance release. Bug fix in the tool runner plus repo hygiene, contributor experience, and CI hardening. Strongly recommended upgrade from 1.0.0 if you use Bedrock gpt-oss-120b.
Fixed
ToolRunnerargument collision — FixedTypeError: got multiple values for argument 'context'when an LLM (notablybedrock/openai.gpt-oss-120b-1:0) emitscontextas a tool-call argument. The runner now strips reserved argument names (context,self) from tool-call arguments before forwarding. Affects every built-in tool.
Added
CHANGELOG.mdat repo root in Keep a Changelog formatCONTRIBUTING.mdwith dev setup, commit conventions, PR checklist, and "how to add a new LLM adapter / tool" guides- GitHub issue templates — structured bug report, feature request, and config forms
- PR template with 12-item verification checklist
- Test CI —
pytest -qon Python 3.11 + 3.12 × Ubuntu + macOS (4 matrix cells), with smoke-test of all 11 LLM adapter imports - Gitleaks secret scanning CI with SARIF upload to GitHub Security tab, inline PR comments, Actions summary
- Pre-commit hooks — trailing whitespace, EOF fixer, YAML/TOML validation, gitleaks v8.21.2, ruff lint + format
- Gitleaks allowlist for runtime tool outputs (scraped HTML contains false-positive "API keys" like Pushly domainKeys)
Changed
.gitignorerewritten to dedupe entries and cover all runtime directories (site/,.eggs/,pip-wheel-metadata/)- Runtime tool outputs untracked from git (
sessions/,traces/,memory.json,.shipit_notebooks/**) — they were accidentally committed in 1.0.0
Security
- Added CI and pre-commit secret scanning to prevent future credential leaks
- No runtime code changed —
shipit_agent/module is byte-identical to 1.0.0
v1.0.0 — 2026-04-09
First stable release. Focused on making the agent loop observable, interchangeable, and out of the way.
🧠 Live reasoning / thinking events
LLMResponse.reasoning_contentfield added to carry thinking/reasoning blocks from any provider- New
_extract_reasoning()helper handles three shapes:- Flat
reasoning_contenton the response message (OpenAI o-series,gpt-oss, DeepSeek R1, Anthropic via LiteLLM) - Anthropic
thinking_blocks[*].thinking(Claude extended thinking) model_dump()fallback for pydantic dumps
- Flat
- Runtime emits
reasoning_started+reasoning_completedevents whenever reasoning content is non-empty - All three LLM adapters —
OpenAIChatLLM,AnthropicChatLLM,LiteLLMChatLLM/BedrockChatLLM— share the extraction helper OpenAIChatLLMauto-passesreasoning_effort="medium"for reasoning-capable models (o1*,o3*,o4*,gpt-5*,deepseek-r1*)AnthropicChatLLMsupportsthinking_budget_tokens=Nto enable Claude extended thinking
⚡ Truly incremental streaming
agent.stream()now runs the agent on a background daemon thread- Events are pushed through a thread-safe
queue.Queueas they're emitted - Consumer loop yields events the instant they happen — no buffering, no batched delivery
- Worker exceptions are captured and re-raised on the consumer thread
- Works in Jupyter, VS Code, JupyterLab, WebSocket/SSE transports, and plain terminals
🛡️ Bulletproof Bedrock tool pairing
- Planner output is now injected as a
user-role context message rather than an orphanrole="tool"message — fixes Bedrock's "number of toolResult blocks exceeds number of toolUse blocks" error - Every
response.tool_callsentry gets a tool-result message unconditionally:- Success → real tool-result
- Retry → retries first, then final result or error
- Unknown tool → synthetic
"Error: tool X is not registered"tool-result
- Stable
call_{iteration}_{index}tool_call_ids round-trip through message metadata - Multi-iteration tool loops on Bedrock Claude, gpt-oss, and Anthropic native now work without
modify_paramsband-aids
🔑 Zero-friction provider switching
build_llm_from_env()walks upward from CWD to discover.env, so notebooks and scripts work regardless of where they're launched from- Seven providers:
openai,anthropic,bedrock,gemini,vertex,groq,together,ollama, plus a genericlitellmprovider - Per-provider credential validation with clear error messages
SHIPIT_OPENAI_TOOL_CHOICE=requiredenv var to force tool use on lazy models likegpt-4o-mini
🌐 In-process Playwright for open_url
OpenURLToolnow uses Playwright's sync Chromium directly (headless, realistic desktop Chrome UA, 1280×800 viewport)- Handles JS-rendered pages, anti-bot 503s, modern TLS/ALPN
- Stdlib
urllibfallback when Playwright is not installed — zero third-party HTTP dependencies in the core fallback path - Errors never raise out of the tool: they return as
ToolOutputwith awarningslist in metadata - Rich metadata:
fetch_method,status_code,final_url,title
🔍 Upgraded ToolSearchTool
- Replaced binary substring match with drk_cache-style fuzzy scoring:
SequenceMatcher.ratio() + 0.12 × token_hits - Configurable
limitparameter, clamped to[1, max_limit] - New init kwargs:
max_limit,default_limit,token_bonus - Structured error output for empty queries
- Ranked output with scores and "when to use" hints from
prompt_instructions - Noise filter: results below
score=0.05dropped
🪵 Full event taxonomy
14 distinct event types with documented payloads:
run_started, mcp_attached, planning_started, planning_completed, step_started, reasoning_started, reasoning_completed, tool_called, tool_completed, tool_retry, tool_failed, llm_retry, interactive_request, run_completed
🔁 Iteration-cap summarization fallback
- If the model is still calling tools when
max_iterationsis reached, the runtime gives it one more turn withtools=[]to force a natural-language summary run_completedis never empty for normal runs- Guarded with try/except so summarization failures can't mask the rest of the run
Other changes
pyproject.toml:[project.urls]now points to correct GitHub org, addsDocumentationandChangeloglinks.env.example: expanded with all new env vars documentednotebooks/04_agent_streaming_packets.ipynb: full rewrite with .env loading, credential visibility printer, and live Markdown updatesREADME.md: new v1.0 release section with 8 headline features- Full MkDocs Material documentation site at shipiit.github.io/shipit_agent
Breaking changes
None — this is the first stable release. Subsequent 1.x releases will maintain backward compatibility within the 1.x line.