Changelog


v1.0.12 — 2026-06-07

Claude API power — plus cross-provider prompt-cache accounting. v1.0.12 adds the Anthropic API's highest-leverage server features as first-class passthroughs — server-side tools, document citations, the Batch API, and interleaved thinking + server-side context editing — on top of v1.0.11's control plane. It also makes prompt caching honestly cross-provider: OpenAI's automatic cache reads are now surfaced for cost tracking. Each feature is honest about provider support; no public API was removed.

Server-side tools (Anthropic-hosted)

  • web_search(), code_execution(), computer_use(), bash(), text_editor() from shipit_agent.llms — declare them in the tools= list you pass to AnthropicChatLLM.complete(...) (mixed freely with client-side tools). They run inside Anthropic's own sandbox — zero local infrastructure, no client-side tool loop.
  • Beta headers handled automatically: code_execution and computer_use attach their betas and route to the beta endpoint; web_search is GA and stays on the GA endpoint.
  • Surfaces in metadata: LLMResponse.metadata["server_tool_use"] and LLMResponse.metadata["server_tool_results"], only when present.
  • Provider note — these are Anthropic API shapes (also reachable for Anthropic models via Bedrock / LiteLLM); other providers use their own native server tools. See Agent → Server-side tools.

Citations & the Batch API

  • Citation document helpers: text_document / pdf_document / url_pdf_document / content_document from shipit_agent.llms, with citations.enabled on by default. Claude grounds its answer in the document and the cited spans are parsed into metadata["citations"] — verifiable RAG.
  • Batch API runtime: BatchRequest, BatchResult, and BatchRuntime.run(...) (in shipit_agent.batch) wrap Anthropic's Messages Batches API for bulk, latency-tolerant runs at roughly 50% of standard per-token price. submit / status / results / cancel are exposed too.
  • Provider note — Anthropic citations and Anthropic batches today; OpenAI also has a Batch API and generalising the runtime is on the roadmap. See Agent → Citations & Batch API.

Interleaved thinking & context editing

  • AnthropicChatLLM(interleaved_thinking=True, thinking_budget_tokens=…) — the model thinks between tool calls. The interleaved-thinking-2025-05-14 beta attaches only when both are set; metadata["thinking_blocks"] carries the signed thinking blocks for round-tripping.
  • context_management= — forwarded as Anthropic's context_management request param (with its beta header) so the API clears stale tool results server-side.
  • Provider note — extended / interleaved thinking is Anthropic; OpenAI reasoning models and Gemini thinking are the equivalents elsewhere (reasoning content is captured for all of them). See Agent → Interleaved thinking & context editing.

Cross-provider prompt caching

  • OpenAI cached-token surfacing — OpenAI does automatic prompt caching; shipit now reads usage.prompt_tokens_details.cached_tokens into usage["cache_read_input_tokens"] — the same key the CostTracker reads for Anthropic — so OpenAI cache reads bill at the cheaper rate. LiteLLMChatLLM forwards both shapes. Anthropic / Bedrock / Vertex keep explicit cache_control breakpoints (default on for Claude). Caching is cross-provider, not Anthropic-only. See Agent → Prompt caching.
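The mapping described above can be pictured with a small sketch. It is not the library's code: `normalize_usage` is a hypothetical helper, and only the two field names (`prompt_tokens_details.cached_tokens` and `cache_read_input_tokens`) come from the note itself.

```python
def normalize_usage(raw_usage: dict) -> dict:
    """Hypothetical helper: copy an OpenAI-style cached-token count onto
    the cache_read_input_tokens key that Anthropic-style usage already has."""
    usage = dict(raw_usage)
    if "cache_read_input_tokens" not in usage:
        details = usage.get("prompt_tokens_details") or {}
        usage["cache_read_input_tokens"] = details.get("cached_tokens") or 0
    return usage


# Illustrative payloads, not captured from real API calls.
openai_style = {"prompt_tokens": 1200, "prompt_tokens_details": {"cached_tokens": 1024}}
anthropic_style = {"input_tokens": 300, "cache_read_input_tokens": 900}

print(normalize_usage(openai_style)["cache_read_input_tokens"])
print(normalize_usage(anthropic_style)["cache_read_input_tokens"])
```

With one shared key, a cost tracker only needs a single code path for cache reads regardless of provider.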

v1.0.11 — 2026-06-07

A control plane for tool calls — plus prompt caching and a memory tool. v1.0.11 brings the Claude Code permission layer to the library: declarative allow/deny/ask rules, read-only plan mode, human-in-the-loop callbacks, and blocking/rewriting hooks. It also turns on Anthropic prompt caching by default for Claude-family models and adds an Anthropic-style memory tool. Folded in is the 1.0.10 bug-fix & hardening work. No public API was removed.

Control plane — permissions, plan mode & hooks

  • Agent(permission_mode=...): "default", "acceptEdits", "plan", or "bypass", mirroring Claude Code. plan makes a run read-only; acceptEdits auto-approves file edits.
  • PermissionEngine(allow=[...], deny=[...], ask=[...]): fnmatch globs on the tool name, with a predictable precedence of deny > mode > allow > ask > callback > default. Pass it as permissions= (also accepts a bare mode string or a kwargs dict).
  • agent.plan(prompt) — one-call read-only planning: the agent may use read-only tools and writes a step-by-step plan instead of acting.
  • permission_callback(name, args) -> PermissionResult | None — programmatic human-in-the-loop approval, consulted on ask rules and as a catch-all.
  • Blocking & rewriting hooks: @hooks.on_before_tool may return {"decision": "deny"} to block a call or a PermissionResult with updated_arguments to rewrite it; @hooks.on_user_prompt can redact the incoming prompt. Hooks remain observe-only when they return None.
  • Denied calls are visible — a blocked tool emits a tool_denied event and feeds the model a "was NOT run" tool message so it can recover.
  • New top-level exports: PermissionEngine, PermissionResult, PermissionDecision. See Agent → Permissions, plan mode & blocking hooks.
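As a rough illustration of the precedence chain listed above (deny > mode > allow > ask > callback > default), here is a self-contained sketch built on the standard-library `fnmatch`. The function `decide`, its string verdicts, the `read_only` parameter, and the `"allow"` fallback are all assumptions made for the example; the real PermissionEngine may resolve each step differently.

```python
from fnmatch import fnmatch


def decide(tool, *, deny=(), allow=(), ask=(), mode="default",
           read_only=(), callback=None, default="allow"):
    """Hypothetical resolver: deny > mode > allow > ask > callback > default."""
    def hit(patterns):
        return any(fnmatch(tool, pattern) for pattern in patterns)

    if hit(deny):                      # 1. explicit deny always wins
        return "deny"
    if mode == "bypass":               # 2. mode (assumed semantics)
        return "allow"
    if mode == "plan" and not hit(read_only):
        return "deny"
    if hit(allow):                     # 3. explicit allow
        return "allow"
    if hit(ask) and callback is None:  # 4. ask rule with nobody to consult
        return "ask"
    if callback is not None:           # 5. callback, also used as catch-all
        verdict = callback(tool)
        if verdict is not None:
            return verdict
    return "ask" if hit(ask) else default  # 6. fallback (assumed)


print(decide("bash", deny=["bash"], allow=["*"]))
print(decide("write_file", mode="plan", read_only=["read_*"], allow=["*"]))
print(decide("open_url", ask=["open_*"], callback=lambda name: "allow"))
```

Putting deny first means a broad allow glob such as `*` can never re-enable a tool that was explicitly blocked.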

Prompt caching

  • On by default for Claude-family models. AnthropicChatLLM(prompt_caching=True) and LiteLLMChatLLM(prompt_caching=True) cache the stable prefix (system prompt + tools) on Anthropic, Bedrock, and Vertex via cache_control breakpoints.
  • Cost-aware: usage["cache_read_input_tokens"] and usage["cache_creation_input_tokens"] flow into CostTracker; cache reads bill at roughly 10% of input. See Agent → Prompt caching.
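The "roughly 10% of input" figure can be made concrete with a little arithmetic. The sketch below is illustrative only: `input_cost` is a hypothetical helper, it assumes `input_tokens` excludes cached tokens, and the 1.25x cache-write multiplier is an assumption that the changelog does not state.

```python
def input_cost(usage, price_per_mtok, read_discount=0.10, write_multiplier=1.25):
    """Hypothetical estimate of input-side cost in dollars.

    read_discount follows the "roughly 10%" figure above;
    write_multiplier is an assumed value, not taken from the changelog.
    """
    fresh = usage.get("input_tokens", 0)
    reads = usage.get("cache_read_input_tokens", 0)
    writes = usage.get("cache_creation_input_tokens", 0)
    weighted = fresh + reads * read_discount + writes * write_multiplier
    return weighted * price_per_mtok / 1_000_000


# 10,000 prompt tokens, 9,000 of them served from cache, at $3 per million.
cached = input_cost({"input_tokens": 1_000, "cache_read_input_tokens": 9_000}, 3.0)
uncached = input_cost({"input_tokens": 10_000}, 3.0)
print(f"cached: ${cached:.4f}  uncached: ${uncached:.4f}")
```

Under these assumptions the cached call costs about a fifth of the uncached one, which is why a stable system-prompt-plus-tools prefix pays off quickly.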

Claude-style memory tool

  • ClaudeMemoryTool — Anthropic's memory_20250818 tool shape: a single command-driven tool (view / create / str_replace / insert / delete / rename) over a sandboxed memory directory (.shipit_workspace/memories by default) for cross-session learning. Attach via Agent(tools=[ClaudeMemoryTool(...)]). See Tools → Claude-style memory tool.

Hardening (folded in from 1.0.10)

  • text_delta_callback regression (v1.0.9) fixed — the runtime now inspects the adapter signature and only passes the callback to adapters that accept it, so every custom adapter works unchanged.
  • Multi-turn sessions no longer stack duplicate system prompts when reusing a session_store + session_id.
  • Tool security — Bash rejects command/process substitution and redirection; open_url is http(s)-only and blocks SSRF targets; the SQL read-only guard scans the whole statement and rejects stacked statements; OAuth validates the CSRF state nonce; edit_file refuses non-UTF-8 files; FileCredentialStore chmods 0600 and writes atomically.
  • Reliability — MCP transports close on error; parallel tools get isolated state; the iteration-cap turn is now accounted for; CostTracker flags has_unknown_pricing instead of silently billing $0.
  • 1742 tests passing (+180 new). 0 regressions.
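The signature check behind the text_delta_callback fix can be sketched with `inspect.signature`. This is an assumption about the general technique, not the runtime's actual code; `call_adapter` and both adapter functions are hypothetical stand-ins.

```python
import inspect


def call_adapter(complete, messages, text_delta_callback=None):
    """Pass text_delta_callback only to adapters whose signature accepts it."""
    params = inspect.signature(complete).parameters
    accepts = "text_delta_callback" in params or any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    if accepts and text_delta_callback is not None:
        return complete(messages, text_delta_callback=text_delta_callback)
    return complete(messages)


def legacy_complete(messages):
    # Older custom adapter: knows nothing about streaming.
    return "full text"


def streaming_complete(messages, text_delta_callback=None):
    # Newer adapter: emits chunks, then returns the full text.
    for chunk in ("full ", "text"):
        if text_delta_callback:
            text_delta_callback(chunk)
    return "full text"


chunks = []
print(call_adapter(streaming_complete, [], chunks.append), chunks)
print(call_adapter(legacy_complete, [], chunks.append))
```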

v1.0.9 — 2026-05-14

Inline text streaming + multimodal media references. Two features that make shipit feel live in chat UIs: token-by-token text streaming for real-time typing, and first-class image/file references in prompts.

Inline text streaming

  • LLM.complete(text_delta_callback=…) — stream assistant text token-by-token as it's generated, instead of waiting for the full response. The callback fires for each incremental text chunk.
  • AgentRuntime emits text_delta events — drive SSE or WebSocket consumers directly from the event stream for ChatGPT-style live typing in the browser.
  • Non-streaming behavior preserved — streaming is opt-in per call; omit the callback and complete() behaves exactly as before.
  • Implemented end-to-end for LiteLLM, and no-op-compatible for the other adapters (they return the full text in a single delta), so nothing breaks if a backend can't stream.

Multimodal media references

  • MediaReference — reference an image or file inside a prompt without inlining bytes; the runtime resolves it at send time.
  • MediaStore — pluggable storage for media, with InMemoryMediaStore for tests and short-lived runs and FileMediaStore for on-disk persistence.
  • extract_media_refs + build_multimodal_message — pull media references out of a prompt and assemble the provider-native multimodal message payload.
  • See Agent → Multimodal chat.

v1.0.8 — 2026-05-09

Structured output overhaul + verifier network. Two flagship features that genuinely beat LangChain on the surfaces it tries hardest at — and ship more broadly applicable wins than v1.0.7's connector explosion.

Structured output — same-conversation validation retry

  • Agent.run(prompt, output_schema=MyModel, max_validation_retries=2) — pass a Pydantic model or JSON Schema dict; get back a typed result.parsed.
  • Auto-retry on validation failure inside the same conversation — when the first parse fails, the runtime appends the bad assistant turn + a corrective user turn ("that response could not be parsed: …") and retries. No separate "fixing LLM" call (LangChain's OutputFixingParser requires one).
  • Streaming partial JSON parser: parse_partial_json('{"a": "hel') returns {"a": "hel"}. StructuredOutput.stream(prompt) yields progressively richer dicts as tokens arrive, then a final validated typed object.
  • StructuredOutput — standalone wrapper for one-shot extraction without the agent loop; same retry path, exposed publicly.
  • New top-level exports: StructuredOutput, StructuredOutputResult, parse_partial_json. New result.parsed and result.output (corrected-text) semantics on Agent.run.
  • See Agent → Structured output.
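One way to get the partial-JSON behavior described above is to close whatever strings and brackets are still open and then parse. The sketch below shows only that idea; `close_partial_json` is a hypothetical name, and the shipped parse_partial_json may handle more cases (dangling keys, trailing commas) than this does.

```python
import json


def close_partial_json(text):
    """Close open strings and brackets, then parse; None if still invalid."""
    closers = []          # closing brackets we still owe, innermost last
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            closers.append("}" if ch == "{" else "]")
        elif ch in "}]" and closers:
            closers.pop()
    candidate = text
    if in_string:
        if escaped:
            candidate = candidate[:-1]   # drop a dangling backslash
        candidate += '"'
    candidate += "".join(reversed(closers))
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None


print(close_partial_json('{"a": "hel'))
print(close_partial_json('{"items": [1, 2'))
```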

Verifier network — process supervision

  • Agent(verifier=VerifierNetwork(llm=cheap_llm)) — a second cheap LLM vetoes hallucinated tool calls before they fire and rates progress between iterations. Both checks fail open (verifier failures never block the agent).
  • Pre-tool veto — wraps every tool. Verifier returns allow | veto | rewrite; vetoed calls become synthetic error tool-results so the agent re-plans without the bad action having actually run.
  • Progress check — after each iteration, scores progress 0-1. When the score stays below progress_threshold for progress_window consecutive iterations, maybe_nudge() returns a "you're stalling" message you can inject as a user turn.
  • Confidence-gated — verdicts below veto_min_confidence get downgraded to ALLOW (avoid over-blocking on uncertain calls).
  • Hard caps: max_pretool_calls_per_run and max_progress_calls_per_run, so the verifier itself can't run away on cost.
  • Telemetry: verifier.stats exposes per-run counters: vetoes, rewrites, nudges, score history.
  • LangGraph's ToolNode has no per-call gating. LangChain's RunnableWithMessageHistory has no progress detector. Process supervision in shipit is one constructor argument.
  • See Agent → Verifier network.
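The progress check above boils down to a sliding window over recent scores. A minimal sketch follows, assuming a nudge fires only when the window is full and every score in it is below the threshold; `StallDetector` and its default values are hypothetical, while the parameter names echo the ones mentioned above.

```python
from collections import deque


class StallDetector:
    """Hypothetical sliding-window stall check (not the library class)."""

    def __init__(self, progress_threshold=0.3, progress_window=3):
        self.threshold = progress_threshold
        self.recent = deque(maxlen=progress_window)

    def record(self, score):
        self.recent.append(score)

    def maybe_nudge(self):
        window_full = len(self.recent) == self.recent.maxlen
        if window_full and all(s < self.threshold for s in self.recent):
            return "You appear to be stalling; try a different approach."
        return None


detector = StallDetector(progress_threshold=0.3, progress_window=3)
for score in (0.1, 0.2, 0.1):
    detector.record(score)
print(detector.maybe_nudge())
```

A single good score pushes the oldest bad one out of the window and resets the nudge, so brief dips do not trigger it.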

Episodic memory consolidation

  • MemoryConsolidator(llm=cheap_llm).consolidate(memory=..., recent_messages=...) — LLM distills the last conversation into 3-8 durable facts and writes them to SemanticMemory. Categories (preference, project, goal, person, other) are tracked for filtering.
  • Forgetting curve: consolidator.decay(knowledge, half_life_days=14) applies exponential decay to fact strength and prunes facts below forgetting_threshold. Pure local arithmetic; no LLM call.
  • Core memory: consolidator.core_memory(knowledge, top_k=5) returns the top-K most-load-bearing facts ranked by strength + 0.1·log1p(retrievals). Inject into the system prompt every turn for ChatGPT-style "remembers things across sessions".
  • Retrieval bumping: consolidator.record_retrieval(knowledge, [fact_texts]) increments retrieval counts. Frequently-retrieved facts rise to core memory automatically.
  • New top-level exports: MemoryConsolidator, DistilledFact, ConsolidationResult.
  • ChatGPT's Memories feature is add_fact(text) with no decay, no retrieval-based promotion. Ours is principled and self-hostable.
  • See Agent → Episodic memory consolidation.
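The forgetting curve and the core-memory ranking are both short formulas, so a worked sketch may help. It assumes the decay is strength × 0.5^(age / half-life) and that ranking uses the decayed strength; the fact dictionaries, `core_memory_sketch`, and the 0.1 threshold are illustrative, not the library's data model.

```python
import math


def decayed_strength(strength, age_days, half_life_days=14):
    # Exponential decay: strength halves every half_life_days.
    return strength * 0.5 ** (age_days / half_life_days)


def core_rank(strength, retrievals):
    # Ranking formula quoted above: strength + 0.1 * log1p(retrievals).
    return strength + 0.1 * math.log1p(retrievals)


def core_memory_sketch(facts, top_k=5, forgetting_threshold=0.1):
    """Decay, prune below the (assumed) threshold, then rank."""
    scored = []
    for fact in facts:
        strength = decayed_strength(fact["strength"], fact["age_days"])
        if strength >= forgetting_threshold:
            scored.append((core_rank(strength, fact["retrievals"]), fact["text"]))
    scored.sort(reverse=True)
    return [text for _, text in scored[:top_k]]


facts = [
    {"text": "prefers dark mode", "strength": 1.0, "age_days": 28, "retrievals": 0},
    {"text": "ships on Fridays", "strength": 0.8, "age_days": 0, "retrievals": 5},
    {"text": "old trivia", "strength": 0.2, "age_days": 56, "retrievals": 0},
]
print(core_memory_sketch(facts))
```

Here the 28-day-old fact has decayed to a quarter of its strength, the 56-day-old one falls below the threshold and is pruned, and the frequently retrieved fact ranks first.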

Time-travel replay

  • TraceReplayer.from_store(store, trace_id) — load any saved trace and walk events programmatically. Three constructors: from_record, from_store, from_file.
  • replayer.fork(at_event=N, edit_user_message='...') — capture the conversation state at any event, optionally with a tweaked user prompt. Returns a ReplayCheckpoint.
  • checkpoint.continue_from(agent=fresh_agent) — resume the run on a fresh Agent, with agent.history pre-filled. Forwards arbitrary Agent.run kwargs (e.g. output_schema=).
  • diff_traces(left, right) — side-by-side comparison. Reports matched events, divergence point, type mismatches, and only-in-left / only-in-right tails. .to_lines() for human-readable rendering.
  • New top-level exports: TraceReplayer, ReplayCheckpoint, ReplayResult, ForkPoint, TraceDiff, diff_traces.
  • LangSmith's Playground is SaaS-only. Inngest's branching is SaaS-only. Ours is library-level, open-source, and works against your existing FileTraceStore.
  • See Agent → Time-travel replay.
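Finding the divergence point between two traces is essentially a pairwise walk. The sketch below assumes each event is a dict with a "type" key, which is a guess at the trace shape; `first_divergence` is a hypothetical helper, not diff_traces itself.

```python
def first_divergence(left, right):
    """Index of the first event whose type differs, or None if identical.

    If one trace is a strict prefix of the other, the divergence point is
    the length of the shorter trace (where the longer tail begins).
    """
    for index, (a, b) in enumerate(zip(left, right)):
        if a["type"] != b["type"]:
            return index
    if len(left) != len(right):
        return min(len(left), len(right))
    return None


original = [{"type": "run_started"}, {"type": "tool_called"}, {"type": "run_completed"}]
forked = [{"type": "run_started"}, {"type": "tool_failed"}, {"type": "run_completed"}]
print(first_divergence(original, forked))
```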

ComputerUseAgent — browser automation

  • ComputerUseAgent(llm=, browser=, goal=) — screenshot → reason → act loop. Show a screenshot to a vision-capable LLM, parse a structured action back, execute, repeat until DONE.
  • PlaywrightBrowserSession.launch(headless=True) — production driver. Context-manager support; pip install playwright && playwright install chromium to enable.
  • MockBrowserSession — deterministic test double that records every call. Unit-test computer-use logic without spawning a browser.
  • Two action emit shapes — Anthropic's native computer-use tool (structured tool_use block) AND plain-text fallback (ACTION: click 100,200) for any vision LLM.
  • parse_action(raw) — pure parser, no IO. Handles both shapes plus prose-wrapped responses.
  • Recovery — when an action raises, the agent surfaces the error back to the model as a user message. Production-ready resilience without extra code.
  • New top-level exports: ComputerUseAgent, BrowserSession, MockBrowserSession, PlaywrightBrowserSession, ComputerAction, ComputerUseResult, ActionKind, ActionRecord, parse_action.
  • Devin / Multi-On / OpenAI Operator are SaaS products. Ours is a library — self-host, plug into your own loop, fork the implementation.
  • See Agent → ComputerUseAgent.
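For the plain-text fallback shape (ACTION: click 100,200), a pure parser can be a couple of regular expressions. This is a sketch of that shape only; `parse_text_action` and the returned dict layout are hypothetical, and the library's parse_action also covers the structured tool_use shape.

```python
import re

# Matches "ACTION: <kind> <rest of line>" anywhere in a prose-wrapped reply.
ACTION_RE = re.compile(r"ACTION:[ \t]*(\w+)[ \t]*(.*)")


def parse_text_action(raw):
    """Return a dict describing the action, or None if none is found."""
    match = ACTION_RE.search(raw)
    if match is None:
        return None
    kind, rest = match.group(1).lower(), match.group(2).strip()
    if kind == "click":
        coords = re.fullmatch(r"(\d+)\s*,\s*(\d+)", rest)
        if coords is None:
            return None
        return {"kind": "click", "x": int(coords.group(1)), "y": int(coords.group(2))}
    return {"kind": kind, "text": rest}


print(parse_text_action("I will press the button.\nACTION: click 100,200"))
```

Keeping the parser free of IO, as the changelog notes for parse_action, makes it easy to unit-test without a browser.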

Tests + docs

  • +318 unit tests (1190 → 1508), zero regressions, all old tests still pass.
  • Five new notebooks — 54_structured_output_with_retry.ipynb, 55_verifier_network.ipynb, 56_episodic_memory_consolidation.ipynb, 57_time_travel_replay.ipynb, 58_computer_use_agent.ipynb.
  • Five new docs pages with full API reference, configuration deep dives, cost analysis, real-life examples, and beat-LangChain / Operator / ChatGPT comparison tables.

v1.0.7 — 2026-04-24

Agents for every role. Twelve new tools, nine new persona specialists, seven persona walk-through notebooks. shipit-agent is no longer only a developer-agent framework — it ships agents for developers, designers, sales reps, PMs, data analysts, finance, customer support, and recruiters. 1190 unit tests, 286 new in this release, zero regressions.

See RELEASE_NOTES_1.0.7.md for the full breakdown.


v1.0.6 — 2026-04-21

Autopilot — the long-running runtime. Plus 7 new role specialists, 3 new tools, and 8 new notebooks. Autopilot turns any agent into a budget-gated, checkpointed, streaming worker that runs until every success criterion is met. Fan-out dispatches N children in parallel. A reflection critic short-circuits the loop once a confident reviewer confirms the goal. Artifacts capture code blocks, markdown docs, and tool outputs as structured deliverables. A scheduler daemon drains a persistent JSON queue for 24-hour operation. 8 new Bedrock-Llama notebooks. 805 total tests. All passing.

Autopilot — long-running runtime

  • Autopilot(llm, goal, budget, …) — composes GoalAgent with budget gates, atomic checkpoints, heartbeats, and a live event stream.
  • BudgetPolicy(max_seconds, max_tool_calls, max_tokens, max_dollars, max_iterations) — every axis independently honored; set any to None / 0 to disable.
  • Goal-satisfaction termination, not step count. Stops the moment every criterion is verified OR any budget trips.
  • Atomic JSON checkpoints per iteration — ~/.shipit_agent/checkpoints/<run_id>.json. Crash → autopilot.resume(run_id) picks up at the next iteration.
  • autopilot.stream(run_id) — iterator of {kind, ...} events: autopilot.run_started, autopilot.iteration, autopilot.heartbeat, autopilot.criteria_satisfied, autopilot.budget_exceeded, autopilot.result.
  • default_heartbeat_stderr — drop-in sink. Custom callables (Slack / Datadog / webhook) just as easy.
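The "every axis independently honored; None / 0 disables" rule for budgets can be sketched with a small dataclass. `BudgetSketch` and the shape of the `spent` dict are hypothetical; only the five axis names come from the BudgetPolicy signature above.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class BudgetSketch:
    """Hypothetical stand-in for a multi-axis budget (not BudgetPolicy)."""
    max_seconds: Optional[float] = None
    max_tool_calls: Optional[int] = None
    max_tokens: Optional[int] = None
    max_dollars: Optional[float] = None
    max_iterations: Optional[int] = None

    def exceeded(self, spent):
        """Return the axes whose limit has been reached."""
        tripped = []
        for field in fields(self):
            limit = getattr(self, field.name)
            if not limit:                     # None or 0 disables the axis
                continue
            axis = field.name[len("max_"):]
            if spent.get(axis, 0) >= limit:
                tripped.append(axis)
        return tripped


budget = BudgetSketch(max_tool_calls=10, max_dollars=0, max_iterations=50)
print(budget.exceeded({"tool_calls": 10, "dollars": 99.0, "iterations": 3}))
```

A run loop would stop as soon as `exceeded` returns a non-empty list, or earlier if every success criterion is verified.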

Reflection critic

  • Critic(llm=..., confidence_threshold=0.75) — scores every iteration's output against the goal's criteria and feeds suggestions into the next iteration's prompt.
  • critic=True on Autopilot to use your run's LLM as a self-check; pass a Critic(llm=reviewer_llm) for a dedicated stronger reviewer.
  • Confidence-gated termination — flag-flips only count when the critic meets the confidence gate. Low-confidence "yes" still logs feedback but does not halt.
  • JSON-tolerant parsing — handles fenced ```json, extra prose, padding/trimming of criteria, and garbage input without raising.
  • New event kind autopilot.critic on every iteration's stream.

Artifacts — structured deliverables

  • ArtifactCollector — collects Artifact(kind, name, content, language, iteration) during the run.
  • Auto-extraction from every iteration's output — fenced code blocks (kind="code", with language) and top-level markdown docs (kind="markdown").
  • Tool-metadata ingestion — tools that declare {"artifact": True, "kind": ..., "name": ..., "content": ...} in their result metadata are captured explicitly.
  • Optional disk persistence — one JSON file per artifact, handy for CI build outputs.
  • New event kind autopilot.artifact; final AutopilotResult.artifacts carries the full list.
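Auto-extraction of fenced code blocks can be approximated with one regular expression. The sketch below is an assumption about the technique, not the ArtifactCollector implementation; `extract_code_artifacts` is hypothetical and returns plain dicts using the field names listed above.

```python
import re

FENCE = "`" * 3  # built at runtime so this example contains no literal fence
BLOCK_RE = re.compile(FENCE + r"(\w*)\n(.*?)" + FENCE, re.DOTALL)


def extract_code_artifacts(output, iteration):
    """Collect each fenced block as a kind="code" artifact dict."""
    return [
        {"kind": "code", "language": lang or None, "content": body, "iteration": iteration}
        for lang, body in BLOCK_RE.findall(output)
    ]


sample = "Here is the fix:\n" + FENCE + "python\nprint('hi')\n" + FENCE + "\nDone."
print(extract_code_artifacts(sample, iteration=1))
```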

Parallel fan-out

  • autopilot.fanout(items, objective_template, criteria_template, max_parallel, child_budget_frac) — ThreadPoolExecutor-backed N-way parallelism.
  • Per-child budget scaling — each child inherits parent_budget * child_budget_frac (default 20%). Keeps aggregate spend bounded on 50-item batches.
  • autopilot.fanout_stream() — live per-child events (autopilot.fanout_child) for dashboard rendering.
  • FanoutResult(children, aggregated_output, wall_seconds, failed) — rolled-up status (completed | partial | failed), ordered children, default markdown digest or custom aggregator.
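Fan-out with per-child budget scaling and a rolled-up status can be sketched with the standard-library ThreadPoolExecutor. All names here (`scale_budget`, `rollup_status`, `fanout_sketch`, the `run_child` callable, the dict-shaped budget) are hypothetical and simplified; only the 20% default fraction and the three status values come from the notes above.

```python
from concurrent.futures import ThreadPoolExecutor


def scale_budget(parent_budget, frac=0.2):
    # Each child gets a fixed fraction of every parent limit.
    return {axis: limit * frac for axis, limit in parent_budget.items()}


def rollup_status(statuses):
    if all(s == "completed" for s in statuses):
        return "completed"
    if any(s == "completed" for s in statuses):
        return "partial"
    return "failed"


def fanout_sketch(items, run_child, parent_budget, max_parallel=4, child_budget_frac=0.2):
    child_budget = scale_budget(parent_budget, child_budget_frac)
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # pool.map keeps results in the same order as items.
        children = list(pool.map(lambda item: run_child(item, child_budget), items))
    return children, rollup_status([c["status"] for c in children])


def demo_child(item, budget):
    # Stand-in for running one child agent.
    status = "failed" if item == "bad" else "completed"
    return {"item": item, "status": status, "budget": budget}


children, status = fanout_sketch(["a", "bad", "c"], demo_child, {"max_dollars": 10.0})
print(status, [c["item"] for c in children])
```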

Scheduler daemon

  • SchedulerDaemon(llm_factory, queue_path, tick_seconds) — persistent JSON goal queue at ~/.shipit_agent/autopilot-queue.json.
  • enqueue(), list_queue(), remove(), run_once(), run_forever(). Stateless daemon; crash-safe.
  • Heartbeat events on idle so you can wire Slack / Datadog telemetry.
  • CLI: shipit autopilot, shipit daemon, shipit queue {add,list,remove}. Systemd / launchd / Docker recipes in the docs.

7 new role specialists

  • Engineering: generalist-developer, debugger
  • Design: design-reviewer
  • Product: product-manager
  • Go-to-market: sales-outreach, customer-success, marketing-writer
  • Auto-applied to agents.json on import — 40 → 47 specialists total.

3 new power tools

  • computer_use — drive the local desktop (screenshots, click, type, drag, key chords). Platform backends for macOS (cliclick / osascript), Linux (xdotool / scrot), Windows (PowerShell). Graceful install hints when a dep is missing.
  • hubspot_ops — HubSpot CRM v3 REST wrapper. Search / get / create contacts, companies, deals; attach notes. Auth via HUBSPOT_TOKEN env.
  • research_brief — one-call research primitive. Web search + top-page skim + numbered citations. No API key (DuckDuckGo HTML). Optional deep=True fetches each source page for richer summaries.

Notebooks — 37 through 44

  • 37 — Autopilot quickstart.
  • 38 — Live streaming (autopilot.stream(), render_stream, custom heartbeats).
  • 39 — Persistence, resume, scheduler daemon.
  • 40 — Developer / Debugger / Researcher specialists.
  • 41 — Design / PM / Sales / CS / Marketing specialists.
  • 42: computer_use / hubspot_ops / research_brief tools.
  • 43 — Fan-out · Critic · Artifacts.
  • 44: The Complete Tour — every feature end-to-end in one notebook.

All notebooks use build_llm_from_env("bedrock") — default is Bedrock Llama 4 Scout, matches the 01–36 series.

Other changes

  • AutopilotResult grew artifacts: list[dict] and critic_verdict: dict. Existing fields unchanged.
  • Autopilot accepts critic=True | Critic(...) and artifacts=True | ArtifactCollector(...).
  • Fan-out helpers (_scale_budget, _slug, _rollup_status) are pure functions — re-use them in your own dispatchers.
  • 39 new tests (test_autopilot_artifacts.py, test_autopilot_critic.py, test_autopilot_fanout.py).

Second half — CostRouter, non-blocking ask_user, vision, sandbox, specialists-as-developers

The second half of v1.0.6 adds four more primitives that compose with Autopilot and an overhaul of the specialist tool presets so every role can actually execute code.

CostRouter — tiered LLM routing

  • shipit_agent.routing.CostRouter — drop-in LLM adapter that classifies each turn as easy / medium / hard and routes to the cheapest adequate model.
  • Heuristic classifier (classify_difficulty) with no extra LLM call — hard-keyword list + length thresholds + code-fence detection, tuned from real agent traces. Pass difficulty_fn=... to swap in your own oracle.
  • Tier(llm, price_per_1k, name) — wrap any shipit_agent LLM; price only drives the report, never the routing decision.
  • SpendReport — tier counts, estimated spend, "would-have-been" spend on hardest tier, and savings_pct. Populated live as the runtime calls complete() / stream().
  • force_tier=DifficultyTier.HARD for audits; fallback to MEDIUM when a classifier raises. Runs never die on classification errors.
  • Typical savings on 24h runs: 50–70%.
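A heuristic classifier of the kind described above (keywords, length thresholds, code-fence detection, fallback to medium on error) might look like the following. The keyword list and thresholds are invented for illustration, and `classify` and `route` are hypothetical names rather than the CostRouter API.

```python
HARD_KEYWORDS = ("refactor", "prove", "architecture", "debug")  # assumed list
FENCE = "`" * 3


def classify(prompt, medium_chars=400, hard_chars=2000):
    """Return "easy", "medium", or "hard" without calling an LLM."""
    text = prompt.lower()
    if len(prompt) > hard_chars or any(word in text for word in HARD_KEYWORDS):
        return "hard"
    if FENCE in prompt or len(prompt) > medium_chars:
        return "medium"
    return "easy"


def route(prompt, tiers, difficulty_fn=classify):
    """Pick a tier; fall back to medium if the classifier raises."""
    try:
        tier = difficulty_fn(prompt)
    except Exception:
        tier = "medium"
    return tiers[tier]


tiers = {"easy": "small-model", "medium": "mid-model", "hard": "large-model"}
print(route("What time is it in Tokyo?", tiers))
print(route("Refactor this module into three services", tiers))
```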

Non-blocking ask_user_async

  • New tool ask_user_async pauses an Autopilot run cleanly — does not block the loop.
  • File-based side channel at ~/.shipit_agent/askuser/<run_id>.json. Atomic rename on every write; crash-safe.
  • Autopilot integration — on every iteration, if a question is pending on the channel, the run halts with new status awaiting_user. resume() returns immediately while the channel is still pending; once answered, the loop continues.
  • shipit answer <run_id> "..." — CLI to reply. --index N targets a specific question; running shipit answer <run_id> with no text lists pending + answered history.
  • Multiple outstanding questions are supported; write_answer targets the latest by default.
  • SHIPIT_ASKUSER_DIR env redirects the channel (useful for tests and containerized runs).
  • Safe against path traversal — run_id is slugged before becoming a filename.
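The two safety properties above (slugged run_id, atomic rename on every write) are a common pattern that can be sketched with the standard library. `slug` and `write_channel` are hypothetical helpers, and the exact slug rule is an assumption.

```python
import json
import os
import re
import tempfile


def slug(run_id):
    """Reduce run_id to filename-safe characters (assumed rule)."""
    return re.sub(r"[^A-Za-z0-9_.-]+", "-", run_id).strip(".-") or "run"


def write_channel(directory, run_id, payload):
    """Write JSON to a temp file, then atomically rename it into place."""
    os.makedirs(directory, exist_ok=True)
    target = os.path.join(directory, slug(run_id) + ".json")
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(payload, handle)
    os.replace(tmp_path, target)  # readers see the old or new file, never half
    return target


channel_dir = tempfile.mkdtemp()
path = write_channel(channel_dir, "../../etc/passwd", {"question": "Deploy now?"})
print(os.path.basename(path))
```

Creating the temp file in the same directory as the target keeps the rename on one filesystem, which is what makes `os.replace` atomic.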

Vision feedback on computer_use

  • Every screenshot action now embeds the PNG's base64 bytes + media_type in result metadata, so a vision-capable LLM can actually reason over the captured image instead of just reading a file path.
  • 4 MB cap — larger PNGs set vision=False + a vision_skip_reason; no context-window blow-ups.
  • Opt-out via vision=False kwarg on the tool call.
  • Read errors surface in vision_skip_reason rather than raising.

Docker sandbox on code_execution

  • sandbox=True on run_code runs the snippet inside an ephemeral container with --network none, a --read-only root filesystem, and a writable 64 MB /tmp tmpfs.
  • network=True opts back into bridge networking (rarely needed — isolation is the point).
  • image=... overrides the per-language default image.
  • Default images: python:3.11-slim, node:22-alpine, ruby:3.3-alpine, alpine:3.20 for shells, plus typescript / php / perl / lua / r. Override via SANDBOX_IMAGES.
  • Workspace mounted read-only at /work; snippet can read but not modify host files.
  • workspace_root kwarg points the tool at any user-chosen directory — per-call override of the shared default. Works in both sandbox and non-sandbox modes.
  • Graceful fallback when Docker isn't installed — returns metadata={"ok": False, ...} + a clear install hint. Runs never crash.
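The isolation flags listed above map directly onto standard `docker run` options. The sketch below only assembles the argument list and never launches a container; `sketch_sandbox_command` is a hypothetical name, and the real build_sandbox_command may order or extend the flags differently.

```python
def sketch_sandbox_command(image, snippet_cmd, workspace_root, network=False):
    """Assemble a docker run argv for an ephemeral, locked-down container."""
    cmd = ["docker", "run", "--rm"]
    if not network:
        cmd += ["--network", "none"]           # no network unless opted in
    cmd += [
        "--read-only",                          # read-only root filesystem
        "--tmpfs", "/tmp:rw,size=64m",          # small writable scratch space
        "-v", f"{workspace_root}:/work:ro",     # workspace mounted read-only
        "-w", "/work",
        image,
    ]
    return cmd + list(snippet_cmd)


cmd = sketch_sandbox_command(
    "python:3.11-slim", ["python", "-c", "print(1)"], "/home/me/project"
)
print(" ".join(cmd))
```

Omitting `--network none` leaves Docker on its default bridge network, which matches the network=True opt-in described above.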

Specialists that run + test code

  • All seven role specialists (generalist-developer, debugger, design-reviewer, product-manager, sales-outreach, customer-success, marketing-writer) now ship with run_code + ask_user_async in their tool list.
  • Developer + debugger also keep bash + run_tests; designer gains computer_use; PM + sales + marketing gain research_brief; CS keeps hubspot_ops + gmail + slack.
  • Every run_code call accepts workspace_root so the user points the specialist at their project.

Notebook 45 + 3 new doc pages

  • Notebook 45 — 34 cells: 45_cost_router_async_ask_vision_sandbox.ipynb. Covers routing, async ask, vision, sandbox, workspace override, composed live streaming (critic + artifacts + router together), JSONL stream, parallel fan-out stream, and a specialist-as-developer example.
  • routing/cost-router.md — full CostRouter guide, custom classifier recipe, force_tier override, SpendReport schema.
  • autopilot/ask-user-async.md — side-channel anatomy, CLI + programmatic answer path, prompt-design rules.
  • tools/code-execution-sandbox.md — per-language image table, workspace_root use case, Docker-missing fallback.
  • tools/computer-use.md — new "Vision feedback" section with the metadata["vision"] contract and opt-out.

Tests

  • 58 new tests (27 test_cost_router.py, 14 test_askuser_async.py, 5 test_computer_use_vision.py, 12 test_code_execution_sandbox.py).
  • Grand total: 863 tests, 0 failures.

New / changed public surface

| Symbol | Where |
| --- | --- |
| CostRouter, Tier, SpendReport, DifficultyTier, classify_difficulty | shipit_agent.routing |
| ask_question, write_answer, pending_questions, all_entries, channel_file, channel_dir, clear | shipit_agent.askuser_channel |
| AskUserAsyncTool | shipit_agent.tools.ask_user_async |
| AutopilotResult.status == "awaiting_user" | shipit_agent.autopilot.result |
| CodeExecutionTool.run(..., sandbox=True, network=True, image="...", workspace_root="...") | shipit_agent.tools.code_execution |
| build_sandbox_command, SANDBOX_IMAGES, SANDBOX_CMDS | shipit_agent.tools.code_execution.sandbox |
| ComputerUseTool.run(..., vision=False) + metadata["vision"/"image_base64"/"media_type"] | shipit_agent.tools.computer_use |
| shipit answer <run_id> [text] [--index N] | CLI subcommand |

v1.0.5 — 2026-04-18

Prebuilt agents, multi-agent crews, notifications, and cost tracking. 40 ready-to-use agent personas across 8 categories. DAG-based ShipCrew orchestration with sequential, parallel, and hierarchical modes. Slack, Discord, and Telegram notification hub. Real-time cost tracking with budget enforcement. 4 new notebooks, 4 new doc pages, 153 new tests. 706 total tests. All passing.

Prebuilt Agents — 40 Ready-to-Use Personas

  • shipit_agent.agents module: AgentDefinition dataclass + AgentRegistry for loading, searching, and composing agent personas.
  • 40 agents across 8 categories: Architecture (5), Code Quality (6), Security (5), DevOps (5), Testing (5), Planning (4), Research (5), Content (5).
  • AgentRegistry.default() — loads in one line. Search, browse by category, merge with project-local agents.
  • .shipit/agents/ override — drop JSON files in your project; they override built-in agents with the same ID.

ShipCrew — Multi-Agent Crew Orchestration

  • ShipCrew, ShipAgent, ShipTask — DAG-based multi-agent crews with task dependencies.
  • Three execution modes: sequential, parallel (ThreadPoolExecutor), hierarchical (LLM-driven assignment + review).
  • Template variable resolution: {output_key} in descriptions auto-resolves from upstream outputs.
  • ShipAgent.from_registry() — load crew agents from the prebuilt registry.
  • Streaming: crew.stream() yields events for crew start, task start/complete/fail, crew complete.
  • Validation — cycle detection (Kahn's algorithm), missing agent checks, unknown dependency warnings.
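Cycle detection with Kahn's algorithm, as mentioned above, works by repeatedly removing tasks that have no unmet dependencies; if some tasks are never removed, they sit on a cycle. The sketch below is a generic version over a plain dict. `topo_order` is a hypothetical helper, and it raises on unknown dependencies where the library is described as warning.

```python
from collections import deque


def topo_order(deps):
    """deps maps each task to the tasks it depends on; returns a run order."""
    indegree = {task: 0 for task in deps}
    dependents = {task: [] for task in deps}
    for task, upstream in deps.items():
        for dep in upstream:
            if dep not in deps:
                raise KeyError(f"unknown dependency: {dep}")
            indegree[task] += 1
            dependents[dep].append(task)

    ready = deque(task for task, degree in indegree.items() if degree == 0)
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for child in dependents[task]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)

    if len(order) != len(deps):
        raise ValueError("cycle detected")
    return order


print(topo_order({"research": [], "draft": ["research"], "review": ["draft"]}))
```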

Notification Hub — Slack, Discord & Telegram

  • SlackNotifier — Block Kit webhooks with color-coded severity. Zero external deps.
  • DiscordNotifier — rich embeds with inline metadata fields.
  • TelegramNotifier — Bot API with MarkdownV2 and auto-escaped special characters.
  • NotificationManager — multi-channel dispatch with min_severity and events filtering.
  • manager.as_hooks() — auto-notify on agent lifecycle events.

Cost Tracking & Budgets

  • CostTracker — real-time per-call cost tracking with 20+ model pricing table.
  • Budget(max_dollars=5.00) — enforcement with BudgetExceededError and on_cost_alert callback.
  • tracker.as_hooks() — automatic cost tracking from every LLM call.
  • Model aliases: "opus", "sonnet", "haiku" resolve to full model IDs.

Notebooks, Docs & Tests

  • 4 notebooks: Prebuilt Agents (25 cells), ShipCrew (25 cells), Notifications (27 cells), Cost Tracking (31 cells).
  • 4 doc pages: guides/prebuilt-agents.md, deep-agents/ship-crew.md, guides/notifications.md, guides/cost-tracking.md.
  • 153 new tests (553 → 706 total). 29 new source files.

v1.0.4 — 2026-04-12

Skills, tools, and runtime power-up. All 32 tool prompts rewritten with decision trees and anti-patterns. Full skill-to-tool linking for all 37 packaged skills. Automatic iteration boost for skill-driven workflows. Expanded bash allowlist (50+ commands). Streaming, chat, and project-building examples across 3 notebooks. Comprehensive docstrings across every key module. 32 skill tests. All passing.

Skills — Full Tool Linking

  • 37 skill tool bundles (up from 10) — every packaged skill now declares the built-in tools it needs. When a skill is selected, the agent auto-attaches the right tools.
  • Shared tool groups (_FILE_CORE, _CODE_CORE, _WEB_CORE) reduce duplication across bundles.
  • validate_tool_bundles() — new helper that checks every tool name in SKILL_TOOL_BUNDLES against the real builtin map.

Agent — Iteration Boost & Efficiency

  • _effective_max_iterations() — auto-boosts 4 → 8 when skills inject extra tools so skill-driven workflows can complete without cutting off early.
  • Single skill computation: run() and stream() now compute skills once and reuse (previously 3x per call).

Tool Prompts — All 32 Upgraded

Every tool's prompt.py rewritten with decision trees, anti-patterns, workflow guidance, and cross-tool coordination.

Bash Allowlist Expansion

  • 50+ safe commands added: mkdir, touch, cp, mv, echo, grep, curl, docker, kubectl, terraform, aws, go, cargo, npx, tsc, eslint, black, isort, tree, awk, cut, diff, and more.

Documentation

  • Comprehensive docstrings on agent.py, builtins.py, skills/loader.py, skills/registry.py, skills/tool_bundles.py, deep_agent/factory.py.
  • 6 tool doc pages updated with enhanced prompts.
  • Skills guide expanded with 7 real-world examples, streaming sections, chat sessions, and event type reference.
  • Notebook 27 rewritten (38 cells): streaming, chat streaming, project build, web scraping, DeepAgent chat.
  • Notebook 29 (new): DeepAgent + skills + memory + verify + reflect + sub-agents + streaming.
  • Notebook 30 (new): real-world full project build across 6 steps with 5 different skills.

Tests

  • 15 new tests (17 → 32 total): iteration boost, bundle validation, chat sessions, streaming, chat streaming, memory + skills, DeepAgent chat/stream.

v1.0.3 — 2026-04-11

Major feature release. Super RAG subsystem, DeepAgent factory (verify / reflect / goal / sub-agents), live multi-agent chat REPL (shipit chat), Agent memory cookbook, plus deep docs + notebook coverage. 521 unit tests. 19 Bedrock end-to-end smoke tests. All passing.

Super RAG

  • shipit_agent.rag subsystem — pluggable chunker + embedder + vector store + keyword store + hybrid pipeline (vector + BM25 + RRF + recency bias + rerank + context expansion).
  • rag= on every agent type — auto-wires rag_search / rag_fetch_chunk / rag_list_sources tools, augments the system prompt with citation instructions, and attaches result.rag_sources with stable [N] citation indices.
  • Adapters: DrkCacheVectorStore (pgvector over psycopg2) + lazy Chroma / Qdrant / pgvector.
  • Thread-local per-run source tracker so concurrent runs never leak citations.
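
The RRF step of the hybrid pipeline can be illustrated with the standard reciprocal rank fusion formula. This is a sketch under assumptions: `rrf_fuse` is a hypothetical name, and `k=60` is the conventional constant, not a value confirmed for shipit_agent.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each list adds 1 / (k + rank) to a document's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id for stability.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["a", "b", "c"]  # e.g. ranked output of the vector store
bm25_hits = ["b", "d", "a"]    # e.g. ranked output of the keyword store
fused = rrf_fuse([vector_hits, bm25_hits])
print(fused)  # ['b', 'a', 'd', 'c']
```

Documents that appear in both lists ("a" and "b") rise above those found by only one retriever, which is the point of fusing before the rerank stage.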

DeepAgent

  • shipit_agent.deep.DeepAgent — power-user factory bundling seven deep tools: plan_task, decompose_problem, workspace_files, sub_agent, synthesize_evidence, decision_matrix, verify_output. Guide
  • One-flag power features: verify=True, reflect=True, goal=Goal(...), rag=RAG(...), memory=AgentMemory(...).
  • agents= sub-agent delegation — plug any mix of agent types as named delegates via a built-in delegate_to_agent tool.
  • create_deep_agent() functional helper — auto-wraps plain Python callables as tools.
  • Nested event streaming — sub-agent events surface inside tool_completed.metadata['events'].
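
Auto-wrapping a plain callable as a tool might look roughly like the sketch below, which derives a schema from the function signature. `callable_to_schema` and the type mapping are assumptions for illustration; the actual behaviour of create_deep_agent() is not shown in this changelog.

```python
import inspect

# Assumed mapping from Python annotations to JSON Schema type names.
_JSON_TYPES = {int: "integer", float: "number", str: "string", bool: "boolean"}


def callable_to_schema(fn) -> dict:
    """Build a function-style tool schema from a callable's signature."""
    properties, required = {}, []
    for name, param in inspect.signature(fn).parameters.items():
        # Unannotated or unknown types fall back to "string" in this sketch.
        properties[name] = {"type": _JSON_TYPES.get(param.annotation, "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {
        "type": "function",
        "function": {
            "name": fn.__name__,
            "description": inspect.getdoc(fn) or "",
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


def add(a: int, b: int = 0) -> int:
    """Add two integers."""
    return a + b


schema = callable_to_schema(add)
print(schema["function"]["name"], schema["function"]["parameters"]["required"])
```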

Live chat REPL

  • shipit chat — modern multi-agent terminal REPL. Switch agent types live, index files mid-session, save/load conversations, toggle reflect/verify, inspect tools and sources. Guide
  • Rich slash commands: /agent, /agents, /tools, /sources, /index, /rag, /goal, /reflect, /verify, /history, /save, /load, /reset, /info, …
  • Pluggable LLM provider via --provider; persistent sessions via --session-dir.

Streaming

  • DeepAgent.stream() covers every execution mode (direct, verified, reflective, goal-driven, sub-agent delegation).
  • PersistentAgent.stream() added with per-step checkpointing.
  • rag_sources event type added — emitted after every RAG-backed run.

Memory

  • Dedicated Agent → Memory cookbook explaining the two memory systems (memory_store= for the LLM's memory tool vs AgentMemory for application-curated profiles). Guide
  • DeepAgent auto-hydration: memory=AgentMemory(...) seeds the inner agent's history from the conversation summary.
  • Notebook 26 — runnable end-to-end tour.

Docs

  • New Agent section (7 pages): Overview, Examples, Streaming, With RAG, With Tools, Memory, Sessions.
  • New Super RAG section (7 pages): Overview, Standalone, Files & Chunks, With Agent, With Deep Agents, Adapters, API.
  • New DeepAgent page. Reference
  • Parameters Reference — every constructor parameter for every agent type and key class. Reference
  • Updated Architecture + Model Adapters reference pages.
  • Updated quickstart with Agent / Deep Agent / RAG sections.
  • Updated FAQ with "Agent types — which one should I use?".
  • 5 new notebooks (22–26): RAG basics, RAG + Agent, RAG + Deep Agents, DeepAgent chat, Agent memory.
  • Full-width docs layout + collapsible TOC with floating toggle, persistence via localStorage.

Build

  • shipit-chat script entry point.
  • Granular extras: rag, rag-openai, rag-cohere, rag-chroma, rag-qdrant, rag-pgvector, rag-drk-cache, rag-pdf, rag-docx, rag-rerank-cohere, rag-rerank-cross-encoder, plus bedrock, google, groq, together, ollama. The all extra bundles everything.

Fixed

  • Tool schema format bug: RAGSearchTool, RAGFetchChunkTool, RAGListSourcesTool, WebhookPayloadTool now use the wrapped {"type": "function", "function": {...}} shape. Previously they returned flat dicts, which Bedrock's Converse API rejected with empty-name validation errors. A new regression test scans every tool for Bedrock compatibility.
  • memory=AgentMemory type coercion: DeepAgent and GoalAgent no longer auto-assign AgentMemory.knowledge (a SemanticMemory) into memory_store= (which expects a MemoryStore). memory= now only seeds history; users pass memory_store= explicitly for the runtime's memory tool.
  • Agent.with_builtins(tools=[...]) keyword collision — the method now accepts and merges user tools= with the builtin catalogue (last-write-wins on name collision).
  • AgentDelegationTool streaming — uses inner agent's stream() and packs events into tool_completed.metadata['events'].
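
The difference between the flat and wrapped tool schema shapes from the first fix can be sketched as follows. `wrap_schema` and `is_wrapped` are hypothetical helpers; the real regression test may check more than this.

```python
def wrap_schema(flat: dict) -> dict:
    """Wrap a flat {"name", "description", "parameters"} dict in the function envelope."""
    return {"type": "function", "function": dict(flat)}


def is_wrapped(schema: dict) -> bool:
    """Check for the envelope and a non-empty function name."""
    fn = schema.get("function")
    return (
        schema.get("type") == "function"
        and isinstance(fn, dict)
        and bool(fn.get("name"))
    )


flat = {
    "name": "rag_search",
    "description": "Search indexed sources.",
    "parameters": {"type": "object", "properties": {}},
}

print(is_wrapped(flat))               # the flat shape has no envelope
print(is_wrapped(wrap_schema(flat)))  # the wrapped shape passes the check
```

A scan over every registered tool with a check like `is_wrapped` is one way such a compatibility regression test could be structured.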

Test coverage

  • 521 unit tests (up from 285) — green.
  • 19 end-to-end Bedrock smoke tests in scripts/smoke_bedrock_e2e.py cover every public surface against real Bedrock.

v1.0.2 — 2026-04-10

Major feature release. Deep agents, structured output, pipelines, agent teams, advanced memory, output parsers, and runtime power features. 285 tests. 12 examples. 8 notebooks. 13 new doc pages.

Deep Agents

  • GoalAgent — Autonomous goal decomposition with success criteria, streaming, and .with_builtins(). Guide
  • ReflectiveAgent — Self-evaluation with quality scores and revision loop. Guide
  • Supervisor / Worker — Hierarchical delegation with quality review. Guide
  • AdaptiveAgent — Runtime tool creation from Python code. Guide
  • PersistentAgent — Checkpoint and resume across sessions. Guide
  • Channel / AgentMessage — Typed agent-to-agent communication. Guide
  • AgentBenchmark — Systematic agent testing framework. Guide
  • Deep Agents API Reference — Full constructor, method, and return type docs. Reference

Structured Output & Parsers

  • output_schema on Agent.run() — Pydantic models + JSON schemas. Guide
  • JSONParser, PydanticParser, RegexParser, MarkdownParser. Guide

Composition

  • Pipeline — Sequential, parallel, conditional, function steps, streaming. Guide
  • AgentTeam — LLM-routed multi-agent coordination with streaming. Guide

Advanced Memory

  • ConversationMemory — buffer/window/summary/token strategies. Guide
  • SemanticMemory — Embedding-based vector search. Guide
  • EntityMemory — Track people, projects, concepts. Guide
  • AgentMemory — Unified interface with .default(). Guide

Runtime Power Features

  • Parallel tool execution. Guide
  • Graceful tool failure. Guide
  • Context window management. Guide
  • Hooks & middleware. Guide
  • Mid-run re-planning. Guide
  • Async runtime. Guide
  • Transient error auto-retry (429/500/503).

Changed

  • Selective memory storage (breaking) — Only persist=True tool results stored.
  • Safer retry defaults: (ConnectionError, TimeoutError, OSError) instead of (Exception,).
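
The effect of narrowing the retry tuple can be shown with a small retry helper. This is a sketch: `retry` and its parameters are hypothetical and do not reproduce the runtime's own retry API.

```python
import time


def retry(fn, retry_on=(ConnectionError, TimeoutError, OSError), attempts=3, delay=0.0):
    """Call fn, retrying only on the listed transient exception types."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last transient error
            time.sleep(delay)


calls = {"flaky": 0, "buggy": 0}


def flaky():
    """Fails twice with a transient error, then succeeds."""
    calls["flaky"] += 1
    if calls["flaky"] < 3:
        raise ConnectionError("transient")
    return "ok"


def buggy():
    """Raises a programming error that should never be retried."""
    calls["buggy"] += 1
    raise ValueError("programming error")


print(retry(flaky), calls["flaky"])
```

With `(Exception,)` as the tuple, `buggy` would be called three times before failing; with the narrower default it fails on the first call, so genuine bugs surface immediately.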

v1.0.1 — 2026-04-09

Maintenance release. Bug fix in the tool runner plus repo hygiene, contributor experience, and CI hardening. Strongly recommended upgrade from 1.0.0 if you use Bedrock gpt-oss-120b.

Fixed

  • ToolRunner argument collision — Fixed TypeError: got multiple values for argument 'context' when an LLM (notably bedrock/openai.gpt-oss-120b-1:0) emits context as a tool-call argument. The runner now strips reserved argument names (context, self) from tool-call arguments before forwarding. Affects every built-in tool.
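
The collision and its fix can be reproduced in a few lines. In this sketch `RESERVED`, `strip_reserved`, and `echo_tool` are illustrative names, not the runner's actual internals.

```python
# Names the runner passes itself; LLM-supplied values for them must be dropped.
RESERVED = {"context", "self"}


def strip_reserved(arguments: dict) -> dict:
    """Remove reserved names from LLM-emitted tool-call arguments."""
    return {key: value for key, value in arguments.items() if key not in RESERVED}


def echo_tool(context, text: str) -> str:
    """Toy tool that receives a runtime-owned context plus one real argument."""
    return f"{context['run_id']}: {text}"


# The model invented a "context" argument alongside the real one.
llm_args = {"text": "hi", "context": "model-invented value"}

result = echo_tool({"run_id": "r1"}, **strip_reserved(llm_args))
print(result)  # r1: hi
```

Forwarding `llm_args` unfiltered would raise `TypeError: got multiple values for argument 'context'`, which is the error described above.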

Added

  • CHANGELOG.md at repo root in Keep a Changelog format
  • CONTRIBUTING.md with dev setup, commit conventions, PR checklist, and "how to add a new LLM adapter / tool" guides
  • GitHub issue templates — structured bug report, feature request, and config forms
  • PR template with 12-item verification checklist
  • Test CI: pytest -q on Python 3.11 + 3.12 × Ubuntu + macOS (4 matrix cells), with a smoke test of all 11 LLM adapter imports
  • Gitleaks secret scanning CI with SARIF upload to GitHub Security tab, inline PR comments, Actions summary
  • Pre-commit hooks — trailing whitespace, EOF fixer, YAML/TOML validation, gitleaks v8.21.2, ruff lint + format
  • Gitleaks allowlist for runtime tool outputs (scraped HTML contains false-positive "API keys" like Pushly domainKeys)

Changed

  • .gitignore rewritten to dedupe entries and cover all runtime directories (site/, .eggs/, pip-wheel-metadata/)
  • Runtime tool outputs untracked from git (sessions/, traces/, memory.json, .shipit_notebooks/**) — they were accidentally committed in 1.0.0

Security

  • Added CI and pre-commit secret scanning to prevent future credential leaks
  • No runtime code changed — shipit_agent/ module is byte-identical to 1.0.0

v1.0.0 — 2026-04-09

First stable release. Focused on making the agent loop observable, interchangeable, and out of the way.

🧠 Live reasoning / thinking events

  • LLMResponse.reasoning_content field added to carry thinking/reasoning blocks from any provider
  • New _extract_reasoning() helper handles three shapes:
    • Flat reasoning_content on the response message (OpenAI o-series, gpt-oss, DeepSeek R1, Anthropic via LiteLLM)
    • Anthropic thinking_blocks[*].thinking (Claude extended thinking)
    • model_dump() fallback for pydantic dumps
  • Runtime emits reasoning_started + reasoning_completed events whenever reasoning content is non-empty
  • All three LLM adapters (OpenAIChatLLM, AnthropicChatLLM, LiteLLMChatLLM / BedrockChatLLM) share the extraction helper
  • OpenAIChatLLM auto-passes reasoning_effort="medium" for reasoning-capable models (o1*, o3*, o4*, gpt-5*, deepseek-r1*)
  • AnthropicChatLLM supports thinking_budget_tokens=N to enable Claude extended thinking
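
The three shapes handled by the extraction helper can be sketched as below. This is an assumption-laden illustration: `extract_reasoning` is a hypothetical stand-in for `_extract_reasoning()`, and the order in which the real helper checks the shapes is not documented here.

```python
from typing import Any, Optional


def extract_reasoning(message: Any) -> Optional[str]:
    """Return reasoning text from a response message, or None if absent."""
    # Pydantic-style objects are converted to plain dicts first.
    if hasattr(message, "model_dump"):
        message = message.model_dump()
    if not isinstance(message, dict):
        return None
    # Flat reasoning_content on the message.
    flat = message.get("reasoning_content")
    if flat:
        return flat
    # Anthropic-style thinking_blocks[*].thinking.
    blocks = message.get("thinking_blocks") or []
    parts = [block.get("thinking", "") for block in blocks if isinstance(block, dict)]
    joined = "\n".join(part for part in parts if part)
    return joined or None


print(extract_reasoning({"reasoning_content": "compare both options"}))
print(extract_reasoning({"thinking_blocks": [{"thinking": "a"}, {"thinking": "b"}]}))
```

Returning None for empty content matches the note above that reasoning events fire only when reasoning content is non-empty.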

⚡ Truly incremental streaming

  • agent.stream() now runs the agent on a background daemon thread
  • Events are pushed through a thread-safe queue.Queue as they're emitted
  • Consumer loop yields events the instant they happen — no buffering, no batched delivery
  • Worker exceptions are captured and re-raised on the consumer thread
  • Works in Jupyter, VS Code, JupyterLab, WebSocket/SSE transports, and plain terminals
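
The thread-plus-queue pattern described above can be sketched with the standard library alone. `stream_events` and `fake_run` are hypothetical names; the sketch assumes the agent run accepts an `emit` callback, which may not match the real runtime's interface.

```python
import queue
import threading
from typing import Callable, Iterator

_DONE = object()  # sentinel marking the end of the run


def stream_events(run: Callable[[Callable[[dict], None]], None]) -> Iterator[dict]:
    """Run `run(emit)` on a daemon thread and yield events as they are emitted."""
    events: queue.Queue = queue.Queue()

    def worker() -> None:
        try:
            run(events.put)
        except Exception as exc:
            events.put(exc)  # hand the failure to the consumer thread
        finally:
            events.put(_DONE)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = events.get()  # blocks until the worker emits something
        if item is _DONE:
            return
        if isinstance(item, Exception):
            raise item  # re-raise worker exceptions on the consumer side
        yield item


def fake_run(emit: Callable[[dict], None]) -> None:
    """Stand-in for an agent run that emits two events."""
    emit({"type": "run_started"})
    emit({"type": "run_completed"})


for event in stream_events(fake_run):
    print(event["type"])
```

Because the consumer blocks on `Queue.get()`, each event is yielded as soon as the worker emits it rather than after the whole run finishes.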

🛡️ Bulletproof Bedrock tool pairing

  • Planner output is now injected as a user-role context message rather than an orphan role="tool" message — fixes Bedrock's "number of toolResult blocks exceeds number of toolUse blocks" error
  • Every response.tool_calls entry gets a tool-result message unconditionally:
    • Success → real tool-result
    • Retry → retries first, then final result or error
    • Unknown tool → synthetic "Error: tool X is not registered" tool-result
  • Stable call_{iteration}_{index} tool_call_ids round-trip through message metadata
  • Multi-iteration tool loops on Bedrock Claude, gpt-oss, and Anthropic native now work without modify_params band-aids
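
The pairing rule (one tool-result per tool call, no exceptions) can be sketched as follows. `pair_tool_results` and the message dict layout are assumptions for illustration, the retry branch is omitted, and the zero-based index in the id is a guess at the `call_{iteration}_{index}` convention.

```python
def pair_tool_results(tool_calls: list[dict], registry: dict, iteration: int) -> list[dict]:
    """Produce exactly one tool-result message per tool call."""
    messages = []
    for index, call in enumerate(tool_calls):
        call_id = f"call_{iteration}_{index}"
        tool = registry.get(call["name"])
        if tool is None:
            # Unknown tool: emit a synthetic result instead of skipping the call.
            content = f"Error: tool {call['name']} is not registered"
        else:
            try:
                content = str(tool(**call.get("arguments", {})))
            except Exception as exc:
                content = f"Error: {exc}"
        messages.append({"role": "tool", "tool_call_id": call_id, "content": content})
    return messages


registry = {"add": lambda a, b: a + b}
calls = [
    {"name": "add", "arguments": {"a": 1, "b": 2}},
    {"name": "missing", "arguments": {}},
]
results = pair_tool_results(calls, registry, iteration=1)
for message in results:
    print(message["tool_call_id"], message["content"])
```

Keeping the result count equal to the call count is what avoids the toolResult/toolUse mismatch error quoted above.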

🔑 Zero-friction provider switching

  • build_llm_from_env() walks upward from CWD to discover .env, so notebooks and scripts work regardless of where they're launched from
  • Eight providers: openai, anthropic, bedrock, gemini, vertex, groq, together, ollama, plus a generic litellm provider
  • Per-provider credential validation with clear error messages
  • SHIPIT_OPENAI_TOOL_CHOICE=required env var to force tool use on lazy models like gpt-4o-mini
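
The upward .env discovery mentioned for build_llm_from_env() can be sketched with pathlib. `find_dotenv` here is a hypothetical helper written for illustration, not the library's function.

```python
from pathlib import Path
from typing import Optional


def find_dotenv(start: Optional[Path] = None) -> Optional[Path]:
    """Walk from `start` (default: CWD) up to the filesystem root looking for .env."""
    current = (start or Path.cwd()).resolve()
    for directory in (current, *current.parents):
        candidate = directory / ".env"
        if candidate.is_file():
            return candidate  # the nearest .env wins
    return None


print(find_dotenv())
```

Because the nearest file wins, a notebook launched from a nested directory picks up the project's .env without any path configuration.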

🌐 In-process Playwright for open_url

  • OpenURLTool now uses Playwright's sync Chromium directly (headless, realistic desktop Chrome UA, 1280×800 viewport)
  • Handles JS-rendered pages, anti-bot 503s, modern TLS/ALPN
  • Stdlib urllib fallback when Playwright is not installed — zero third-party HTTP dependencies in the core fallback path
  • Errors never raise out of the tool: they return as ToolOutput with a warnings list in metadata
  • Rich metadata: fetch_method, status_code, final_url, title

🔍 Upgraded ToolSearchTool

  • Replaced binary substring match with drk_cache-style fuzzy scoring: SequenceMatcher.ratio() + 0.12 × token_hits
  • Configurable limit parameter, clamped to [1, max_limit]
  • New init kwargs: max_limit, default_limit, token_bonus
  • Structured error output for empty queries
  • Ranked output with scores and "when to use" hints from prompt_instructions
  • Noise filter: results below score=0.05 dropped
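
The scoring formula, limit clamp, and noise filter above can be combined into a short sketch. `score` and `search` are hypothetical names, and counting a token hit as "query token appears as a substring of the tool text" is an assumption about how token_hits is defined.

```python
from difflib import SequenceMatcher


def score(query: str, text: str, token_bonus: float = 0.12) -> float:
    """SequenceMatcher.ratio() plus a bonus per query token found in the text."""
    q, t = query.lower(), text.lower()
    token_hits = sum(1 for token in q.split() if token in t)
    return SequenceMatcher(None, q, t).ratio() + token_bonus * token_hits


def search(query: str, tools: dict, limit: int = 5, max_limit: int = 10,
           min_score: float = 0.05) -> list:
    """Rank tools by score, drop noise below min_score, clamp limit to [1, max_limit]."""
    limit = max(1, min(limit, max_limit))
    ranked = sorted(((score(query, desc), name) for name, desc in tools.items()),
                    reverse=True)
    return [(name, s) for s, name in ranked if s >= min_score][:limit]


tools = {
    "open_url": "open url and fetch a web page",
    "bash": "run a shell command",
    "web_search": "search the web",
}
print(search("open url", tools)[0][0])
```

The token bonus lets an exact keyword hit outrank a description that merely shares many characters with the query.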

🪵 Full event taxonomy

14 distinct event types with documented payloads:

run_started, mcp_attached, planning_started, planning_completed, step_started, reasoning_started, reasoning_completed, tool_called, tool_completed, tool_retry, tool_failed, llm_retry, interactive_request, run_completed

🔁 Iteration-cap summarization fallback

  • If the model is still calling tools when max_iterations is reached, the runtime gives it one more turn with tools=[] to force a natural-language summary
  • run_completed is never empty for normal runs
  • Guarded with try/except so summarization failures can't mask the rest of the run
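
The fallback can be sketched with a toy loop. `run_loop` and `tool_hungry_llm` are hypothetical, the LLM is modelled as a plain function returning a dict, and tool execution is omitted to keep the focus on the cap.

```python
def run_loop(llm, tools: list, max_iterations: int = 3) -> str:
    """Loop until the model stops calling tools; force a summary at the cap."""
    messages = []
    for _ in range(max_iterations):
        reply = llm(messages, tools)
        if not reply.get("tool_calls"):
            return reply["content"]
        messages.append(reply)  # tool execution omitted in this sketch
    try:
        # Cap reached: one extra turn with no tools forces a text answer.
        return llm(messages, [])["content"]
    except Exception:
        # A failed summary must not mask the rest of the run.
        return "Stopped after reaching max_iterations without a final answer."


def tool_hungry_llm(messages: list, tools: list) -> dict:
    """Fake model that keeps calling tools for as long as any are offered."""
    if tools:
        return {"content": "", "tool_calls": [{"name": "search"}]}
    return {"content": f"Summary after {len(messages)} tool rounds."}


print(run_loop(tool_hungry_llm, ["search"], max_iterations=3))
```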

Other changes

  • pyproject.toml: [project.urls] now points to correct GitHub org, adds Documentation and Changelog links
  • .env.example: expanded with all new env vars documented
  • notebooks/04_agent_streaming_packets.ipynb: full rewrite with .env loading, credential visibility printer, and live Markdown updates
  • README.md: new v1.0 release section with 8 headline features
  • Full MkDocs Material documentation site at shipiit.github.io/shipit_agent

Breaking changes

None — this is the first stable release. Subsequent 1.x releases will maintain backward compatibility within the 1.x line.