Changelog

Name: SHIPIT Agent
Author: SHIPIT

v1.0.16 — 2026-07-10

The live experience — streaming, cancellation, and Claude-Code-grade ergonomics. Real token streaming everywhere, one-call live runs with rich tool cards, safe stops, stale-proof edits, model-written compaction, and a sharper CLI. Works with any LLM provider. 1969 tests passing (+13 new). 0 regressions.

Token streaming, everywhere

OpenAIChatLLM streams for real (was a silent TODO): tokens hit the callback as generated, tool-call fragments stitched by index, usage captured from the final chunk; gateways that ignore stream=True degrade gracefully. Lights up Gemma 4 on Bedrock mantle, Groq, and every OpenAI-compatible endpoint.
AnthropicChatLLM streams via the SDK's messages.stream helper — all existing parsing (thinking blocks, tool use, server tools, citations) unchanged.
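
The index-based stitching mentioned above can be sketched as follows. This is an illustration, not SHIPIT's implementation: `stitch_tool_calls` is a hypothetical helper, and the input is assumed to be plain dicts shaped like the `delta` payloads of an OpenAI-style chat stream.

```python
from typing import Any


def stitch_tool_calls(deltas: list[dict[str, Any]]) -> list[dict[str, str]]:
    """Merge streamed tool-call fragments into complete calls, keyed by index."""
    calls: dict[int, dict[str, str]] = {}
    for delta in deltas:
        # Text-only deltas have no "tool_calls" entry (or None) and are skipped.
        for fragment in delta.get("tool_calls") or []:
            call = calls.setdefault(
                fragment["index"], {"id": "", "name": "", "arguments": ""}
            )
            if fragment.get("id"):
                call["id"] = fragment["id"]
            function = fragment.get("function") or {}
            if function.get("name"):
                call["name"] = function["name"]
            # Argument JSON arrives in pieces; concatenate in arrival order.
            call["arguments"] += function.get("arguments") or ""
    return [calls[index] for index in sorted(calls)]


deltas = [
    {"tool_calls": [{"index": 0, "id": "call_1",
                     "function": {"name": "read_file", "arguments": '{"pa'}}]},
    {"content": "thinking..."},
    {"tool_calls": [{"index": 0, "function": {"arguments": 'th": "a.txt"}'}}]},
    {"tool_calls": [{"index": 1, "id": "call_2",
                     "function": {"name": "list_dir", "arguments": "{}"}}]},
]
stitched = stitch_tool_calls(deltas)
```

Only the first fragment of a call carries its id and name; later fragments for the same index contribute argument text only, which is why the merge is keyed on `index` rather than on `id`.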

The live experience

Agent.run_live(prompt) — tokens print as generated, tool calls render as cards with args/status/duration, a ✔ done footer closes the run; returns the final answer text.
StreamRenderer — the underlying renderer for custom loops; style="rich" (automatic on TTYs) draws Claude-Code-style ⏺/⎿ cards with ANSI colors; prints the answer at the end for non-streaming adapters.
agent.cancel() — thread-safe ESC: stops at the next checkpoint, emits run_cancelled, returns normally with metadata["cancelled"]; skipped batch tools get synthetic results so message pairing stays valid.
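
The "stop at the next checkpoint, return normally" behaviour of cancellation can be pictured with a cooperative flag. The class below is a hypothetical toy loop, not the Agent runtime; only the `metadata["cancelled"]` key is taken from the notes above.

```python
import threading


class CancellableRun:
    """Toy loop that checks a cancel flag between steps (hypothetical)."""

    def __init__(self) -> None:
        self._cancelled = threading.Event()  # safe to set from any thread

    def cancel(self) -> None:
        self._cancelled.set()

    def run(self, steps):
        outputs = []
        for step in steps:
            # Checkpoint: stop before starting the next step, never mid-step.
            if self._cancelled.is_set():
                return {"output": outputs, "metadata": {"cancelled": True}}
            outputs.append(step())
        return {"output": outputs, "metadata": {"cancelled": False}}


run = CancellableRun()


def second_step():
    run.cancel()  # simulates the user pressing ESC while step 2 is running
    return "step 2"


result = run.run([lambda: "step 1", second_step, lambda: "step 3"])
```

The step that was in flight finishes, the third step never starts, and the caller receives a normal return value instead of an exception.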

Reliability

Edit hardening — edit_file blocks when the file changed on disk after the last read_file (external modification → re-read hint) and returns a compact unified diff with every patch (metadata["diff"]).
LLM-powered compaction — near the context window, old turns are summarized by the model (decisions, facts, paths, open threads; ~300 words) with a mechanical fallback; the context_compacted event now fires reliably.
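
A rough sketch of the stale-edit guard described above. It assumes staleness is detected by comparing a content hash taken at read time (the real tool may use modification times or something else), and `EditGuard` and `StaleFileError` are hypothetical names.

```python
import difflib
import hashlib
import tempfile
from pathlib import Path


class StaleFileError(RuntimeError):
    pass


class EditGuard:
    def __init__(self) -> None:
        self._last_read: dict[Path, str] = {}

    @staticmethod
    def _digest(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def read_file(self, path: Path) -> str:
        text = path.read_text(encoding="utf-8")
        self._last_read[path] = self._digest(text)
        return text

    def edit_file(self, path: Path, old: str, new: str) -> str:
        current = path.read_text(encoding="utf-8")
        # Block when the file was never read or no longer matches that read.
        if self._last_read.get(path) != self._digest(current):
            raise StaleFileError(f"{path.name} changed since the last read; re-read it")
        if old not in current:
            raise ValueError("text to replace not found")
        updated = current.replace(old, new, 1)
        path.write_text(updated, encoding="utf-8")
        self._last_read[path] = self._digest(updated)
        diff = difflib.unified_diff(
            current.splitlines(keepends=True),
            updated.splitlines(keepends=True),
            fromfile=f"a/{path.name}", tofile=f"b/{path.name}", n=1,
        )
        return "".join(diff)  # compact unified diff of this patch


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "config.py"
    target.write_text("x = 1\ny = 2\n", encoding="utf-8")
    guard = EditGuard()
    guard.read_file(target)
    patch = guard.edit_file(target, "x = 1", "x = 2")
    target.write_text("x = 99\n", encoding="utf-8")  # simulated external edit
    try:
        guard.edit_file(target, "x = 99", "x = 3")
        blocked = False
    except StaleFileError:
        blocked = True
```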

CLI

Live StreamRenderer turns (real tokens + cards; spinner retired), inline [y]es / [n]o / [a]lways prompts for ask-gated tools (session-persistent always-allows), and --continue to resume the most recent session (~/.shipit/sessions). Fixed a --session-dir crash.

Examples & notebooks

examples/23_bedrock_model_switching.py — Gemma 4 26B ↔ gpt-oss-120B, one function, live-verified.
notebooks/71_full_test_drive.ipynb — 13 in-depth sections exercising every capability, executed end-to-end (live Bedrock cells + offline).

v1.0.15 — 2026-07-10

The Super Agent — every sector, clean logs, real deliverables. One release that makes a shipit agent useful to a finance analyst, a marketer, an engineer, a designer, a researcher, and a sales rep alike — and makes every run readable. All of it works with any LLM provider.

Sector specialists — `Agent.for_role`

One line to a specialist — Agent.for_role("finance-analyst", llm=llm) turns any of the 40+ prebuilt role definitions (finance, marketing, engineering, design, research, sales, support, HR, …) into a runnable agent: the role's prompt, its matching builtin tools, and its iteration budget.
Did-you-mean errors — unknown ids raise a ValueError listing the closest matching roles.
Deliverable-ready roles — 14 specialists (finance-analyst, marketing-writer, researcher, data-analyst, sales roles, …) now carry the new build_document tool.
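
A did-you-mean error of this kind can be built on the standard library's `difflib`. The sketch below is an assumption about the approach, not the library's code; `resolve_role` and the short role list are illustrative.

```python
import difflib

# Illustrative subset of role ids taken from the notes above.
ROLES = ["finance-analyst", "marketing-writer", "researcher", "data-analyst"]


def resolve_role(role_id: str, known=ROLES) -> str:
    if role_id in known:
        return role_id
    close = difflib.get_close_matches(role_id, known, n=3, cutoff=0.6)
    hint = f" Did you mean: {', '.join(close)}?" if close else ""
    raise ValueError(f"Unknown role {role_id!r}.{hint}")


try:
    resolve_role("finance-analist")
except ValueError as exc:
    message = str(exc)
```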

Prebuilt MCP catalog — `connect_mcp`

12 well-known servers by name — connect_mcp("github"), connect_mcp("filesystem", args=["/repo"]), connect_mcp("postgres", args=[url]), plus slack, sqlite, puppeteer, brave-search, fetch, memory, sentry, gitlab, and google-maps — each on a persistent stdio transport.
Fail-fast validation — required env vars and the launcher binary (npx/uvx) are checked before anything starts; misconfiguration is one clear message.
Resilient MCP calls — a failing MCP tool call (server down, timeout) now returns a readable tool result the model can react to instead of crashing the run. MCPStdioTransport / PersistentMCPSession aliases are exported.

Polished documents — `build_document`

Five formats — PDF reports, Excel workbooks, Word documents, PowerPoint decks, and styled HTML from one structured payload (title + sections, or sheets for Excel).
Finished, not generated — accent-colored headings, zebra-striped tables, bold frozen header rows, auto-sized columns; Excel cells starting with = become live formulas.
Optional dependencies — renderers use reportlab / openpyxl / python-docx / python-pptx and reply with the exact pip install fix when one is missing; HTML needs nothing.
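
The optional-dependency pattern above (probe for the renderer's package, answer with the exact install command when it is absent) might look like this. `require` is a hypothetical helper; the module-to-package mapping reflects the import names of the packages listed above.

```python
import importlib.util

# Import name -> pip package name for the renderers named above.
PIP_NAMES = {
    "reportlab": "reportlab",
    "openpyxl": "openpyxl",
    "docx": "python-docx",
    "pptx": "python-pptx",
}


def require(module_name: str, pip_name: str) -> None:
    """Raise a readable error with the install fix if the module is missing."""
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"{module_name} is not installed. Fix: pip install {pip_name}"
        )


require("json", "json")  # stdlib module, always present, so no error
try:
    require("surely_missing_renderer_module", "surely-missing-renderer")
    install_hint = ""
except ImportError as exc:
    install_hint = str(exc)
```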

Clean tool-call logs — `format_activity`

Claude-Code-style tool cards — format_activity(result) renders each call as ⚙ name(args) ✓ 228ms with a compact output preview and a run summary footer; format_event_line(event) does the same live for streams.
Timing built in — every AgentEvent now carries a timestamp; tool_completed / tool_failed carry the tool name and duration_ms.

Scheduled jobs — `AgentScheduler`

Cron for agents — sched.add(prompt, every=3600), at="09:00" daily, or cron="0 8 * * 1" (optional croniter); run_forever() fires jobs as they come due.
Durable jobs — pass store=SQLiteJobStore() and due times + run counts persist across restarts; a re-added job resumes its slot instead of resetting.
Production niceties — on_result callbacks, max_runs caps, session-backed runs, and injectable clock/sleep so schedules are unit-testable with zero real waiting.
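
The point of an injectable clock and sleep is that a schedule can be tested without waiting. The scheduler below is a hypothetical miniature, not `AgentScheduler`; it supports only fixed `every` intervals.

```python
import time
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    every: float      # interval in seconds
    next_due: float
    runs: int = 0


class TinyScheduler:
    def __init__(self, clock=time.monotonic, sleep=time.sleep):
        self.clock = clock
        self.sleep = sleep
        self.jobs = []

    def add(self, name, every):
        job = Job(name, every, next_due=self.clock() + every)
        self.jobs.append(job)
        return job

    def run_pending(self):
        now = self.clock()
        fired = []
        for job in self.jobs:
            if job.next_due <= now:
                job.runs += 1
                job.next_due = now + job.every
                fired.append(job.name)
        return fired

    def run_ticks(self, ticks, tick_seconds=1.0):
        fired = []
        for _ in range(ticks):
            fired.extend(self.run_pending())
            self.sleep(tick_seconds)
        return fired


class FakeTime:
    """Test double: sleeping just advances the fake clock."""

    def __init__(self):
        self.now = 0.0

    def clock(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


fake = FakeTime()
scheduler = TinyScheduler(clock=fake.clock, sleep=fake.sleep)
job = scheduler.add("hourly-report", every=3600)
fired = scheduler.run_ticks(ticks=3, tick_seconds=3600)
```

Three simulated hours pass instantly: the job is not due at time zero and fires on the second and third ticks.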

MCP, deeper — resources, prompts, streamable HTTP

Resources & prompts — server.list_resources() / read_resource(uri) and list_prompts() / get_prompt(name, args); server.resource_tool() gives the model a tool to browse/read a server's resources. Servers that don't implement them return empty lists, not errors.
Streamable HTTP transport — MCPStreamableHTTPTransport speaks the 2025 spec revision: JSON and SSE responses, Mcp-Session-Id affinity, and bearer_token= on both HTTP transports for OAuth-protected servers.

Run metrics & live-updatable events

result.summary() — wall-clock duration, iterations, token usage, and a per-tool breakdown (calls / failures / total ms) in one dict.
Correlation ids — tool_called / tool_completed / tool_failed / tool_retry share a call_id, so live UIs can update one tool card in place (running → ✓/✗) instead of appending lines.

Background subagents & context compaction

Parallel delegation — sub_agent accepts background=true (returns a task id immediately, runs on a thread pool) and collect="task-N" to fetch the result — Claude-Code-style task fan-out.
Observable compaction — when a run approaches the context window, older turns are summarized (user/assistant content included, not dropped) and a context_compacted event reports before/after message counts.

See the Super Agent guide for the full tour.

v1.0.14 — 2026-06-13

The SHIPIT Workspace — point an agent at a repo and it just works. v1.0.14 turns a repository into the agent's control surface: drop a few conventional files and every agent rooted there picks them up — no glue code. Project instructions, file-based /slash commands, and a checked-in permission/env policy load automatically, and a new TodoTool keeps long runs observable. None of it is provider-specific — it works with any LLM you pass.

Project memory — `SHIPIT.md` / `AGENTS.md`

Auto-loaded instructions — a SHIPIT.md (or AGENTS.md, or .shipit/SHIPIT.md) at the repo root is loaded into the system prompt for any Agent(llm=llm, project_root="."). A user-global ~/.shipit/SHIPIT.md applies everywhere; every matching file is included, project context first.
@path imports — instruction files pull in others with @relative/path (depth-limited, cycle-safe), resolved relative to the importing file.
Opt out / direct API — auto_project_memory=False skips the injection; load_project_memory(project_root, ...) from shipit_agent.workspace returns the assembled block yourself.
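
Depth-limited, cycle-safe imports can be sketched with a recursive loader that carries the chain of files currently being expanded. This assumes an import is written as `@relative/path` alone on a line; `load_instructions` is a hypothetical name and the real resolution rules may differ.

```python
import re
import tempfile
from pathlib import Path

IMPORT_LINE = re.compile(r"^@(\S+)[ \t]*$", re.MULTILINE)


def load_instructions(path: Path, max_depth: int = 3, _chain: tuple = ()) -> str:
    path = path.resolve()
    # Cycle guard (file already being expanded) and depth limit.
    if path in _chain or len(_chain) > max_depth:
        return ""
    text = path.read_text(encoding="utf-8")

    def include(match):
        # Imports resolve relative to the importing file, not the repo root.
        target = path.parent / match.group(1)
        if not target.is_file():
            return match.group(0)  # leave unresolved imports untouched
        return load_instructions(target, max_depth, _chain + (path,)).rstrip("\n")

    return IMPORT_LINE.sub(include, text)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "docs").mkdir()
    (root / "SHIPIT.md").write_text("Root rules\n@docs/style.md\n", encoding="utf-8")
    # style.md imports the root file again, forming a cycle.
    (root / "docs" / "style.md").write_text(
        "Style rules\n@../SHIPIT.md\n", encoding="utf-8"
    )
    merged = load_instructions(root / "SHIPIT.md")
```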

Slash commands — `.shipit/commands/`

File-based commands — drop .shipit/commands/<name>.md and run agent.run("/<name> ...args"); the body becomes the prompt with $ARGUMENTS and $1 / $2 substitution. Leading YAML frontmatter is stripped; unknown /cmd text passes through unchanged. discover_commands() / expand_command() expose the same logic.
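
The substitution rules above can be approximated in a few lines. This is a sketch under assumptions: `expand_template` is a hypothetical stand-in for the library's expansion logic, arguments are split on whitespace, and a missing positional argument expands to an empty string.

```python
import re

FRONTMATTER = re.compile(r"\A---\n.*?\n---\n?", re.DOTALL)
PLACEHOLDER = re.compile(r"\$(ARGUMENTS|\d+)")


def expand_template(body: str, raw_args: str) -> str:
    body = FRONTMATTER.sub("", body, count=1)  # strip leading YAML frontmatter
    positional = raw_args.split()

    def substitute(match):
        key = match.group(1)
        if key == "ARGUMENTS":
            return raw_args
        index = int(key) - 1  # $1 is the first argument
        return positional[index] if 0 <= index < len(positional) else ""

    # Single pass, so argument text is never re-expanded as a placeholder.
    return PLACEHOLDER.sub(substitute, body)


command_file = (
    "---\n"
    "description: Review a pull request\n"
    "---\n"
    "Review PR $1 on branch $2.\n"
    "Full request: $ARGUMENTS\n"
)
prompt = expand_template(command_file, "42 main")
```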

Settings — `.shipit/settings.json`

Declarative policy — check model, permissions (mode / allow / deny / ask), and env into the repo; a user-global ~/.shipit/settings.json merges underneath. load_settings() reads it and WorkspaceSettings.to_permission_engine() wires it into the control plane.
Agent.for_project(llm=..., project_root="/repo") — one call that loads settings → permission engine, attaches builtin tools, and enables project memory + /slash commands. Works with any LLM provider. See Agent → The SHIPIT Workspace.

`TodoTool` — live task tracking

The SHIPIT TodoWrite — a live, replace-on-write checklist (todos: {content, status} with pending / in_progress / completed) the model maintains while it works, rendered as a glyph checklist and stored on context.state["todos"] with summary metadata. Included in Agent.with_builtins() (and for_project), so long agentic runs stay observable. See Tools → TodoTool.

v1.0.12 — 2026-06-07

Claude API power — plus cross-provider prompt-cache accounting. v1.0.12 adds the Anthropic API's highest-leverage server features as first-class passthroughs — server-side tools, document citations, the Batch API, and interleaved thinking + server-side context editing — on top of v1.0.11's control plane. It also makes prompt caching honestly cross-provider: OpenAI's automatic cache reads are now surfaced for cost tracking. Each feature is honest about provider support; no public API was removed.

Server-side tools (Anthropic-hosted)

web_search(), code_execution(), computer_use(), bash(), text_editor() from shipit_agent.llms — declare them in the tools= list you pass to AnthropicChatLLM.complete(...) (mixed freely with client-side tools). They run inside Anthropic's own sandbox — zero local infrastructure, no client-side tool loop.
Beta headers handled automatically — code_execution and computer_use attach their betas and route to the beta endpoint; web_search is GA and stays on the GA endpoint.
Surfaces in metadata — LLMResponse.metadata["server_tool_use"] and ["server_tool_results"], only when present.
Provider note — these are Anthropic API shapes (also reachable for Anthropic models via Bedrock / LiteLLM); other providers use their own native server tools. See Agent → Server-side tools.

Citations & the Batch API

Citation document helpers — text_document / pdf_document / url_pdf_document / content_document from shipit_agent.llms, with citations.enabled on by default. Claude grounds its answer in the document and the cited spans are parsed into metadata["citations"] — verifiable RAG.
Batch API runtime — BatchRequest, BatchResult, and BatchRuntime.run(...) (in shipit_agent.batch) wrap Anthropic's Messages Batches API for bulk, latency-tolerant runs at roughly 50% of standard per-token price. submit / status / results / cancel are exposed too.
Provider note — Anthropic citations and Anthropic batches today; OpenAI also has a Batch API and generalising the runtime is on the roadmap. See Agent → Citations & Batch API.

Interleaved thinking & context editing

AnthropicChatLLM(interleaved_thinking=True, thinking_budget_tokens=…) — the model thinks between tool calls. The interleaved-thinking-2025-05-14 beta attaches only when both are set; metadata["thinking_blocks"] carries the signed thinking blocks for round-tripping.
context_management= — forwarded as Anthropic's context_management request param (with its beta header) so the API clears stale tool results server-side.
Provider note — extended / interleaved thinking is Anthropic; OpenAI reasoning models and Gemini thinking are the equivalents elsewhere (reasoning content is captured for all of them). See Agent → Interleaved thinking & context editing.

Cross-provider prompt caching

OpenAI cached-token surfacing — OpenAI does automatic prompt caching; shipit now reads usage.prompt_tokens_details.cached_tokens into usage["cache_read_input_tokens"] — the same key the CostTracker reads for Anthropic — so OpenAI cache reads bill at the cheaper rate. LiteLLMChatLLM forwards both shapes. Anthropic / Bedrock / Vertex keep explicit cache_control breakpoints (default on for Claude). Caching is cross-provider, not Anthropic-only. See Agent → Prompt caching.
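
The usage normalization described above amounts to copying one nested field to a shared key. A minimal sketch, with `normalize_usage` as a hypothetical helper operating on plain dicts:

```python
def normalize_usage(usage: dict) -> dict:
    """Copy OpenAI's cached-token count to the key used for Anthropic usage."""
    normalized = dict(usage)
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens")
    if cached is not None and "cache_read_input_tokens" not in normalized:
        normalized["cache_read_input_tokens"] = cached
    return normalized


openai_usage = {
    "prompt_tokens": 1200,
    "completion_tokens": 80,
    "prompt_tokens_details": {"cached_tokens": 1024},
}
unified = normalize_usage(openai_usage)
```

With both providers reporting under the same key, a single cost calculation can price cache reads without provider-specific branches.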

v1.0.11 — 2026-06-07

A control plane for tool calls — plus prompt caching and a memory tool. v1.0.11 brings the Claude Code permission layer to the library: declarative allow/deny/ask rules, read-only plan mode, human-in-the-loop callbacks, and blocking/rewriting hooks. It also turns on Anthropic prompt caching by default for Claude-family models and adds an Anthropic-style memory tool. Folded in is the 1.0.10 bug-fix & hardening work. No public API was removed.

Control plane — permissions, plan mode & hooks

Agent(permission_mode=...) — "default", "acceptEdits", "plan", or "bypass", mirroring Claude Code. plan makes a run read-only; acceptEdits auto-approves file edits.
PermissionEngine(allow=[...], deny=[...], ask=[...]) — fnmatch globs on tool name with a predictable precedence: deny > mode > allow > ask > callback > default. Pass it as permissions= (also accepts a bare mode string or a kwargs dict).
agent.plan(prompt) — one-call read-only planning: the agent may use read-only tools and writes a step-by-step plan instead of acting.
permission_callback(name, args) -> PermissionResult | None — programmatic human-in-the-loop approval, consulted on ask rules and as a catch-all.
Blocking & rewriting hooks — @hooks.on_before_tool may return {"decision": "deny"} to block a call or a PermissionResult with updated_arguments to rewrite it; @hooks.on_user_prompt can redact the incoming prompt. Hooks remain observe-only when they return None.
Denied calls are visible — a blocked tool emits a tool_denied event and feeds the model a was NOT run tool message so it can recover.
New top-level exports: PermissionEngine, PermissionResult, PermissionDecision. See Agent → Permissions, plan mode & blocking hooks.
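
The precedence chain (deny > mode > allow > ask > callback > default) can be read as an ordered series of early returns. The function below is a hypothetical model, not `PermissionEngine`: the read-only tool set, the handling of `plan` and `bypass`, and the final default of "allow" are all assumptions, and `acceptEdits` is left out for brevity.

```python
from fnmatch import fnmatchcase

READ_ONLY_TOOLS = {"read_file", "grep"}  # assumed set consulted by "plan" mode


def decide(tool, *, deny=(), allow=(), ask=(), mode="default", callback=None):
    def matches(patterns):
        return any(fnmatchcase(tool, pattern) for pattern in patterns)

    if matches(deny):                                   # 1. deny rules win
        return "deny"
    if mode == "bypass":                                # 2. mode
        return "allow"
    if mode == "plan" and tool not in READ_ONLY_TOOLS:
        return "deny"
    if matches(allow):                                  # 3. allow rules
        return "allow"
    if matches(ask):                                    # 4. ask rules
        return "ask"
    if callback is not None:                            # 5. callback catch-all
        verdict = callback(tool)
        if verdict is not None:
            return verdict
    return "allow"                                      # 6. default (assumed)
```

Because deny is checked first, a deny rule holds even in bypass mode, and because mode is checked before allow, a plan-mode run stays read-only even when an allow glob matches a write tool.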

Prompt caching

On by default for Claude-family models. AnthropicChatLLM(prompt_caching=True) and LiteLLMChatLLM(prompt_caching=True) cache the stable prefix (system prompt + tools) on Anthropic, Bedrock, and Vertex via cache_control breakpoints.
Cost-aware — usage["cache_read_input_tokens"] and usage["cache_creation_input_tokens"] flow into CostTracker; cache reads bill at roughly 10% of input. See Agent → Prompt caching.

Claude-style memory tool

ClaudeMemoryTool — Anthropic's memory_20250818 tool shape: a single command-driven tool (view / create / str_replace / insert / delete / rename) over a sandboxed memory directory (.shipit_workspace/memories by default) for cross-session learning. Attach via Agent(tools=[ClaudeMemoryTool(...)]). See Tools → Claude-style memory tool.

Hardening (folded in from 1.0.10)

text_delta_callback regression (v1.0.9) fixed — the runtime now inspects the adapter signature and only passes the callback to adapters that accept it, so every custom adapter works unchanged.
Multi-turn sessions no longer stack duplicate system prompts when reusing a session_store + session_id.
Tool security — Bash rejects command/process substitution and redirection; open_url is http(s)-only and blocks SSRF targets; the SQL read-only guard scans the whole statement and rejects stacked statements; OAuth validates the CSRF state nonce; edit_file refuses non-UTF-8 files; FileCredentialStore chmods 0600 and writes atomically.
Reliability — MCP transports close on error; parallel tools get isolated state; the iteration-cap turn is now accounted for; CostTracker flags has_unknown_pricing instead of silently billing $0.
1742 tests passing (+180 new). 0 regressions.
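
The signature-inspection fix for `text_delta_callback` can be illustrated with the standard `inspect` module. `accepts_keyword` and `call_adapter` are hypothetical helpers that show the idea rather than the runtime's code.

```python
import inspect


def accepts_keyword(func, name: str) -> bool:
    """True if func can be called with name=... as a keyword argument."""
    try:
        parameters = inspect.signature(func).parameters
    except (TypeError, ValueError):
        return False  # signature unavailable: be conservative
    if name in parameters:
        return parameters[name].kind in (
            inspect.Parameter.POSITIONAL_OR_KEYWORD,
            inspect.Parameter.KEYWORD_ONLY,
        )
    return any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in parameters.values()
    )


def call_adapter(complete, messages, text_delta_callback=None):
    kwargs = {}
    if text_delta_callback is not None and accepts_keyword(
        complete, "text_delta_callback"
    ):
        kwargs["text_delta_callback"] = text_delta_callback
    return complete(messages, **kwargs)


def legacy_complete(messages):  # custom adapter written before streaming
    return "full text"


def streaming_complete(messages, text_delta_callback=None):
    for token in ("full", " ", "text"):
        if text_delta_callback:
            text_delta_callback(token)
    return "full text"


tokens = []
legacy_result = call_adapter(legacy_complete, [], tokens.append)
streaming_result = call_adapter(streaming_complete, [], tokens.append)
```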

v1.0.9 — 2026-05-14

Inline text streaming + multimodal media references. Two features that make shipit feel live in chat UIs: token-by-token text streaming for real-time typing, and first-class image/file references in prompts.

Inline text streaming

LLM.complete(text_delta_callback=…) — stream assistant text token-by-token as it's generated, instead of waiting for the full response. The callback fires for each incremental text chunk.
AgentRuntime emits text_delta events — drive SSE or WebSocket consumers directly from the event stream for ChatGPT-style live typing in the browser.
Non-streaming behavior preserved — streaming is opt-in per call; omit the callback and complete() behaves exactly as before.
Implemented end-to-end for LiteLLM, and no-op-compatible for the other adapters (they return the full text in a single delta), so nothing breaks if a backend can't stream.

Multimodal media references

MediaReference — reference an image or file inside a prompt without inlining bytes; the runtime resolves it at send time.
MediaStore — pluggable storage for media, with InMemoryMediaStore for tests and short-lived runs and FileMediaStore for on-disk persistence.
extract_media_refs + build_multimodal_message — pull media references out of a prompt and assemble the provider-native multimodal message payload.
See Agent → Multimodal chat.

v1.0.8 — 2026-05-09

Structured output overhaul + verifier network. Two flagship features that genuinely beat LangChain on the surfaces it tries hardest at — and ship more broadly applicable wins than v1.0.7's connector explosion.

Structured output — same-conversation validation retry

Agent.run(prompt, output_schema=MyModel, max_validation_retries=2) — pass a Pydantic model or JSON Schema dict; get back a typed result.parsed.
Auto-retry on validation failure inside the same conversation — when the first parse fails, the runtime appends the bad assistant turn + a corrective user turn ("that response could not be parsed: …") and retries. No separate "fixing LLM" call (LangChain's OutputFixingParser requires one).
Streaming partial JSON parser — parse_partial_json('{"a": "hel') returns {"a": "hel"}. StructuredOutput.stream(prompt) yields progressively richer dicts as tokens arrive, then a final validated typed object.
StructuredOutput — standalone wrapper for one-shot extraction without the agent loop; same retry path, exposed publicly.
New top-level exports: StructuredOutput, StructuredOutputResult, parse_partial_json. Agent.run results gain result.parsed, and result.output now carries the corrected text.
See Agent → Structured output.
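
To make the partial-JSON idea concrete, here is a deliberately simple sketch. It is not the library's `parse_partial_json`: the hypothetical `parse_partial` closes any open string and containers, and on failure retries with a shorter prefix, which is quadratic in the worst case but easy to reason about.

```python
import json


def _closers(prefix: str) -> str:
    """Return the characters needed to close open strings and containers."""
    stack = []
    in_string = False
    escaped = False
    for ch in prefix:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append("}" if ch == "{" else "]")
        elif ch in "}]" and stack:
            stack.pop()
    return ('"' if in_string else "") + "".join(reversed(stack))


def parse_partial(text: str):
    # Try the longest prefix first, dropping one character per failure.
    for end in range(len(text), 0, -1):
        prefix = text[:end]
        try:
            return json.loads(prefix + _closers(prefix))
        except json.JSONDecodeError:
            continue
    return None
```

Calling it on successive prefixes of a token stream yields progressively richer values. A dangling key or half-written literal is dropped until enough text has arrived to parse it.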

Verifier network — process supervision

Agent(verifier=VerifierNetwork(llm=cheap_llm)) — a second cheap LLM vetoes hallucinated tool calls before they fire and rates progress between iterations. Both checks fail open (verifier failures never block the agent).
Pre-tool veto — wraps every tool. Verifier returns allow | veto | rewrite; vetoed calls become synthetic error tool-results so the agent re-plans without the bad action having actually run.
Progress check — after each iteration, scores progress 0-1. When the score stays below progress_threshold for progress_window consecutive iterations, maybe_nudge() returns a "you're stalling" message you can inject as a user turn.
Confidence-gated — verdicts below veto_min_confidence get downgraded to ALLOW (avoid over-blocking on uncertain calls).
Hard caps — max_pretool_calls_per_run, max_progress_calls_per_run so the verifier itself can't run away on cost.
Telemetry — verifier.stats exposes per-run counters: vetoes, rewrites, nudges, score history.
LangGraph's ToolNode has no per-call gating. LangChain's RunnableWithMessageHistory has no progress detector. Process supervision in shipit is one constructor argument.
See Agent → Verifier network.
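
Two of the mechanisms above, confidence gating and the stall window, reduce to small pieces of logic. The sketch below uses the parameter names from the notes, but the functions, default values, and the choice to downgrade rewrites as well as vetoes are assumptions.

```python
from collections import deque


def gate_verdict(verdict: str, confidence: float, veto_min_confidence: float = 0.6) -> str:
    """Downgrade low-confidence non-allow verdicts to allow."""
    if verdict != "allow" and confidence < veto_min_confidence:
        return "allow"
    return verdict


class StallDetector:
    def __init__(self, progress_threshold: float = 0.3, progress_window: int = 3):
        self.progress_threshold = progress_threshold
        self.recent = deque(maxlen=progress_window)

    def record(self, score: float) -> bool:
        """Record one iteration's score; True means the run looks stalled."""
        self.recent.append(score)
        window_full = len(self.recent) == self.recent.maxlen
        return window_full and all(
            s < self.progress_threshold for s in self.recent
        )


detector = StallDetector(progress_threshold=0.3, progress_window=3)
flags = [detector.record(score) for score in (0.5, 0.1, 0.2, 0.1)]
```

The detector only reports a stall once every score in the window is below the threshold, so one good iteration resets the countdown.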

Episodic memory consolidation

MemoryConsolidator(llm=cheap_llm).consolidate(memory=..., recent_messages=...) — LLM distills the last conversation into 3-8 durable facts and writes them to SemanticMemory. Categories (preference, project, goal, person, other) are tracked for filtering.
Forgetting curve — consolidator.decay(knowledge, half_life_days=14) applies exponential decay to fact strength and prunes facts below forgetting_threshold. Pure local arithmetic; no LLM call.
Core memory — consolidator.core_memory(knowledge, top_k=5) returns the top-K most-load-bearing facts ranked by strength + 0.1·log1p(retrievals). Inject into the system prompt every turn for ChatGPT-style "remembers things across sessions".
Retrieval bumping — consolidator.record_retrieval(knowledge, [fact_texts]) increments retrieval counts. Frequently-retrieved facts rise to core memory automatically.
New top-level exports: MemoryConsolidator, DistilledFact, ConsolidationResult.
ChatGPT's Memories feature is add_fact(text) with no decay, no retrieval-based promotion. Ours is principled and self-hostable.
See Agent → Episodic memory consolidation.
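
The decay and ranking arithmetic can be worked through by hand. The sketch assumes the usual half-life form, strength × 0.5^(age / half_life); the ranking formula is the one quoted above, and the facts and the 0.05 threshold are made-up sample values.

```python
import math


def decayed_strength(strength: float, age_days: float, half_life_days: float = 14.0) -> float:
    return strength * 0.5 ** (age_days / half_life_days)


def core_score(strength: float, retrievals: int) -> float:
    return strength + 0.1 * math.log1p(retrievals)


facts = [
    {"text": "prefers dark mode", "strength": 1.0, "age_days": 28, "retrievals": 0},
    {"text": "ships on Fridays", "strength": 0.6, "age_days": 0, "retrievals": 9},
    {"text": "tried tabs once", "strength": 0.2, "age_days": 42, "retrievals": 0},
]
FORGETTING_THRESHOLD = 0.05  # sample value

kept = []
for fact in facts:
    strength = decayed_strength(fact["strength"], fact["age_days"])
    if strength >= FORGETTING_THRESHOLD:  # weaker facts are pruned
        kept.append({**fact, "strength": strength})

core = sorted(
    kept, key=lambda f: core_score(f["strength"], f["retrievals"]), reverse=True
)
```

After two half-lives the first fact drops to 0.25 and survives; after three, the last fact drops to 0.025 and is pruned. The frequently retrieved fact ranks first because its retrieval bonus adds roughly 0.23 to its score.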

Time-travel replay

TraceReplayer.from_store(store, trace_id) — load any saved trace and walk events programmatically. Three constructors: from_record, from_store, from_file.
replayer.fork(at_event=N, edit_user_message='...') — capture the conversation state at any event, optionally with a tweaked user prompt. Returns a ReplayCheckpoint.
checkpoint.continue_from(agent=fresh_agent) — resume the run on a fresh Agent, with agent.history pre-filled. Forwards arbitrary Agent.run kwargs (e.g. output_schema=).
diff_traces(left, right) — side-by-side comparison. Reports matched events, divergence point, type mismatches, and only-in-left / only-in-right tails. .to_lines() for human-readable rendering.
New top-level exports: TraceReplayer, ReplayCheckpoint, ReplayResult, ForkPoint, TraceDiff, diff_traces.
LangSmith's Playground is SaaS-only. Inngest's branching is SaaS-only. Ours is library-level, open-source, and works against your existing FileTraceStore.
See Agent → Time-travel replay.

ComputerUseAgent — browser automation

ComputerUseAgent(llm=, browser=, goal=) — screenshot → reason → act loop. Show a screenshot to a vision-capable LLM, parse a structured action back, execute, repeat until DONE.
PlaywrightBrowserSession.launch(headless=True) — production driver. Context-manager support; pip install playwright && playwright install chromium to enable.
MockBrowserSession — deterministic test double that records every call. Unit-test computer-use logic without spawning a browser.
Two action emit shapes — Anthropic's native computer-use tool (structured tool_use block) AND plain-text fallback (ACTION: click 100,200) for any vision LLM.
parse_action(raw) — pure parser, no IO. Handles both shapes plus prose-wrapped responses.
Recovery — when an action raises, the agent surfaces the error back to the model as a user message. Production-ready resilience without extra code.
New top-level exports: ComputerUseAgent, BrowserSession, MockBrowserSession, PlaywrightBrowserSession, ComputerAction, ComputerUseResult, ActionKind, ActionRecord, parse_action.
Devin / Multi-On / OpenAI Operator are SaaS products. Ours is a library — self-host, plug into your own loop, fork the implementation.
See Agent → ComputerUseAgent.
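
A pure parser for the plain-text fallback shape might look like the following. Only the `ACTION: click 100,200` form comes from the notes above; the `type` and `done` forms, the `Action` dataclass, and `parse_text_action` are assumptions for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Action:
    kind: str
    x: Optional[int] = None
    y: Optional[int] = None
    text: Optional[str] = None


ACTION_LINE = re.compile(r"\bACTION:\s*(\w+)[ \t]*(.*)", re.IGNORECASE)


def parse_text_action(raw: str) -> Optional[Action]:
    """Find the first ACTION line, even when wrapped in prose; no IO."""
    match = ACTION_LINE.search(raw)
    if match is None:
        return None
    kind, rest = match.group(1).lower(), match.group(2).strip()
    if kind == "click":
        coords = re.fullmatch(r"(\d+)\s*,\s*(\d+)", rest)
        if coords is None:
            return None
        return Action("click", x=int(coords.group(1)), y=int(coords.group(2)))
    if kind == "type":
        return Action("type", text=rest)
    if kind == "done":
        return Action("done")
    return None


wrapped = "I can see the login button.\nACTION: click 100,200\nThen I will wait."
action = parse_text_action(wrapped)
```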

Tests + docs

+318 unit tests (1190 → 1508), zero regressions.
Five new notebooks — 54_structured_output_with_retry.ipynb, 55_verifier_network.ipynb, 56_episodic_memory_consolidation.ipynb, 57_time_travel_replay.ipynb, 58_computer_use_agent.ipynb.
Five new docs pages with full API reference, configuration deep dives, cost analysis, real-life examples, and beat-LangChain / Operator / ChatGPT comparison tables.

v1.0.7 — 2026-04-24

Agents for every role. Twelve new tools, nine new persona specialists, seven persona walk-through notebooks. shipit-agent is no longer only a developer-agent framework — it ships agents for developers, designers, sales reps, PMs, data analysts, finance, customer support, and recruiters. 1190 unit tests, 286 new in this release, zero regressions.

See RELEASE_NOTES_1.0.7.md for the full breakdown.

v1.0.6 — 2026-04-21

Autopilot — the long-running runtime. Plus 7 new role specialists, 3 new tools, and 8 new notebooks. Autopilot turns any agent into a budget-gated, checkpointed, streaming worker that runs until every success criterion is met. Fan-out dispatches N children in parallel. A reflection critic short-circuits the loop once a confident reviewer confirms the goal. Artifacts capture code blocks, markdown docs, and tool outputs as structured deliverables. A scheduler daemon drains a persistent JSON queue for 24-hour operation. 8 new Bedrock-Llama notebooks. 805 total tests. All passing.

Autopilot — long-running runtime

Autopilot(llm, goal, budget, …) — composes GoalAgent with budget gates, atomic checkpoints, heartbeats, and a live event stream.
BudgetPolicy(max_seconds, max_tool_calls, max_tokens, max_dollars, max_iterations) — every axis independently honored; set any to None / 0 to disable.
Goal-satisfaction termination, not step count. Stops the moment every criterion is verified OR any budget trips.
Atomic JSON checkpoints per iteration — ~/.shipit_agent/checkpoints/<run_id>.json. Crash → autopilot.resume(run_id) picks up at the next iteration.
autopilot.stream(run_id) — iterator of {kind, ...} events: autopilot.run_started, autopilot.iteration, autopilot.heartbeat, autopilot.criteria_satisfied, autopilot.budget_exceeded, autopilot.result.
default_heartbeat_stderr — drop-in sink. Custom callables (Slack / Datadog / webhook) just as easy.
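
Atomic per-iteration checkpoints are typically written to a temporary file and then renamed into place, so a crash never leaves a half-written JSON file. The helpers below are a sketch of that pattern with hypothetical names; the checkpoint directory is a temp folder rather than the real `~/.shipit_agent/checkpoints`.

```python
import json
import os
import tempfile
from pathlib import Path


def write_checkpoint(directory: Path, run_id: str, state: dict) -> Path:
    directory.mkdir(parents=True, exist_ok=True)
    final_path = directory / f"{run_id}.json"
    fd, tmp_name = tempfile.mkstemp(
        dir=str(directory), prefix=f"{run_id}.", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(state, handle)
            handle.flush()
            os.fsync(handle.fileno())  # data on disk before the rename
        os.replace(tmp_name, final_path)  # atomic within one filesystem
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return final_path


def load_checkpoint(directory: Path, run_id: str) -> dict:
    return json.loads((directory / f"{run_id}.json").read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    checkpoints = Path(tmp) / "checkpoints"
    write_checkpoint(checkpoints, "run-1", {"iteration": 1})
    write_checkpoint(checkpoints, "run-1", {"iteration": 2})
    resumed = load_checkpoint(checkpoints, "run-1")
    leftovers = sorted(p.name for p in checkpoints.iterdir())
```

A reader sees either the previous checkpoint or the new one, never a mixture, which is what makes resuming at the next iteration safe.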

Reflection critic

Critic(llm=..., confidence_threshold=0.75) — scores every iteration's output against the goal's criteria and feeds suggestions into the next iteration's prompt.
critic=True on Autopilot to use your run's LLM as a self-check; pass a Critic(llm=reviewer_llm) for a dedicated stronger reviewer.
Confidence-gated termination — flag-flips only count when the critic meets the confidence gate. Low-confidence "yes" still logs feedback but does not halt.
JSON-tolerant parsing — handles fenced ```json, extra prose, padding/trimming of criteria, and garbage input without raising.
New event kind autopilot.critic on every iteration's stream.

Artifacts — structured deliverables

ArtifactCollector — collects Artifact(kind, name, content, language, iteration) during the run.
Auto-extraction from every iteration's output — fenced code blocks (kind="code", with language) and top-level markdown docs (kind="markdown").
Tool-metadata ingestion — tools that declare {"artifact": True, "kind": ..., "name": ..., "content": ...} in their result metadata are captured explicitly.
Optional disk persistence — one JSON file per artifact, handy for CI build outputs.
New event kind autopilot.artifact; final AutopilotResult.artifacts carries the full list.

Parallel fan-out

autopilot.fanout(items, objective_template, criteria_template, max_parallel, child_budget_frac) — ThreadPoolExecutor-backed N-way parallelism.
Per-child budget scaling — each child inherits parent_budget * child_budget_frac (default 20%). Keeps aggregate spend bounded on 50-item batches.
autopilot.fanout_stream() — live per-child events (autopilot.fanout_child) for dashboard rendering.
FanoutResult(children, aggregated_output, wall_seconds, failed) — rolled-up status (completed | partial | failed), ordered children, default markdown digest or custom aggregator.
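
The budget scaling and status roll-up can be sketched independently of Autopilot. Everything below is hypothetical (including `fan_out` and the dict-shaped budget); it mirrors the described behaviour: each child gets the parent budget times `child_budget_frac`, results stay in input order, and the roll-up is completed, partial, or failed.

```python
from concurrent.futures import ThreadPoolExecutor


def scale_budget(parent: dict, child_budget_frac: float = 0.2) -> dict:
    # None means "axis disabled" and stays disabled for children.
    return {
        axis: (None if limit is None else limit * child_budget_frac)
        for axis, limit in parent.items()
    }


def rollup_status(statuses: list) -> str:
    ok = sum(1 for status in statuses if status == "completed")
    if ok == len(statuses):
        return "completed"
    return "partial" if ok else "failed"


def fan_out(items, worker, parent_budget, max_parallel=4, child_budget_frac=0.2):
    child_budget = scale_budget(parent_budget, child_budget_frac)

    def run_child(item):
        try:
            output = worker(item, child_budget)
            return {"item": item, "status": "completed", "output": output}
        except Exception as exc:  # one failing child must not sink the batch
            return {"item": item, "status": "failed", "output": str(exc)}

    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        children = list(pool.map(run_child, items))  # map keeps input order
    return {
        "status": rollup_status([c["status"] for c in children]),
        "children": children,
    }


def sample_worker(item, budget):
    if item == "b":
        raise RuntimeError("boom")
    return f"{item} done within ${budget['max_dollars']:.2f}"


result = fan_out(
    ["a", "b", "c"], sample_worker, {"max_dollars": 10.0, "max_seconds": None}
)
```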

Scheduler daemon

SchedulerDaemon(llm_factory, queue_path, tick_seconds) — persistent JSON goal queue at ~/.shipit_agent/autopilot-queue.json.
enqueue(), list_queue(), remove(), run_once(), run_forever(). Stateless daemon; crash-safe.
Heartbeat events on idle so you can wire Slack / Datadog telemetry.
CLI: shipit autopilot, shipit daemon, shipit queue {add,list,remove}. Systemd / launchd / Docker recipes in the docs.

7 new role specialists

Engineering — generalist-developer, debugger
Design — design-reviewer
Product — product-manager
Go-to-market — sales-outreach, customer-success, marketing-writer
Auto-applied to agents.json on import — 40 → 47 specialists total.

3 new power tools

computer_use — drive the local desktop (screenshots, click, type, drag, key chords). Platform backends for macOS (cliclick / osascript), Linux (xdotool / scrot), Windows (PowerShell). Graceful install hints when a dep is missing.
hubspot_ops — HubSpot CRM v3 REST wrapper. Search / get / create contacts, companies, deals; attach notes. Auth via HUBSPOT_TOKEN env.
research_brief — one-call research primitive. Web search + top-page skim + numbered citations. No API key (DuckDuckGo HTML). Optional deep=True fetches each source page for richer summaries.

Notebooks — 37 through 44

37 — Autopilot quickstart.
38 — Live streaming (autopilot.stream(), render_stream, custom heartbeats).
39 — Persistence, resume, scheduler daemon.
40 — Developer / Debugger / Researcher specialists.
41 — Design / PM / Sales / CS / Marketing specialists.
42 — computer_use / hubspot_ops / research_brief tools.
43 — Fan-out · Critic · Artifacts.
44 — The Complete Tour — every feature end-to-end in one notebook.

All notebooks use build_llm_from_env("bedrock"); the default is Bedrock Llama 4 Scout, matching the 01–36 series.

Other changes

AutopilotResult grew artifacts: list[dict] and critic_verdict: dict. Existing fields unchanged.
Autopilot accepts critic=True | Critic(...) and artifacts=True | ArtifactCollector(...).
Fan-out helpers (_scale_budget, _slug, _rollup_status) are pure functions — re-use them in your own dispatchers.
39 new tests (test_autopilot_artifacts.py, test_autopilot_critic.py, test_autopilot_fanout.py).

Second half — CostRouter, non-blocking ask_user, vision, sandbox, specialists-as-developers

The second half of v1.0.6 adds four more primitives that compose with Autopilot and an overhaul of the specialist tool presets so every role can actually execute code.

CostRouter — tiered LLM routing

shipit_agent.routing.CostRouter — drop-in LLM adapter that classifies each turn as easy / medium / hard and routes to the cheapest adequate model.
Heuristic classifier (classify_difficulty) with no extra LLM call — hard-keyword list + length thresholds + code-fence detection, tuned from real agent traces. Pass difficulty_fn=... to swap in your own oracle.
Tier(llm, price_per_1k, name) — wrap any shipit_agent LLM; price only drives the report, never the routing decision.
SpendReport — tier counts, estimated spend, "would-have-been" spend on hardest tier, and savings_pct. Populated live as the runtime calls complete() / stream().
force_tier=DifficultyTier.HARD for audits; falls back to MEDIUM when the classifier raises. Runs never die on classification errors.
Typical savings on 24h runs: 50–70%.
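
A heuristic classifier and the savings figure can be sketched without any LLM call. The keyword list, length thresholds, and prices below are invented for illustration and are not the tuned values; `classify` and `savings_pct` are hypothetical functions.

```python
FENCE = "`" * 3  # a Markdown code fence, built this way to keep the example tidy
HARD_KEYWORDS = ("refactor", "prove", "debug", "architecture")  # illustrative


def classify(prompt: str) -> str:
    lowered = prompt.lower()
    if (
        FENCE in prompt
        or any(word in lowered for word in HARD_KEYWORDS)
        or len(prompt) > 2000
    ):
        return "hard"
    if len(prompt) > 300:
        return "medium"
    return "easy"


def savings_pct(calls, price_per_1k: dict) -> float:
    """calls is a list of (tier, tokens); compares against all-hard spend."""
    actual = sum(tokens / 1000 * price_per_1k[tier] for tier, tokens in calls)
    baseline = sum(tokens / 1000 * price_per_1k["hard"] for _, tokens in calls)
    return 0.0 if baseline == 0 else (baseline - actual) / baseline * 100


prices = {"easy": 1.0, "medium": 4.0, "hard": 10.0}  # made-up prices per 1k tokens
report = savings_pct([("easy", 1000), ("hard", 1000)], prices)
```

In this toy example the actual spend is 11 against an all-hard baseline of 20, a 45% saving. As the notes say, price feeds only the report and never the routing decision.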

Non-blocking `ask_user_async`

New tool ask_user_async pauses an Autopilot run cleanly — does not block the loop.
File-based side channel at ~/.shipit_agent/askuser/<run_id>.json. Atomic rename on every write; crash-safe.
Autopilot integration — on every iteration, if a question is pending on the channel, the run halts with new status awaiting_user. resume() returns immediately while the channel is still pending; once answered, the loop continues.
shipit answer <run_id> "..." — CLI to reply. --index N targets a specific question; running shipit answer <run_id> with no text lists pending + answered history.
Multiple outstanding questions are supported; write_answer targets the latest by default.
SHIPIT_ASKUSER_DIR env redirects the channel (useful for tests and containerized runs).
Safe against path traversal — run_id is slugged before becoming a filename.
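
Slugging a run id before using it as a filename is what keeps the side-channel file inside its directory. A minimal sketch, assuming a conservative allow-list of characters; the helper names and the `run` fallback are hypothetical.

```python
import re
from pathlib import Path


def slug_run_id(run_id: str) -> str:
    # Collapse every run of disallowed characters (including "/" and ".")
    # into one dash, then trim dashes at the ends.
    slug = re.sub(r"[^A-Za-z0-9_-]+", "-", run_id).strip("-")
    return slug or "run"


def channel_path(base: Path, run_id: str) -> Path:
    return base / f"{slug_run_id(run_id)}.json"


hostile = channel_path(Path("/tmp/askuser"), "../../etc/passwd")
```

The hostile id resolves to a file named `etc-passwd.json` directly under the base directory, since no path separators survive the slug.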

Vision feedback on `computer_use`

Every screenshot action now embeds the PNG's base64 bytes + media_type in result metadata, so a vision-capable LLM can actually reason over the captured image instead of just reading a file path.
4 MB cap — larger PNGs set vision=False + a vision_skip_reason; no context-window blow-ups.
Opt-out via vision=False kwarg on the tool call.
Read errors surface in vision_skip_reason rather than raising.
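
The size-capped embedding can be sketched as a pure function over the screenshot bytes. The `vision`, `media_type`, and `vision_skip_reason` keys come from the notes above; the `data_base64` key, the reading of 4 MB as 4 × 1024 × 1024 bytes, and the function name are assumptions.

```python
import base64

MAX_VISION_BYTES = 4 * 1024 * 1024  # assumed interpretation of the 4 MB cap


def vision_metadata(png_bytes: bytes, vision: bool = True) -> dict:
    if not vision:
        return {"vision": False, "vision_skip_reason": "disabled by caller"}
    if len(png_bytes) > MAX_VISION_BYTES:
        return {
            "vision": False,
            "vision_skip_reason": f"image is {len(png_bytes)} bytes, over the cap",
        }
    return {
        "vision": True,
        "media_type": "image/png",
        "data_base64": base64.b64encode(png_bytes).decode("ascii"),
    }


png_header = b"\x89PNG\r\n\x1a\n"
small = vision_metadata(png_header)
too_big = vision_metadata(bytes(MAX_VISION_BYTES + 1))
opted_out = vision_metadata(png_header, vision=False)
```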

Docker sandbox on `code_execution`

sandbox=True on run_code runs the snippet inside an ephemeral container with --network none, a --read-only root filesystem, and a writable 64 MB /tmp tmpfs.
network=True opts back into bridge networking (rarely needed — isolation is the point).
image=... overrides the per-language default image.
Default images: python:3.11-slim, node:22-alpine, ruby:3.3-alpine, alpine:3.20 for shells, plus typescript / php / perl / lua / r. Override via SANDBOX_IMAGES.
Workspace mounted read-only at /work; snippet can read but not modify host files.
workspace_root kwarg points the tool at any user-chosen directory — per-call override of the shared default. Works in both sandbox and non-sandbox modes.
Graceful fallback when Docker isn't installed — returns metadata={"ok": False, ...} + a clear install hint. Runs never crash.
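
For orientation, the isolation flags above map onto a `docker run` invocation roughly as follows. `sketch_sandbox_command` and `docker_missing_result` are hypothetical helpers, not the shipped `build_sandbox_command`, and the hint text is invented.

```python
import shutil


def sketch_sandbox_command(image, command, workspace_root, network=False):
    """Assemble an argv list for an ephemeral, locked-down container."""
    args = ["docker", "run", "--rm"]
    if not network:
        args += ["--network", "none"]  # omitted when network=True, which leaves Docker's default bridge
    args += [
        "--read-only",                              # read-only root filesystem
        "--tmpfs", "/tmp:rw,size=64m",              # writable 64 MB scratch space
        "--volume", f"{workspace_root}:/work:ro",   # host workspace, read-only
        "--workdir", "/work",
        image,
    ]
    return args + list(command)


def docker_missing_result(which=shutil.which):
    """Return a fallback result instead of raising when Docker is absent."""
    if which("docker") is None:
        return {"ok": False, "hint": "Docker not found; install it to use sandbox=True"}
    return None
```

Building the command as a list rather than a shell string avoids quoting problems when the snippet or the workspace path contains spaces.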

Specialists that run + test code

All seven role specialists (generalist-developer, debugger, design-reviewer, product-manager, sales-outreach, customer-success, marketing-writer) now ship with run_code + ask_user_async in their tool list.
Developer + debugger also keep bash + run_tests; designer gains computer_use; PM + sales + marketing gain research_brief; CS keeps hubspot_ops + gmail + slack.
Every run_code call accepts workspace_root so the user points the specialist at their project.

Notebook 45 + 3 new doc pages

Notebook 45 — 34 cells — 45_cost_router_async_ask_vision_sandbox.ipynb. Covers routing, async ask, vision, sandbox, workspace override, composed live streaming (critic + artifacts + router together), JSONL stream, parallel fan-out stream, and a specialist-as-developer example.
routing/cost-router.md — full CostRouter guide, custom classifier recipe, force_tier override, SpendReport schema.
autopilot/ask-user-async.md — side-channel anatomy, CLI + programmatic answer path, prompt-design rules.
tools/code-execution-sandbox.md — per-language image table, workspace_root use case, Docker-missing fallback.
tools/computer-use.md — new "Vision feedback" section with the metadata["vision"] contract and opt-out.

Tests

58 new tests (27 test_cost_router.py, 14 test_askuser_async.py, 5 test_computer_use_vision.py, 12 test_code_execution_sandbox.py).
Grand total: 863 tests, 0 failures.

New / changed public surface

| Symbol | Where |
| --- | --- |
| `CostRouter`, `Tier`, `SpendReport`, `DifficultyTier`, `classify_difficulty` | `shipit_agent.routing` |
| `ask_question`, `write_answer`, `pending_questions`, `all_entries`, `channel_file`, `channel_dir`, `clear` | `shipit_agent.askuser_channel` |
| `AskUserAsyncTool` | `shipit_agent.tools.ask_user_async` |
| `AutopilotResult.status == "awaiting_user"` | `shipit_agent.autopilot.result` |
| `CodeExecutionTool.run(..., sandbox=True, network=True, image="...", workspace_root="...")` | `shipit_agent.tools.code_execution` |
| `build_sandbox_command`, `SANDBOX_IMAGES`, `SANDBOX_CMDS` | `shipit_agent.tools.code_execution.sandbox` |
| `ComputerUseTool.run(..., vision=False)` + `metadata["vision"/"image_base64"/"media_type"]` | `shipit_agent.tools.computer_use` |
| `shipit answer <run_id> [text] [--index N]` | CLI subcommand |

v1.0.5 — 2026-04-18

Prebuilt agents, multi-agent crews, notifications, and cost tracking. 40 ready-to-use agent personas across 8 categories. DAG-based ShipCrew orchestration with sequential, parallel, and hierarchical modes. Slack, Discord, and Telegram notification hub. Real-time cost tracking with budget enforcement. 4 new notebooks, 4 new doc pages, 153 new tests. 706 total tests. All passing.

Prebuilt Agents — 40 Ready-to-Use Personas

shipit_agent.agents module — AgentDefinition dataclass + AgentRegistry for loading, searching, and composing agent personas.
40 agents across 8 categories: Architecture (5), Code Quality (6), Security (5), DevOps (5), Testing (5), Planning (4), Research (5), Content (5).
AgentRegistry.default() — loads in one line. Search, browse by category, merge with project-local agents.
.shipit/agents/ override — drop JSON files in your project; they override built-in agents with the same ID.

ShipCrew — Multi-Agent Crew Orchestration

ShipCrew, ShipAgent, ShipTask — DAG-based multi-agent crews with task dependencies.
Three execution modes: sequential, parallel (ThreadPoolExecutor), hierarchical (LLM-driven assignment + review).
Template variable resolution — {output_key} in descriptions auto-resolves from upstream outputs.
ShipAgent.from_registry() — load crew agents from the prebuilt registry.
Streaming — crew.stream() yields events for crew start, task start/complete/fail, crew complete.
Validation — cycle detection (Kahn's algorithm), missing agent checks, unknown dependency warnings.
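
Two of the mechanics above, sketched independently of the library: cycle detection with Kahn's algorithm, and `{output_key}` substitution. The function names and the dict-of-dependencies input are assumptions. This version raises on an unknown dependency for brevity, whereas the notes above describe a warning.

```python
import re
from collections import deque


def topological_order(deps):
    """Kahn's algorithm over {task: [upstream tasks]}; raises ValueError on a cycle."""
    for task, upstream in deps.items():
        for name in upstream:
            if name not in deps:
                raise ValueError(f"task {task!r} depends on unknown task {name!r}")
    indegree = {task: len(upstream) for task, upstream in deps.items()}
    dependents = {task: [] for task in deps}
    for task, upstream in deps.items():
        for name in upstream:
            dependents[name].append(task)
    ready = deque(task for task, count in indegree.items() if count == 0)
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for child in dependents[task]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    if len(order) != len(deps):  # tasks left with nonzero indegree form a cycle
        stuck = sorted(set(deps) - set(order))
        raise ValueError(f"dependency cycle among: {stuck}")
    return order


def resolve_template(description, outputs):
    """Replace {key} with an upstream output; unknown keys are left untouched."""
    return re.sub(
        r"\{(\w+)\}",
        lambda match: str(outputs.get(match.group(1), match.group(0))),
        description,
    )
```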

Notification Hub — Slack, Discord & Telegram

SlackNotifier — Block Kit webhooks with color-coded severity. Zero external deps.
DiscordNotifier — rich embeds with inline metadata fields.
TelegramNotifier — Bot API with MarkdownV2 and auto-escaped special characters.
NotificationManager — multi-channel dispatch with min_severity and events filtering.
manager.as_hooks() — auto-notify on agent lifecycle events.
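
A small sketch of the filtering and escaping ideas above. The severity names and the `should_send` helper are hypothetical; the escaped character set follows the Telegram Bot API's MarkdownV2 rules, with the backslash added so literal backslashes survive.

```python
# Hypothetical severity scale; the library's actual levels may differ.
SEVERITY_ORDER = {"info": 0, "warning": 1, "error": 2, "critical": 3}

# Characters that must be backslash-escaped in Telegram MarkdownV2 text.
_MDV2_ESCAPES = {ord(ch): "\\" + ch for ch in "\\_*[]()~`>#+-=|{}.!"}


def escape_markdown_v2(text: str) -> str:
    return text.translate(_MDV2_ESCAPES)


def should_send(severity, event, min_severity="info", events=None):
    """Drop messages below min_severity or outside the events allowlist."""
    if SEVERITY_ORDER[severity] < SEVERITY_ORDER[min_severity]:
        return False
    return events is None or event in events
```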

Cost Tracking & Budgets

CostTracker — real-time per-call cost tracking with 20+ model pricing table.
Budget(max_dollars=5.00) — enforcement with BudgetExceededError and on_cost_alert callback.
tracker.as_hooks() — automatic cost tracking from every LLM call.
Model aliases — "opus", "sonnet", "haiku" resolve to full model IDs.
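
Budget enforcement can be pictured as a running total checked after every call. The sketch below is not the library's `CostTracker`: the class names, the prices, the alias table, and the alert threshold are all invented for illustration.

```python
class BudgetExceeded(RuntimeError):
    """Local stand-in for the library's BudgetExceededError."""


# Hypothetical prices in dollars per 1,000 (input, output) tokens.
PRICES = {"small-model": (0.001, 0.002), "large-model": (0.010, 0.030)}
ALIASES = {"small": "small-model", "large": "large-model"}


class SketchTracker:
    def __init__(self, max_dollars, on_cost_alert=None, alert_fraction=0.5):
        self.max_dollars = max_dollars
        self.on_cost_alert = on_cost_alert
        self.alert_fraction = alert_fraction
        self.total = 0.0
        self._alerted = False

    def record(self, model, input_tokens, output_tokens):
        """Add one call's cost, fire the alert once, raise when over budget."""
        input_price, output_price = PRICES[ALIASES.get(model, model)]
        cost = input_tokens / 1000 * input_price + output_tokens / 1000 * output_price
        self.total += cost
        threshold = self.alert_fraction * self.max_dollars
        if self.on_cost_alert and not self._alerted and self.total >= threshold:
            self._alerted = True
            self.on_cost_alert(self.total)
        if self.total > self.max_dollars:
            raise BudgetExceeded(f"spent ${self.total:.4f} of ${self.max_dollars:.2f}")
        return cost
```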

Notebooks, Docs & Tests

4 notebooks: Prebuilt Agents (25 cells), ShipCrew (25 cells), Notifications (27 cells), Cost Tracking (31 cells).
4 doc pages: guides/prebuilt-agents.md, deep-agents/ship-crew.md, guides/notifications.md, guides/cost-tracking.md.
153 new tests (553 → 706 total). 29 new source files.

v1.0.4 — 2026-04-12

Skills, tools, and runtime power-up. All 32 tool prompts rewritten with decision trees and anti-patterns. Full skill-to-tool linking for all 37 packaged skills. Automatic iteration boost for skill-driven workflows. Expanded bash allowlist (50+ commands). Streaming, chat, and project-building examples across 3 notebooks. Comprehensive docstrings across every key module. 32 skill tests. All passing.

Skills — Full Tool Linking

37 skill tool bundles (up from 10) — every packaged skill now declares the built-in tools it needs. When a skill is selected, the agent auto-attaches the right tools.
Shared tool groups (_FILE_CORE, _CODE_CORE, _WEB_CORE) reduce duplication across bundles.
validate_tool_bundles() — new helper that checks every tool name in SKILL_TOOL_BUNDLES against the real builtin map.
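
The bundle check amounts to a set difference per skill. A minimal sketch, assuming bundles are plain lists of tool names; the skill names, group contents, and the `find_unknown_tools` helper are hypothetical, and the real helper's return type is not documented here.

```python
# Hypothetical shared group and bundles, for illustration only.
_FILE_CORE = ["read_file", "write_file"]
BUNDLES = {
    "docs-writer": [*_FILE_CORE, "web_search"],
    "broken-skill": [*_FILE_CORE, "no_such_tool"],
}
BUILTINS = {"read_file", "write_file", "web_search"}


def find_unknown_tools(bundles, builtin_names):
    """Return {skill: [tool names missing from the builtin map]}."""
    known = set(builtin_names)
    problems = {}
    for skill, tools in bundles.items():
        missing = sorted(set(tools) - known)
        if missing:
            problems[skill] = missing
    return problems
```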

Agent — Iteration Boost & Efficiency

_effective_max_iterations() — auto-boosts 4 → 8 when skills inject extra tools so skill-driven workflows can complete without cutting off early.
Single skill computation — run() and stream() now compute skills once and reuse (previously 3x per call).

Tool Prompts — All 32 Upgraded

Every tool's prompt.py rewritten with decision trees, anti-patterns, workflow guidance, and cross-tool coordination.

Bash Allowlist Expansion

50+ safe commands added: mkdir, touch, cp, mv, echo, grep, curl, docker, kubectl, terraform, aws, go, cargo, npx, tsc, eslint, black, isort, tree, awk, cut, diff, and more.

Documentation

Comprehensive docstrings on agent.py, builtins.py, skills/loader.py, skills/registry.py, skills/tool_bundles.py, deep_agent/factory.py.
6 tool doc pages updated with enhanced prompts.
Skills guide expanded with 7 real-world examples, streaming sections, chat sessions, and event type reference.
Notebook 27 rewritten (38 cells): streaming, chat streaming, project build, web scraping, DeepAgent chat.
Notebook 29 (new): DeepAgent + skills + memory + verify + reflect + sub-agents + streaming.
Notebook 30 (new): real-world full project build across 6 steps with 5 different skills.

Tests

15 new tests (17 → 32 total): iteration boost, bundle validation, chat sessions, streaming, chat streaming, memory + skills, DeepAgent chat/stream.

v1.0.3 — 2026-04-11

Major feature release. Super RAG subsystem, DeepAgent factory (verify / reflect / goal / sub-agents), live multi-agent chat REPL (shipit chat), Agent memory cookbook, plus deep docs + notebook coverage. 521 unit tests. 19 Bedrock end-to-end smoke tests. All passing.

Super RAG

shipit_agent.rag subsystem — pluggable chunker + embedder + vector store + keyword store + hybrid pipeline (vector + BM25 + RRF + recency bias + rerank + context expansion).
rag= on every agent type — auto-wires rag_search / rag_fetch_chunk / rag_list_sources tools, augments the system prompt with citation instructions, and attaches result.rag_sources with stable [N] citation indices.
Adapters — DrkCacheVectorStore (pgvector over psycopg2) + lazy Chroma / Qdrant / pgvector.
Thread-local per-run source tracker so concurrent runs never leak citations.
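
The RRF step of the hybrid pipeline merges the vector and BM25 rankings by rank position rather than raw score. A minimal sketch of reciprocal rank fusion, assuming the commonly used constant k=60 (the value used by the library is not stated here):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids; each hit contributes 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["a", "b", "c"]
bm25_hits = ["b", "c", "a"]
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

Because only ranks are used, the two retrievers do not need comparable score scales.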

DeepAgent

shipit_agent.deep.DeepAgent — power-user factory bundling seven deep tools: plan_task, decompose_problem, workspace_files, sub_agent, synthesize_evidence, decision_matrix, verify_output. Guide
One-flag power features: verify=True, reflect=True, goal=Goal(...), rag=RAG(...), memory=AgentMemory(...).
agents= sub-agent delegation — plug any mix of agent types as named delegates via a built-in delegate_to_agent tool.
create_deep_agent() functional helper — auto-wraps plain Python callables as tools.
Nested event streaming — sub-agent events surface inside tool_completed.metadata['events'].

Live chat REPL

shipit chat — modern multi-agent terminal REPL. Switch agent types live, index files mid-session, save/load conversations, toggle reflect/verify, inspect tools and sources. Guide
Rich slash commands: /agent, /agents, /tools, /sources, /index, /rag, /goal, /reflect, /verify, /history, /save, /load, /reset, /info, …
Pluggable LLM provider via --provider; persistent sessions via --session-dir.

Streaming

DeepAgent.stream() covers every execution mode (direct, verified, reflective, goal-driven, sub-agent delegation).
PersistentAgent.stream() added with per-step checkpointing.
rag_sources event type added — emitted after every RAG-backed run.

Memory

Dedicated Agent → Memory cookbook explaining the two memory systems (memory_store= for the LLM's memory tool vs AgentMemory for application-curated profiles). Guide
DeepAgent auto-hydration — memory=AgentMemory(...) seeds the inner agent's history from the conversation summary.
Notebook 26 — runnable end-to-end tour.

Docs

New Agent section (7 pages): Overview, Examples, Streaming, With RAG, With Tools, Memory, Sessions.
New Super RAG section (7 pages): Overview, Standalone, Files & Chunks, With Agent, With Deep Agents, Adapters, API.
New DeepAgent page. Reference
Parameters Reference — every constructor parameter for every agent type and key class. Reference
Updated Architecture + Model Adapters reference pages.
Updated quickstart with Agent / Deep Agent / RAG sections.
Updated FAQ with "Agent types — which one should I use?".
5 new notebooks (22–26): RAG basics, RAG + Agent, RAG + Deep Agents, DeepAgent chat, Agent memory.
Full-width docs layout + collapsible TOC with floating toggle, persistence via localStorage.

Build

shipit-chat script entry point.
Granular extras: rag, rag-openai, rag-cohere, rag-chroma, rag-qdrant, rag-pgvector, rag-drk-cache, rag-pdf, rag-docx, rag-rerank-cohere, rag-rerank-cross-encoder, plus bedrock, google, groq, together, ollama. The all extra bundles everything.

Fixed

Tool schema format bug — RAGSearchTool, RAGFetchChunkTool, RAGListSourcesTool, WebhookPayloadTool now use the wrapped {"type": "function", "function": {...}} shape. Previously they were returning flat dicts and Bedrock's Converse API was rejecting them with empty-name validation errors. New regression test scans every tool for Bedrock compatibility.
memory=AgentMemory type coercion — DeepAgent and GoalAgent no longer auto-assign AgentMemory.knowledge (a SemanticMemory) into memory_store= (which expects a MemoryStore). memory= now only seeds history; users pass memory_store= explicitly for the runtime's memory tool.
Agent.with_builtins(tools=[...]) keyword collision — the method now accepts and merges user tools= with the builtin catalogue (last-write-wins on name collision).
AgentDelegationTool streaming — uses inner agent's stream() and packs events into tool_completed.metadata['events'].
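
The schema fix above comes down to the difference between a flat dict and the wrapped shape. A sketch of a converter plus the kind of check a regression test could run; both function names are hypothetical, and the default `parameters` value is an assumption.

```python
def wrap_tool_schema(schema):
    """Convert a flat {"name": ..., "parameters": ...} dict to the wrapped shape."""
    if schema.get("type") == "function" and "function" in schema:
        return schema  # already wrapped
    return {
        "type": "function",
        "function": {
            "name": schema["name"],
            "description": schema.get("description", ""),
            "parameters": schema.get("parameters", {"type": "object", "properties": {}}),
        },
    }


def has_wrapped_name(schema):
    """True when the schema is wrapped and carries a non-empty function name."""
    function = schema.get("function") or {}
    return schema.get("type") == "function" and bool(function.get("name"))
```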

Test coverage

521 unit tests (up from 285) — green.
19 Bedrock smoke tests in scripts/smoke_bedrock_e2e.py exercise every public surface end-to-end against real Bedrock.

v1.0.2 — 2026-04-10

Major feature release. Deep agents, structured output, pipelines, agent teams, advanced memory, output parsers, and runtime power features. 285 tests. 12 examples. 8 notebooks. 13 new doc pages.

Deep Agents

GoalAgent — Autonomous goal decomposition with success criteria, streaming, and .with_builtins(). Guide
ReflectiveAgent — Self-evaluation with quality scores and revision loop. Guide
Supervisor / Worker — Hierarchical delegation with quality review. Guide
AdaptiveAgent — Runtime tool creation from Python code. Guide
PersistentAgent — Checkpoint and resume across sessions. Guide
Channel / AgentMessage — Typed agent-to-agent communication. Guide
AgentBenchmark — Systematic agent testing framework. Guide
Deep Agents API Reference — Full constructor, method, and return type docs. Reference

Structured Output & Parsers

output_schema on Agent.run() — Pydantic models + JSON schemas. Guide
JSONParser, PydanticParser, RegexParser, MarkdownParser. Guide

Composition

Pipeline — Sequential, parallel, conditional, function steps, streaming. Guide
AgentTeam — LLM-routed multi-agent coordination with streaming. Guide

Advanced Memory

ConversationMemory — buffer/window/summary/token strategies. Guide
SemanticMemory — Embedding-based vector search. Guide
EntityMemory — Track people, projects, concepts. Guide
AgentMemory — Unified interface with .default(). Guide

Runtime Power Features

Parallel tool execution. Guide
Graceful tool failure. Guide
Context window management. Guide
Hooks & middleware. Guide
Mid-run re-planning. Guide
Async runtime. Guide
Transient error auto-retry (429/500/503).

Changed

Selective memory storage (breaking) — Only persist=True tool results stored.
Safer retry defaults — (ConnectionError, TimeoutError, OSError) instead of (Exception,).
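
Why the narrower retry tuple is safer: retrying on `Exception` also retries programming errors that will never succeed. A minimal sketch with a hypothetical `call_with_retry` helper and an assumed exponential backoff:

```python
import time

# The narrowed defaults named above. In Python 3, ConnectionError and
# TimeoutError are both subclasses of OSError.
TRANSIENT = (ConnectionError, TimeoutError, OSError)


def call_with_retry(fn, attempts=3, retry_on=TRANSIENT, base_delay=0.0):
    """Retry fn on transient errors only; anything else propagates at once."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))
```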

v1.0.1 — 2026-04-09

Maintenance release. Bug fix in the tool runner plus repo hygiene, contributor experience, and CI hardening. Strongly recommended upgrade from 1.0.0 if you use Bedrock gpt-oss-120b.

Fixed

ToolRunner argument collision — Fixed TypeError: got multiple values for argument 'context' when an LLM (notably bedrock/openai.gpt-oss-120b-1:0) emits context as a tool-call argument. The runner now strips reserved argument names (context, self) from tool-call arguments before forwarding. Affects every built-in tool.
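
The collision is easy to reproduce in plain Python, and the fix is a filter applied before the call. A sketch with hypothetical helper names (`sanitize_arguments`, `run_tool`); the reserved set comes from the entry above.

```python
RESERVED = {"context", "self"}


def sanitize_arguments(arguments):
    """Drop model-supplied keys that collide with runner-owned parameters."""
    return {key: value for key, value in arguments.items() if key not in RESERVED}


def run_tool(tool, arguments, context):
    return tool(context=context, **sanitize_arguments(arguments))


def echo(context, text):
    return f"{context}:{text}"


# The model emitted "context" as if it were an ordinary tool argument.
model_arguments = {"text": "hi", "context": "from-the-model"}
result = run_tool(echo, model_arguments, context="runner-context")
```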

Added

CHANGELOG.md at repo root in Keep a Changelog format
CONTRIBUTING.md with dev setup, commit conventions, PR checklist, and "how to add a new LLM adapter / tool" guides
GitHub issue templates — structured bug report, feature request, and config forms
PR template with 12-item verification checklist
Test CI — pytest -q on Python 3.11 + 3.12 × Ubuntu + macOS (4 matrix cells), with smoke-test of all 11 LLM adapter imports
Gitleaks secret scanning CI with SARIF upload to GitHub Security tab, inline PR comments, Actions summary
Pre-commit hooks — trailing whitespace, EOF fixer, YAML/TOML validation, gitleaks v8.21.2, ruff lint + format
Gitleaks allowlist for runtime tool outputs (scraped HTML contains false-positive "API keys" like Pushly domainKeys)

Changed

.gitignore rewritten to dedupe entries and cover all runtime directories (site/, .eggs/, pip-wheel-metadata/)
Runtime tool outputs untracked from git (sessions/, traces/, memory.json, .shipit_notebooks/**) — they were accidentally committed in 1.0.0

Security

Added CI and pre-commit secret scanning to prevent future credential leaks
Apart from the ToolRunner fix above, no runtime code changed in shipit_agent/.

v1.0.0 — 2026-04-09

First stable release. Focused on making the agent loop observable, interchangeable, and out of the way.

🧠 Live reasoning / thinking events

LLMResponse.reasoning_content field added to carry thinking/reasoning blocks from any provider
New _extract_reasoning() helper handles three shapes:
- Flat reasoning_content on the response message (OpenAI o-series, gpt-oss, DeepSeek R1, Anthropic via LiteLLM)
- Anthropic thinking_blocks[*].thinking (Claude extended thinking)
- model_dump() fallback for pydantic dumps
Runtime emits reasoning_started + reasoning_completed events whenever reasoning content is non-empty
All three LLM adapters — OpenAIChatLLM, AnthropicChatLLM, LiteLLMChatLLM / BedrockChatLLM — share the extraction helper
OpenAIChatLLM auto-passes reasoning_effort="medium" for reasoning-capable models (o1*, o3*, o4*, gpt-5*, deepseek-r1*)
AnthropicChatLLM supports thinking_budget_tokens=N to enable Claude extended thinking
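
The three response shapes can be handled in one pass. This is a simplified sketch, not the shipped `_extract_reasoning()`: it normalizes the message to a dict first (using `model_dump()` when available) and then checks the flat field before the thinking blocks.

```python
def extract_reasoning(message):
    """Return reasoning text from a provider message, or None if absent."""
    if isinstance(message, dict):
        data = message
    elif hasattr(message, "model_dump"):
        data = message.model_dump()  # pydantic-style objects
    else:
        data = getattr(message, "__dict__", {})

    flat = data.get("reasoning_content")
    if flat:
        return flat

    blocks = data.get("thinking_blocks") or []
    parts = [block.get("thinking", "") for block in blocks if isinstance(block, dict)]
    joined = "\n".join(part for part in parts if part)
    return joined or None
```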

⚡ Truly incremental streaming

agent.stream() now runs the agent on a background daemon thread
Events are pushed through a thread-safe queue.Queue as they're emitted
Consumer loop yields events the instant they happen — no buffering, no batched delivery
Worker exceptions are captured and re-raised on the consumer thread
Works in Jupyter, VS Code, JupyterLab, WebSocket/SSE transports, and plain terminals
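
The pattern described above, reduced to its core: a daemon worker pushes events into a `queue.Queue`, and a generator yields them as they arrive. `stream_events` and the `run(emit)` callback shape are assumptions for illustration.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def stream_events(run):
    """Run `run(emit)` on a background thread and yield each emitted event."""
    events = queue.Queue()
    failure = []

    def worker():
        try:
            run(events.put)
        except BaseException as exc:  # captured here, re-raised on the consumer side
            failure.append(exc)
        finally:
            events.put(_DONE)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = events.get()
        if item is _DONE:
            break
        yield item
    if failure:
        raise failure[0]
```

The sentinel is queued in `finally`, so the consumer always wakes up even when the worker fails.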

🛡️ Bulletproof Bedrock tool pairing

Planner output is now injected as a user-role context message rather than an orphan role="tool" message — fixes Bedrock's "number of toolResult blocks exceeds number of toolUse blocks" error
Every response.tool_calls entry gets a tool-result message unconditionally:
- Success → real tool-result
- Retry → retries first, then final result or error
- Unknown tool → synthetic "Error: tool X is not registered" tool-result
Stable call_{iteration}_{index} tool_call_ids round-trip through message metadata
Multi-iteration tool loops on Bedrock Claude, gpt-oss, and Anthropic native now work without modify_params band-aids
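
The pairing rule can be sketched as a loop that emits exactly one result per call, whatever happens. The message dict layout and the `pair_tool_results` helper are assumptions, and retries are left out; the id format and the unknown-tool error text follow the notes above.

```python
def pair_tool_results(tool_calls, registry, iteration):
    """Return one tool-result message per tool call, including failures."""
    messages = []
    for index, call in enumerate(tool_calls):
        call_id = f"call_{iteration}_{index}"
        tool = registry.get(call["name"])
        if tool is None:
            content = f"Error: tool {call['name']} is not registered"
        else:
            try:
                content = str(tool(**call.get("arguments", {})))
            except Exception as exc:
                content = f"Error: {exc}"
        messages.append({"role": "tool", "tool_call_id": call_id, "content": content})
    return messages
```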

🔑 Zero-friction provider switching

build_llm_from_env() walks upward from CWD to discover .env, so notebooks and scripts work regardless of where they're launched from
Eight providers: openai, anthropic, bedrock, gemini, vertex, groq, together, ollama, plus a generic litellm provider
Per-provider credential validation with clear error messages
SHIPIT_OPENAI_TOOL_CHOICE=required env var to force tool use on lazy models like gpt-4o-mini
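
The upward `.env` search can be sketched with `pathlib`. `find_dotenv` here is a hypothetical helper that shows the walk only; it is not the code inside `build_llm_from_env()` and it does not parse the file.

```python
from pathlib import Path


def find_dotenv(start=None):
    """Return the nearest .env at or above `start` (default: CWD), else None."""
    current = (Path(start) if start is not None else Path.cwd()).resolve()
    for directory in (current, *current.parents):
        candidate = directory / ".env"
        if candidate.is_file():
            return candidate
    return None
```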

🌐 In-process Playwright for `open_url`

OpenURLTool now uses Playwright's sync Chromium directly (headless, realistic desktop Chrome UA, 1280×800 viewport)
Handles JS-rendered pages, anti-bot 503s, modern TLS/ALPN
Stdlib urllib fallback when Playwright is not installed — zero third-party HTTP dependencies in the core fallback path
Errors never raise out of the tool: they return as ToolOutput with a warnings list in metadata
Rich metadata: fetch_method, status_code, final_url, title

🔍 Upgraded `ToolSearchTool`

Replaced binary substring match with drk_cache-style fuzzy scoring: SequenceMatcher.ratio() + 0.12 × token_hits
Configurable limit parameter, clamped to [1, max_limit]
New init kwargs: max_limit, default_limit, token_bonus
Structured error output for empty queries
Ranked output with scores and "when to use" hints from prompt_instructions
Noise filter: results below score=0.05 dropped
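
The scoring formula above translates almost directly into `difflib`. What text the query is compared against (here, tool name plus description) and the output shape are assumptions; the 0.12 bonus, the limit clamp, and the 0.05 floor come from the notes.

```python
from difflib import SequenceMatcher


def score(query, name, description, token_bonus=0.12):
    """SequenceMatcher.ratio() plus a bonus for each query token found verbatim."""
    haystack = f"{name} {description}".lower()
    needle = query.lower()
    ratio = SequenceMatcher(None, needle, haystack).ratio()
    token_hits = sum(1 for token in needle.split() if token in haystack)
    return ratio + token_bonus * token_hits


def search(query, tools, limit=5, max_limit=20, min_score=0.05):
    """Rank {name: description} entries; clamp limit and drop low scores."""
    if not query.strip():
        return {"error": "query must not be empty", "results": []}
    limit = max(1, min(limit, max_limit))
    ranked = sorted(
        ((score(query, name, text), name) for name, text in tools.items()),
        reverse=True,
    )
    kept = [(name, round(value, 3)) for value, name in ranked if value >= min_score]
    return {"results": kept[:limit]}
```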

🪵 Full event taxonomy

14 distinct event types with documented payloads:

run_started, mcp_attached, planning_started, planning_completed, step_started, reasoning_started, reasoning_completed, tool_called, tool_completed, tool_retry, tool_failed, llm_retry, interactive_request, run_completed

🔁 Iteration-cap summarization fallback

If the model is still calling tools when max_iterations is reached, the runtime gives it one more turn with tools=[] to force a natural-language summary
run_completed is never empty for normal runs
Guarded with try/except so summarization failures can't mask the rest of the run
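
A toy version of the fallback, assuming an `llm(messages, tools)` callable that returns a dict with `content` and `tool_calls`. Tool execution is omitted and the fallback message is invented; only the final turn with an empty tool list reflects the behaviour described above.

```python
def run_loop(llm, tools, max_iterations=4):
    """Loop until the model stops calling tools, then force a summary at the cap."""
    messages = []
    for _ in range(max_iterations):
        reply = llm(messages, tools)
        messages.append(reply)
        if not reply.get("tool_calls"):
            return reply["content"]
        # (tool execution and tool-result messages would go here)
    try:
        # One extra turn with no tools, so the model has to answer in prose.
        return llm(messages, [])["content"]
    except Exception:
        return "Run stopped at the iteration cap before a summary could be produced."
```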

Other changes

pyproject.toml: [project.urls] now points to correct GitHub org, adds Documentation and Changelog links
.env.example: expanded with all new env vars documented
notebooks/04_agent_streaming_packets.ipynb: full rewrite with .env loading, credential visibility printer, and live Markdown updates
README.md: new v1.0 release section with 8 headline features
Full MkDocs Material documentation site at shipiit.github.io/shipit_agent

Breaking changes

None — this is the first stable release. Subsequent 1.x releases will maintain backward compatibility within the 1.x line.
