Roadmap
The shipit_agent roadmap after v1.0.6 — evaluation harness, semantic skill retrieval, MCP auto-discovery, distributed worker pool, voice mode, agent marketplace, and fine-tune-friendly trace export. Each item with scope, estimated LOC, design notes, and tradeoffs.
What's next for shipit_agent. Each item below has concrete scope, an estimated LOC budget, a sketch of the public API, and the honest tradeoff a reader should know about before picking it up.
The items are grouped by tier, the same way the in-session gap audit grouped them:
- Tier 2 — differentiation. High daily-impact features that most production users will reach for once they exist.
- Tier 3 — bigger projects. Nice-to-have or niche; worth waiting for real demand before building.
Nothing here is committed. This document is a live record of what has been discussed but not yet built, so contributors can pick up any item without having to reconstruct the context.
Where v1.0.6 ended
Recap of what shipped: the Autopilot runtime, fan-out, reflection critic,
artifacts, scheduler daemon, 47 role specialists, three power tools, plus
the v1.0.6 extras — CostRouter, non-blocking ask_user_async, vision
feedback on computer_use, Docker-sandboxed code_execution, and
specialist tool presets upgraded to include run_code + ask_user_async.
863 tests. See the changelog for the full list.
Tier 2 — differentiation
E · Evaluation harness (shipit bench)
Why it's worth doing. Every agent change — a new prompt, a bigger model, a revised tool — is evaluated on vibes today. An eval harness turns "did this change break the agent?" into a number that CI can enforce.
Public surface (sketch)

    from shipit_agent.bench import GoalSuite, BenchCase, run_bench

    suite = GoalSuite(
        cases=[
            BenchCase(
                id="python-gil-explainer",
                goal=Goal(
                    objective="Explain the Python GIL with a runnable snippet.",
                    success_criteria=["Two paragraphs", "A snippet that shows GIL behavior"],
                ),
                expected_status="completed",
                max_seconds=180,
            ),
            # ... 30–50 cases covering your real workloads
        ],
    )

    report = run_bench(
        suite,
        llm=llm,
        autopilot_kwargs=dict(critic=True, artifacts=True),
        max_parallel=8,
    )
    # report.pass_rate → 0.92
    # report.diff_against("baseline-v1.0.6") → per-case regression list

Design notes
- A case passes when AutopilotResult.status in ("completed", "partial") AND every criterion in criteria_met is True AND the critic's confidence (when configured) meets the suite's threshold.
- Suite is versioned. Each run_bench writes a .json results file with {suite_version, commit_sha, total, passed, per_case}. diff_against(...) loads a prior run and returns the set of cases that flipped pass→fail.
- Parallelism via Autopilot.fanout under the hood. The harness is itself a use case for fan-out, which keeps the runtime honest.
- CLI: shipit bench --suite path/to/suite.json + shipit bench diff <a> <b>.
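The pass rule and the pass→fail diff from the design notes can be sketched in a few lines. This is a minimal illustration, not the harness itself: CaseOutcome, case_passes, flipped_to_fail, and the 0.7 default threshold are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set

PASSING_STATUSES = ("completed", "partial")


@dataclass
class CaseOutcome:
    """Hypothetical per-case record; field names are illustrative only."""
    status: str
    criteria_met: Dict[str, bool]
    critic_confidence: Optional[float] = None  # None when no critic is configured


def case_passes(outcome: CaseOutcome, confidence_threshold: float = 0.7) -> bool:
    """Apply the three-part pass rule: status, criteria, critic confidence."""
    if outcome.status not in PASSING_STATUSES:
        return False
    if not all(outcome.criteria_met.values()):
        return False
    # The confidence gate only applies when a critic produced a score.
    if outcome.critic_confidence is not None:
        return outcome.critic_confidence >= confidence_threshold
    return True


def flipped_to_fail(baseline: Dict[str, bool], current: Dict[str, bool]) -> Set[str]:
    """Return ids of cases that passed in the baseline run and fail now.

    Cases missing from the current run are ignored rather than counted as failures.
    """
    return {
        case_id
        for case_id, passed_before in baseline.items()
        if passed_before and current.get(case_id) is False
    }
```

Keeping both functions pure (no LLM, no I/O) is what would let the ~25 stub-LLM tests stay deterministic.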
Estimated scope. ~400 LOC across shipit_agent/bench/{__init__.py, suite.py, runner.py, diff.py}. ~25 tests (pure-stub LLM, deterministic). One new
notebook + a bench.md doc page.
Tradeoff to know. A bench that runs full Autopilot loops costs real money
per CI invocation. Recommend shipping it with an optional --cheap mode that
forces the classifier-based CostRouter to always-easy, so cases validate
plumbing without burning budget. Run the full-fidelity suite separately, as
a nightly job on the real tiers.
F · Semantic skill retrieval
Why. 79 skills currently live in .shipit/skills/ and are matched by
keyword lookup. The right skill for a given prompt is often not the one with
the best keyword overlap. Embedding-based retrieval should triple the
right-skill-surfaced rate on ambiguous prompts.
Public surface (sketch)

    from shipit_agent.skills import SemanticSkillIndex

    index = SemanticSkillIndex.build_from_dir(
        ".shipit/skills/",
        embedder=HashingEmbedder(dimension=512),  # same pluggable interface Super RAG uses
    )

    relevant = index.search("debug why the flake rate spiked", top_k=3)
    # relevant == [
    #     Skill(id="root-cause-analysis", score=0.84, ...),
    #     Skill(id="flake-triage", score=0.79, ...),
    #     Skill(id="bisect-recipe", score=0.66, ...),
    # ]

    # Autopilot integration: one kwarg.
    autopilot = Autopilot(
        llm=llm, goal=Goal(...),
        skill_index=index,
        skill_top_k=3,  # inject the top-3 matched skill prompts each iteration
    )

Design notes
- Reuse the Super RAG Embedder / VectorStore / Reranker protocols that already ship. Zero new dependencies in the default path (the hashing embedder is deterministic and cheap).
- Index lives on disk as a single .shipit/skills/index.json, rebuilt automatically when any <skill>.md changes (hash check on startup).
- Agent injection: the top-k skills' prompt sections get appended to the system prompt with HTML markers, the same pattern the skills pipeline uses today.
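To make the "deterministic and cheap" default concrete, here is a minimal sketch of a hashing embedder with cosine-similarity search, using only the standard library. It is not the shipped HashingEmbedder: hash_embed and search are hypothetical stand-ins, and the skill texts are passed as a plain dict rather than read from .shipit/skills/.

```python
import hashlib
import math
import re
from typing import Dict, List, Tuple


def hash_embed(text: str, dimension: int = 512) -> List[float]:
    """Feature-hash lowercase word tokens into a fixed-size, unit-length vector."""
    vector = [0.0] * dimension
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dimension
        vector[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector


def search(query: str, skills: Dict[str, str], top_k: int = 3) -> List[Tuple[str, float]]:
    """Rank skills by cosine similarity between the query and each skill text."""
    query_vec = hash_embed(query)
    scored = []
    for skill_id, body in skills.items():
        skill_vec = hash_embed(body)
        # Both vectors are unit length, so the dot product is the cosine.
        score = sum(q * s for q, s in zip(query_vec, skill_vec))
        scored.append((skill_id, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

A hashing embedder like this still matches on shared tokens, so it exercises the index plumbing more than it demonstrates semantic recall. Retrieval of paraphrased prompts is where an opt-in learned embedder would matter.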
Estimated scope. ~350 LOC. ~20 tests. One notebook. Possibly a CLI
shipit skills search "<query>" for debugging.
Tradeoff. A heavier embedder (sentence-transformers) gives much better retrieval but adds a 120 MB install footprint. Default should remain the hashing embedder; the stronger model is opt-in.
H · MCP auto-discovery
Why. Every new shipit_agent user spends 10 minutes wiring the same three
MCP servers — @modelcontextprotocol/server-playwright,
server-filesystem, server-github. Auto-discovery would detect which of
those are installed on the machine and connect them on startup.
Public surface (sketch)

    from shipit_agent.mcp import autodiscover

    mcps = autodiscover(
        # Optional allowlist: only attach these if they're installed.
        prefer=["playwright", "filesystem", "github"],
        # Optional blocklist: skip even if installed.
        skip=["slack"],
    )
    agent = Agent.with_builtins(llm=llm, mcps=mcps)

Design notes
- Probe via npx -p <pkg> --help / which <binary> to detect a server without starting it. Cache the detection result per session.
- Emit an mcp.autodiscovered event per attached server so the live renderer can surface "attached GitHub MCP from /usr/local/bin".
- Respect .shipit/mcp.yml when present. Explicit config always wins.
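The probe order above can be sketched with a PATH lookup alone. This is an assumption-heavy illustration: detect_servers and the KNOWN_BINARIES mapping are hypothetical, the binary names are guesses at what each package installs, and "explicit config wins" is read here as "skip probing entirely when config is present".

```python
import shutil
from typing import Callable, Dict, Iterable, Optional

# Hypothetical mapping from short server name to the binary looked up on PATH.
KNOWN_BINARIES = {
    "playwright": "mcp-server-playwright",
    "filesystem": "mcp-server-filesystem",
    "github": "mcp-server-github",
}


def detect_servers(
    prefer: Optional[Iterable[str]] = None,
    skip: Iterable[str] = (),
    explicit_config: Optional[Dict[str, str]] = None,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Dict[str, str]:
    """Return {server name: resolved path} without starting any server."""
    # Assumed reading of "explicit config always wins": no probing at all.
    if explicit_config is not None:
        return dict(explicit_config)

    names = list(prefer) if prefer is not None else list(KNOWN_BINARIES)
    skipped = set(skip)
    found: Dict[str, str] = {}
    for name in names:
        if name in skipped or name not in KNOWN_BINARIES:
            continue
        path = which(KNOWN_BINARIES[name])  # PATH lookup only, nothing is executed
        if path:
            found[name] = path
    return found
```

Injecting the which callable is what lets the tests run against a fake lookup instead of the real machine.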
Estimated scope. ~200 LOC. Tests stubbed against a fake subprocess.
Minor doc page update (mcp/auto-discovery.md).
Tradeoff. Auto-attaching anything found on PATH surprises users. Ship
this behind an opt-in Agent.with_builtins(mcps="auto") — never default-on.
Tier 3 — bigger projects
I · Distributed worker pool
Why. At some scale a single host isn't enough — 200-PR reviews, multi-tenant fleets, teams running overnight autopilots. A Redis- or NATS-backed scheduler daemon that farms goals across multiple worker machines addresses that.
Sketch. SchedulerDaemon(broker="redis://..."), workers register under
a consumer group, goal-queue entries are claimed atomically. Results
surface to the originating host via pub/sub.
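The claim, heartbeat, and fail-over behaviour can be prototyped in-process before any broker is chosen. The sketch below is a stand-in, not a Redis or NATS adapter: InMemoryGoalQueue is a hypothetical class, a lock plays the role of the broker's atomic claim, and the 30-second lease is an arbitrary default.

```python
import threading
import time
from typing import Dict, List, Optional, Tuple


class InMemoryGoalQueue:
    """Stand-in for a broker: a lock makes each claim atomic."""

    def __init__(self, lease_seconds: float = 30.0) -> None:
        self._lock = threading.Lock()
        self._pending: List[str] = []
        # goal_id -> (worker_id, lease deadline)
        self._claimed: Dict[str, Tuple[str, float]] = {}
        self._lease_seconds = lease_seconds

    def submit(self, goal_id: str) -> None:
        with self._lock:
            self._pending.append(goal_id)

    def claim(self, worker_id: str, now: Optional[float] = None) -> Optional[str]:
        """Hand the oldest pending goal to exactly one worker, or return None."""
        now = time.monotonic() if now is None else now
        with self._lock:
            self._requeue_expired(now)
            if not self._pending:
                return None
            goal_id = self._pending.pop(0)
            self._claimed[goal_id] = (worker_id, now + self._lease_seconds)
            return goal_id

    def heartbeat(self, worker_id: str, goal_id: str, now: Optional[float] = None) -> bool:
        """Extend the lease; returns False if the worker no longer owns the goal."""
        now = time.monotonic() if now is None else now
        with self._lock:
            self._requeue_expired(now)
            owner = self._claimed.get(goal_id)
            if owner is None or owner[0] != worker_id:
                return False
            self._claimed[goal_id] = (worker_id, now + self._lease_seconds)
            return True

    def complete(self, worker_id: str, goal_id: str) -> bool:
        """Remove a finished goal; only the current owner may complete it."""
        with self._lock:
            owner = self._claimed.get(goal_id)
            if owner is None or owner[0] != worker_id:
                return False
            del self._claimed[goal_id]
            return True

    def _requeue_expired(self, now: float) -> None:
        # Fail-over: a goal whose worker stopped heartbeating goes back in the queue.
        for goal_id, (_worker, deadline) in list(self._claimed.items()):
            if deadline <= now:
                del self._claimed[goal_id]
                self._pending.append(goal_id)
```

Passing now explicitly keeps the lease logic testable without sleeping. A real adapter would replace the lock with the broker's own atomic primitive.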
Scope. ~1500 LOC (broker adapters, claim protocol, heartbeat, fail-over). Real infra tests. Likely two new doc pages.
Recommendation. Premature for most users. Revisit after we have three real-world users who hit the single-host ceiling.
J · Voice mode
Why. Conversational agents. Whisper for input, ElevenLabs (or Apple's built-in TTS) for output.
Sketch. shipit chat --voice opens an audio loop. Each turn is
transcribed → sent to the Autopilot → response streamed through TTS with
sentence-level chunking.
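Sentence-level chunking is the one piece of the loop that needs no external service, so it is easy to sketch. The version below assumes the response arrives as an iterable of text pieces and treats ".", "!" or "?" followed by whitespace as a sentence end. sentence_chunks is a hypothetical helper, and the rule will mis-split abbreviations such as "e.g. ".

```python
import re
from typing import Iterable, Iterator

# A sentence ends at ., ! or ? followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentence_chunks(token_stream: Iterable[str]) -> Iterator[str]:
    """Buffer streamed text and yield each sentence as soon as it is complete."""
    buffer = ""
    for piece in token_stream:
        buffer += piece
        parts = _SENTENCE_END.split(buffer)
        # Everything except the last part is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()  # flush the trailing fragment at end of stream
```

Each yielded sentence can be handed to TTS while the model is still generating the next one, which is where the latency win comes from.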
Scope. ~600 LOC. External service optional; --voice falls back to
text-only when the dependencies aren't installed.
Tradeoff. Voice is genuinely cool but is a rabbit hole (latency, interruption handling, turn-taking heuristics). Only build if there's concrete user demand.
K · Agent marketplace bundle format
Why. Today a specialist lives in agents/agents.json, a skill in
.shipit/skills/, a custom tool in a Python file. A user who wants to
share all three as one artifact has to do it manually. .shipitpack would
bundle them.
Sketch.

    my-bundle.shipitpack/
    ├── manifest.yml          # name, version, author, license
    ├── agents/<id>.json      # optional specialist definitions
    ├── skills/<id>.md        # optional skill files
    ├── tools/<id>/*.py       # optional Python tool modules (single file)
    └── README.md

    shipit pack install ./my-bundle.shipitpack
    shipit pack publish ./my-bundle.shipitpack --to https://pack.shipit.dev/

Scope. ~500 LOC across shipit_agent/pack/. Optional registry server
is a separate slice.
Tradeoff. Tool modules are Python, so installing a pack runs arbitrary
code. Must ship with signature verification + an opt-in --trust <author>
flag, and a clear "unsigned" warning. Skipping any of those opens a
security hole.
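One way to surface that risk before anything is imported is to inspect the bundle layout first and derive warnings from it. The sketch below assumes the directory layout shown in the bundle tree. PackContents, inspect_pack, and install_warnings are hypothetical names, and signature checking itself is left as a boolean input because the signing scheme is not specified.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import List


@dataclass
class PackContents:
    has_manifest: bool
    agents: List[str] = field(default_factory=list)
    skills: List[str] = field(default_factory=list)
    tools: List[str] = field(default_factory=list)

    @property
    def runs_code(self) -> bool:
        # Only tool modules are executable Python; agents and skills are data.
        return bool(self.tools)


def inspect_pack(root: Path) -> PackContents:
    """List what a bundle directory contains without importing anything from it."""
    root = Path(root)
    return PackContents(
        has_manifest=(root / "manifest.yml").is_file(),
        agents=sorted(p.stem for p in root.glob("agents/*.json")),
        skills=sorted(p.stem for p in root.glob("skills/*.md")),
        tools=sorted({p.parent.name for p in root.glob("tools/*/*.py")}),
    )


def install_warnings(contents: PackContents, signed: bool, author_trusted: bool) -> List[str]:
    """Collect the warnings an installer would show before proceeding."""
    warnings: List[str] = []
    if not contents.has_manifest:
        warnings.append("missing manifest.yml")
    if contents.runs_code and not signed:
        warnings.append("unsigned pack contains Python tool modules")
    if contents.runs_code and not author_trusted:
        warnings.append("author is not trusted; pass --trust <author> to allow tool modules")
    return warnings
```

Whether a warning blocks the install or only prompts the user is a policy decision this sketch deliberately leaves open.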
L · Fine-tune-friendly trace export
Why. Trace data from real Autopilot runs is gold for SFT / DPO datasets. Today it's locked in the run's checkpoints + events; exporting into a standard format opens the door for anyone fine-tuning smaller models on the same tasks.
Sketch.

    from shipit_agent.trace_export import export_sft

    export_sft(
        runs_dir="~/.shipit_agent/checkpoints",
        output="training-data.jsonl",
        format="chatml",         # or "alpaca" / "sharegpt"
        min_status="completed",  # filter out halted / failed runs
    )
    # → JSONL file: one row per completed turn, with prompt + chosen answer.

Mirror for DPO format: pair the accepted answer (the real run's output) with a rejected answer (the critic's worst-scoring iteration).
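The DPO pairing rule is small enough to sketch directly. The example assumes each run exposes its iterations in order as dicts with answer and critic_score keys, and emits the common prompt / chosen / rejected row shape. build_dpo_row and to_jsonl are hypothetical helpers, not part of trace_export.

```python
import json
from typing import Dict, List, Optional


def build_dpo_row(prompt: str, iterations: List[Dict]) -> Optional[Dict[str, str]]:
    """Pair the run's final answer with its worst-scoring earlier iteration.

    Each iteration is assumed to look like {"answer": str, "critic_score": float},
    in the order the run produced them; the last one is the accepted output.
    """
    if len(iterations) < 2:
        return None  # nothing to contrast against
    chosen = iterations[-1]
    rejected = min(iterations[:-1], key=lambda item: item["critic_score"])
    if rejected["answer"] == chosen["answer"]:
        return None  # an identical pair carries no preference signal
    return {
        "prompt": prompt,
        "chosen": chosen["answer"],
        "rejected": rejected["answer"],
    }


def to_jsonl(rows: List[Dict[str, str]]) -> str:
    """Serialise rows as one JSON object per line."""
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in rows)
```

Runs that finished in a single iteration produce no pair, so a DPO export will usually be smaller than the SFT export from the same checkpoints.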
Scope. ~400 LOC. Some nuance in stripping tool-specific metadata that shouldn't leak into training data (secrets, file paths).
Tradeoff. Publishing training data is a legal surface, so mention that
upfront. Ship with a --redact flag, on by default, that scrubs file paths,
env vars, and credentials.
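A first pass at redaction can be pattern-based. The sketch below is illustrative and far from exhaustive: the three patterns (secret-looking KEY=value assignments, Bearer tokens, and paths under /home or /Users) and the placeholder strings are assumptions, and redact is a hypothetical helper.

```python
import re

_PATTERNS = [
    # KEY=value assignments whose key looks secret-bearing.
    (
        re.compile(r"\b([A-Z][A-Z0-9_]*(?:KEY|TOKEN|SECRET|PASSWORD)[A-Z0-9_]*)=\S+"),
        r"\1=<REDACTED>",
    ),
    # Bearer tokens in HTTP headers.
    (re.compile(r"\bBearer\s+[A-Za-z0-9._\-]+"), "Bearer <REDACTED>"),
    # Absolute paths under a user's home directory.
    (re.compile(r"(?:/home|/Users)/[^\s\"']+"), "<PATH>"),
]


def redact(text: str) -> str:
    """Apply each scrub pattern in order and return the cleaned text."""
    for pattern, replacement in _PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

Pattern lists like this miss anything they were not written for, so a real export would want an allowlist of fields as well as a scrubber.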
Guiding principles for anything new
- Composability over novelty. Every v1.0.6 feature is a drop-in: Autopilot(critic=..., artifacts=..., tools=[...]) works because each subsystem stays independent. New features must hold that line.
- Budget-safe by default. Every new piece that spends tokens must plug into BudgetPolicy and SpendReport so 24h runs don't surprise anyone.
- Observable by default. If it happens at runtime, emit an autopilot.* stream event. Don't add a subsystem that only shows its work in logs.
- ≤300 lines per file. Slice by responsibility until every file reads in one sitting. The whole codebase follows this rule.
- No silent imports. Every new LLM adapter, tool, specialist, skill is opt-in. The base install never downloads a model or imports an SDK a user didn't ask for.
Decision log — features discussed and NOT chosen
| Feature | Why skipped |
|---|---|
| In-library vision inference (run a local VLM to interpret screenshots) | Adds a model dependency + multi-GB install. Better to keep vision as base64 handoff to whatever LLM the user already picked. |
| Single-file Agent.chat() wrapping the REPL + Autopilot | The REPL (shipit chat) and Autopilot target different use cases (interactive vs unattended). Conflating them would muddle both. |
| Automatic tool generation from OpenAPI specs | Tried as a prototype; the generated tools were too generic to be useful without hand-tuning. Better invested in richer connector_base. |
| Real-time collaborative runs (two users editing the same run) | No user has asked for it. Revisit only with a concrete use case. |
How to contribute
- Pick an item, open an issue with the sketch filled in.
- Branch naming: roadmap/<id>-<short-name>, e.g. roadmap/E-bench-harness.
- One PR per tier item; keep within the LOC estimate or split.
- Every new feature ships with: tests, at least one notebook cell, and a doc page that matches the existing section style.