Roadmap

Name: SHIPIT Agent
Author: SHIPIT

The shipit_agent roadmap after v1.0.6 — evaluation harness, semantic skill retrieval, MCP auto-discovery, distributed worker pool, voice mode, agent marketplace, and fine-tune-friendly trace export. Each item with scope, estimated LOC, design notes, and tradeoffs.

7 min read

12 sections

Edit this page

What's next for shipit_agent. Each item below has concrete scope, an estimated LOC budget, a sketch of the public API, and the honest tradeoff a reader should know about before picking it up.

The items are grouped by tier the same way the in-session gap audit used:

Tier 2 — differentiation. High daily-impact features that most production users will reach for once they exist.
Tier 3 — bigger projects. Nice-to-have or niche; worth waiting for real demand before building.

Nothing here is committed. This document is a live record of what has been discussed but not yet built, so contributors can pick up any item without having to reconstruct the context.

Where v1.0.6 ended

Recap of what shipped: the Autopilot runtime, fan-out, reflection critic, artifacts, scheduler daemon, 47 role specialists, three power tools, plus the v1.0.6 extras — CostRouter, non-blocking ask_user_async, vision feedback on computer_use, Docker-sandboxed code_execution, and specialist tool presets upgraded to include run_code + ask_user_async. 863 tests. See the changelog for the full list.

Tier 2 — differentiation

E · Evaluation harness (`shipit bench`)

Why it's worth doing. Every agent change — a new prompt, a bigger model, a revised tool — is evaluated on vibes today. An eval harness turns "did this change break the agent?" into a number that CI can enforce.

Public surface (sketch)

python

from shipit_agent.bench import GoalSuite, BenchCase, run_bench

suite = GoalSuite(
    cases=[BenchCase(
            id="python-gil-explainer",
            goal=Goal(
                objective="Explain the Python GIL with a runnable snippet.",
                success_criteria=["Two paragraphs", "A snippet that shows GIL behavior"],
            ),
            expected_status="completed",
            max_seconds=180,
        ),
        # ... 30–50 cases covering your real workloads],
)

report = run_bench(
    suite,
    llm=llm,
    autopilot_kwargs=dict(critic=True, artifacts=True),
    max_parallel=8,
)
# report.pass_rate → 0.92
# report.diff_against("baseline-v1.0.6")  →  per-case regression list

Design notes

A case passes when AutopilotResult.status in ("completed", "partial") AND every criterion in criteria_met is True AND the critic's confidence (when configured) meets the suite's threshold.
Suite is versioned. Each run_bench writes a .json results file with {suite_version, commit_sha, total, passed, per_case}. diff_against(...) loads a prior run and returns the set of cases that flipped pass→fail.
Parallelism via Autopilot.fanout under the hood — the harness is itself a use case for fan-out, which keeps the runtime honest.
CLI: shipit bench --suite path/to/suite.json + shipit bench diff <a> <b>.

Estimated scope. ~400 LOC across shipit_agent/bench/{__init__.py, suite.py, runner.py, diff.py}. ~25 tests (pure-stub LLM, deterministic). One new notebook + a bench.md doc page.

Tradeoff to know. A bench that runs full Autopilot loops costs real money per CI invocation. Recommend ship it with an optional --cheap mode that forces the classifier-based CostRouter to always-easy, so cases validate plumbing without burning budget. Separate full-fidelity nightly runs on the real tiers.

F · Semantic skill retrieval

Why. 79 skills currently live in .shipit/skills/ and are matched by keyword lookup. The right skill for a given prompt is often not the one with the best keyword overlap. Embedding-based retrieval should triple the right-skill-surfaced rate on ambiguous prompts.

Public surface (sketch)

python

from shipit_agent.skills import SemanticSkillIndex

index = SemanticSkillIndex.build_from_dir(
    ".shipit/skills/",
    embedder=HashingEmbedder(dimension=512),   # same pluggable interface Super RAG uses
)
relevant = index.search("debug why the flake rate spiked", top_k=3)
# relevant == [#   Skill(id="root-cause-analysis", score=0.84, ...),
#   Skill(id="flake-triage", score=0.79, ...),
#   Skill(id="bisect-recipe", score=0.66, ...),
#]

# Autopilot integration — one kwarg.
autopilot = Autopilot(
    llm=llm, goal=Goal(...),
    skill_index=index,
    skill_top_k=3,             # inject the top-3 matched skill prompts each iteration
)

Design notes

Reuse the Super RAG Embedder / VectorStore / Reranker protocols that already ship. Zero new dependencies in the default path (the hashing embedder is deterministic and cheap).
Index lives on disk as a single .shipit/skills/index.json — rebuilt automatically when any <skill>.md changes (hash check on startup).
Agent injection: the top-k skills' prompt sections get appended to the system prompt with HTML markers — same pattern the skills pipeline uses today.

Estimated scope. ~350 LOC. ~20 tests. One notebook. Possibly a CLI shipit skills search "<query>" for debugging.

Tradeoff. A heavier embedder (sentence-transformers) gives much better retrieval but adds a 120 MB install footprint. Default should remain the hashing embedder; the stronger model is opt-in.

H · MCP auto-discovery

Why. Every new shipit_agent user spends 10 minutes wiring the same three MCP servers — @modelcontextprotocol/server-playwright, server-filesystem, server-github. Auto-discovery would detect which of those are installed on the machine and connect them on startup.

Public surface (sketch)

python

from shipit_agent.mcp import autodiscover

mcps = autodiscover(
    # Optional allowlist — only attach these if they're installed.
    prefer=["playwright", "filesystem", "github"],
    # Optional blocklist — skip even if installed.
    skip=["slack"],
)
agent = Agent.with_builtins(llm=llm, mcps=mcps)

Design notes

Probe via npx -p <pkg> --help / which <binary> — detect without invoking. Cache the detection result per session.
Emit a mcp.autodiscovered event per attached server so the live renderer can surface "attached GitHub MCP from /usr/local/bin".
Respect .shipit/mcp.yml when present — explicit config always wins.

Estimated scope. ~200 LOC. Tests stubbed against a fake subprocess. Minor doc page update (mcp/auto-discovery.md).

Tradeoff. Auto-attaching anything found on PATH surprises users. Ship this behind an opt-in Agent.with_builtins(mcps="auto") — never default-on.

Tier 3 — bigger projects

I · Distributed worker pool

Why. At some scale a single host isn't enough — 200-PR reviews, multi-tenant fleets, teams running overnight autopilots. A Redis- or NATS-backed scheduler daemon that farms goals across multiple worker machines addresses that.

Sketch. SchedulerDaemon(broker="redis://..."), workers register under a consumer group, goal-queue entries are claimed atomically. Results surface to the originating host via pub/sub.

Scope. ~1500 LOC (broker adapters, claim protocol, heartbeat, fail-over). Real infra tests. Likely two new doc pages.

Recommendation. Premature for most users. Revisit after we have three real-world users who hit the single-host ceiling.

J · Voice mode

Why. Conversational agents. Whisper for input, ElevenLabs (or Apple's built-in TTS) for output.

Sketch. shipit chat --voice opens an audio loop. Each turn is transcribed → sent to the Autopilot → response streamed through TTS with sentence-level chunking.

Scope. ~600 LOC. External service optional; --voice falls back to text-only when the dependencies aren't installed.

Tradeoff. Voice is genuinely cool but is a rabbit hole (latency, interruption handling, turn-taking heuristics). Only build if there's concrete user demand.

K · Agent marketplace bundle format

Why. Today a specialist lives in agents/agents.json, a skill in .shipit/skills/, a custom tool in a Python file. A user who wants to share all three as one artifact has to do it manually. .shipitpack would bundle them.

Sketch.

bash

my-bundle.shipitpack/
├── manifest.yml              # name, version, author, license
├── agents/<id>.json          # optional specialist definitions
├── skills/<id>.md            # optional skill files
├── tools/<id>/*.py           # optional Python tool modules (single file)
└── README.md

bash

shipit pack install ./my-bundle.shipitpack
shipit pack publish ./my-bundle.shipitpack --to https://pack.shipit.dev/

Scope. ~500 LOC across shipit_agent/pack/. Optional registry server is a separate slice.

Tradeoff. Tool modules are Python — installing a pack runs arbitrary code. Must ship with signature verification + an opt-in --trust <author> flag, and a clear "unsigned" warning. Skipping those shortcuts a security hole.

L · Finetune-friendly trace export

Why. Trace data from real Autopilot runs is gold for SFT / DPO datasets. Today it's locked in the run's checkpoints + events; exporting into a standard format opens the door for anyone fine-tuning smaller models on the same tasks.

Sketch.

python

from shipit_agent.trace_export import export_sft

export_sft(
    runs_dir="~/.shipit_agent/checkpoints",
    output="training-data.jsonl",
    format="chatml",                    # or "alpaca" / "sharegpt"
    min_status="completed",             # filter out halted / failed runs
)
# → JSONL file: one row per completed turn, with prompt + chosen answer.

Mirror for DPO format — pair accepted answer (the real run's output) with rejected answer (the critic's worst-scoring iteration).

Scope. ~400 LOC. Some nuance in stripping tool-specific metadata that shouldn't leak into training data (secrets, file paths).

Tradeoff. Publishing training data is a legal surface — mention that upfront. Ship with a --redact flag that scrubs file paths, env vars, and credentials by default.

Guiding principles for anything new

Composability over novelty. Every v1.0.6 feature is a drop-in — Autopilot(critic=..., artifacts=..., tools=[...]) works because each subsystem stays independent. New features must hold that line.
Budget-safe by default. Every new piece that spends tokens must plug into BudgetPolicy and SpendReport so 24h runs don't surprise anyone.
Observable by default. If it happens at runtime, emit an autopilot.* stream event. Don't add a subsystem that only shows its work in logs.
≤300 lines per file. Slice by responsibility until every file reads in one sitting. The whole codebase follows this rule.
No silent imports. Every new LLM adapter, tool, specialist, skill is opt-in. The base install never downloads a model or imports an SDK a user didn't ask for.

Decision log — features discussed and NOT chosen

Feature	Why skipped
In-library vision inference (run a local VLM to interpret screenshots)	Adds a model dependency + multi-GB install. Better to keep vision as base64 handoff to whatever LLM the user already picked.
Single-file `Agent.chat()` wrapping the REPL + Autopilot	The REPL (`shipit chat`) and Autopilot target different use cases (interactive vs unattended). Conflating them would muddle both.
Automatic tool generation from OpenAPI specs	Tried as a prototype; the generated tools were too generic to be useful without hand-tuning. Better invested in richer `connector_base`.
Real-time collaborative runs (two users editing the same run)	No user has asked for it. Revisit only with a concrete use case.

How to contribute

Pick an item, open an issue with the sketch filled in.
Branch naming: roadmap/<id>-<short-name>, e.g. roadmap/E-bench-harness.
One PR per tier item; keep within the LOC estimate or split.
Every new feature ships with: tests, at least one notebook cell, and a doc page that matches the existing section style.

Tier 2 — differentiation

E · Evaluation harness (shipit bench)

F · Semantic skill retrieval

H · MCP auto-discovery

Tier 3 — bigger projects

I · Distributed worker pool

J · Voice mode

K · Agent marketplace bundle format

L · Finetune-friendly trace export

Guiding principles for anything new

Decision log — features discussed and NOT chosen

How to contribute

E · Evaluation harness (`shipit bench`)