Overview

The long-running, goal-driven, budget-gated runtime. Runs until every success criterion is met — or a budget trips. Streams every iteration live, checkpoints every step, resumes cleanly after a crash.


Autopilot is the long-running runtime that turns any shipit_agent Agent or GoalAgent into a Claude-Desktop-style autonomous runner: goal-driven, budget-gated, checkpointed, and streaming every iteration as it happens.

TL;DR: Autopilot(llm=llm, goal=Goal(...), budget=BudgetPolicy(...)).run(run_id="my-run") keeps the inner agent looping until every success criterion is satisfied or a budget trip halts it cleanly. Add critic=True, artifacts=True, or autopilot.fanout(...) to go further.


What Autopilot adds on top of GoalAgent

| Feature | GoalAgent | Autopilot |
| --- | --- | --- |
| Termination | Fixed max_steps (20 default) | Criteria satisfaction OR budget trip |
| Budget | Step count only | Wall-clock · tool calls · tokens · dollars · iterations |
| Crash safety | None | Atomic JSON checkpoint every iteration |
| Resume after crash | Not supported | autopilot.resume(run_id), which picks up where it left off |
| Telemetry | events list on result | autopilot.stream(), a live event iterator |
| Reflection | Manual | Optional built-in Critic loop |
| Deliverables | Single output blob | ArtifactCollector: code / markdown / files |
| Parallel batches | Manual threading | autopilot.fanout(items, template), built in |
| 24-hour resilience | Not designed for it | Cumulative budgets, SIGTERM handler, corrupt-checkpoint quarantine |

Bulletproof for 24-hour runs

Autopilot is designed for runs that outlive their host process. The stack survives crashes, VM restarts, systemd stop, and disk corruption:

  • Cumulative budgets across resume. Every field of BudgetUsage — seconds, tool calls, tokens, dollars, iterations — persists in the checkpoint. A run that crashes at hour 12 and resumes for another 12 hours trips its 24-hour cap exactly at hour 24, not hour 36.
  • Dollar tracking from provider usage. Token counts from the LLM response flow through the built-in pricing table (with Bedrock / LiteLLM prefix handling) so max_dollars actually fires instead of always reading $0.
  • Graceful SIGTERM / SIGHUP. systemd and launchd send SIGTERM on service stop. Autopilot catches it alongside SIGINT, halts at the next iteration boundary, saves a final checkpoint, and sets halt_reason="halted by SIGTERM".
  • Corrupt-checkpoint quarantine. A JSON parse error during load() renames the bad file to <run_id>.corrupted.<timestamp>.json instead of silently dropping it, so operators can inspect what went wrong.
  • First-iteration heartbeat. The run emits a heartbeat on iteration 1 even before the regular interval elapses, so a slow first step never looks like a hang.
  • remaining budget on every event. Streamed iteration and heartbeat events carry per-axis headroom so a UI can render ETA bars.
  • Thread-safe external stop. autopilot.request_stop(reason) halts the loop cleanly from a daemon, a UI, or a signal handler.
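
The cumulative-budget bullet above is easiest to see with numbers. The following is a minimal sketch, not shipit_agent's implementation: the `UsageSketch` class, its field names, and the JSON layout are assumptions made for this example. It only shows why persisting the usage counters in the checkpoint makes a resumed run trip at hour 24 rather than hour 36.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class UsageSketch:
    """Hypothetical stand-in for BudgetUsage; field names are assumptions."""
    seconds: float = 0.0
    tool_calls: int = 0
    iterations: int = 0

    def to_checkpoint(self) -> str:
        # Persist every counter so a resumed run starts from the old totals.
        return json.dumps(asdict(self))

    @classmethod
    def from_checkpoint(cls, raw: str) -> "UsageSketch":
        return cls(**json.loads(raw))


def seconds_left(usage: UsageSketch, max_seconds: float) -> float:
    """Headroom on the wall-clock axis, never negative."""
    return max(0.0, max_seconds - usage.seconds)


CAP = 24 * 3600  # a 24-hour wall-clock cap

# The first process runs for 12 hours, then crashes after checkpointing.
before_crash = UsageSketch(seconds=12 * 3600, tool_calls=400, iterations=90)
saved = before_crash.to_checkpoint()

# The resumed process restores the counters instead of starting from zero,
# so only the remaining 12 hours are available.
resumed = UsageSketch.from_checkpoint(saved)
hours_left = seconds_left(resumed, CAP) / 3600
print(hours_left)
```

If the counters were reset on resume, the same calculation would report a full 24 hours of headroom, which is the hour-36 failure the bullet describes.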

Test coverage for these scenarios lives in tests/test_autopilot_hardening.py and tests/test_autopilot_long_task.py, plus an opt-in soak gated on SHIPIT_AUTOPILOT_SOAK=<seconds> that drives a real Bedrock LLM for the requested duration.


Quick start

python
from shipit_agent import Autopilot, BudgetPolicy, Goal
from examples.run_multi_tool_agent import build_llm_from_env

llm = build_llm_from_env("bedrock")     # defaults to Llama 4 Scout

autopilot = Autopilot(
    llm=llm,
    goal=Goal(
        objective="Summarize three Python dict gotchas with runnable snippets.",
        success_criteria=[
            "At least 3 gotchas explained in prose",
            "Each gotcha has a Python snippet that reproduces it",
            "A one-line avoidance tip is shown for each",
        ],
    ),
    budget=BudgetPolicy(
        max_seconds=600,                 # 10 min cap
        max_tool_calls=20,
        max_iterations=8,
        max_tokens=300_000,
        max_dollars=2.0,
    ),
)

result = autopilot.run(run_id="py-dict-gotchas")
print(result.status)                    # "completed" | "partial" | "halted" | "failed"
print(result.iterations)
print(result.output[:500])

The BudgetPolicy

Every cap is independently honored. Set any axis to 0 or None to disable it.

| Cap | Default | When it matters |
| --- | --- | --- |
| max_seconds | 1800.0 (30 min) | Hard wall-clock ceiling; 24h runs must opt in by raising this. |
| max_tool_calls | 100 | Prevents tool-call storms (a common LLM failure mode). |
| max_tokens | 2_000_000 | Coarse token meter. ~$10 on Claude Sonnet 4.5. |
| max_dollars | 5.0 | Catches runaway cost. You'll want to override this per run. |
| max_iterations | 200 | Safety net against an infinite loop. |

python
# A conservative overnight run — 8h, ample tool budget, $20 cap.
BudgetPolicy(
    max_seconds=8 * 3600,
    max_tool_calls=2000,
    max_tokens=20_000_000,
    max_dollars=20.0,
    max_iterations=1000,
)

A budget trip produces status="halted" (no criterion verified) or status="partial" (some satisfied). The halt_reason field tells you which axis tripped.
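
To make "independently honored" and "0 or None disables" concrete, here is a small sketch of a per-axis check. It is an illustration only: `tripped_axis`, the plain-dict shapes, and the choice of `>=` as the trip condition are assumptions, not the library's internals.

```python
from typing import Optional


def tripped_axis(limits: dict, usage: dict) -> Optional[str]:
    """Return the first axis whose usage reached its cap, or None.

    A cap of 0 or None is treated as disabled, mirroring the rule above.
    """
    for axis, cap in limits.items():
        if not cap:  # 0 or None: this axis is switched off
            continue
        if usage.get(axis, 0) >= cap:  # assumed trip condition
            return axis
    return None


# Hypothetical numbers: the dollar cap is disabled, so only the
# tool-call axis can report a trip here.
limits = {"max_seconds": 600, "max_tool_calls": 20, "max_dollars": None}
usage = {"max_seconds": 212.4, "max_tool_calls": 20, "max_dollars": 9.99}

axis = tripped_axis(limits, usage)
print(axis)
```

Each axis is checked on its own, so a disabled cap never masks a trip on another axis.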


Status values

| Status | Meaning |
| --- | --- |
| completed | Every success criterion verified. |
| partial | Some criteria satisfied, then halted (usually budget). |
| halted | Budget tripped before any criterion was verified. |
| failed | Unhandled exception; the checkpoint captures the crash state. |
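
Read as a decision rule for a run that has already stopped, the four statuses can be sketched like this. The `classify` helper is hypothetical and exists only to restate the definitions above in code; Autopilot computes `result.status` itself.

```python
def classify(criteria_met: list, crashed: bool = False) -> str:
    """Map the final state of a stopped run onto one of the four statuses."""
    if crashed:
        return "failed"      # unhandled exception
    if criteria_met and all(criteria_met):
        return "completed"   # every criterion verified
    if any(criteria_met):
        return "partial"     # some verified, then the run halted
    return "halted"          # stopped before anything was verified


print(classify([True, True, True]))
print(classify([True, False, False]))
print(classify([False, False, False]))
print(classify([True, False], crashed=True))
```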

Live streaming

python
for event in autopilot.stream(run_id="py-dict-gotchas"):
    kind = event["kind"]
    if kind == "autopilot.iteration":
        print(f"iter {event['iteration']} — "
              f"{sum(event['criteria_met'])}/{len(event['criteria_met'])} criteria")
    elif kind == "autopilot.result":
        print(f"final: {event['status']}")

For the full list of event kinds, see Live streaming and the complete tour notebook 44_complete_tour.ipynb.
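
A consumer often wants a single rolling summary rather than one print per event. The sketch below folds events into a dict using only the keys that appear in the loop above (`kind`, `iteration`, `criteria_met`, `status`). The `summarize` helper and the sample events are hand-written assumptions, and `"autopilot.heartbeat"` is a placeholder kind name used here to show that unknown kinds are skipped.

```python
from typing import Iterable


def summarize(events: Iterable[dict]) -> dict:
    """Fold a stream of Autopilot-style events into one summary dict."""
    summary = {"iterations": 0, "criteria_met": 0, "criteria_total": 0, "status": None}
    for event in events:
        kind = event.get("kind")
        if kind == "autopilot.iteration":
            met = event["criteria_met"]
            summary["iterations"] = event["iteration"]
            summary["criteria_met"] = sum(met)
            summary["criteria_total"] = len(met)
        elif kind == "autopilot.result":
            summary["status"] = event["status"]
        # Any other kind is ignored on purpose.
    return summary


# Hand-written sample events; a real run would pass autopilot.stream(...) instead.
sample = [
    {"kind": "autopilot.iteration", "iteration": 1, "criteria_met": [True, False, False]},
    {"kind": "autopilot.heartbeat"},  # placeholder kind name (assumption)
    {"kind": "autopilot.iteration", "iteration": 2, "criteria_met": [True, True, True]},
    {"kind": "autopilot.result", "status": "completed"},
]
summary = summarize(sample)
print(summary)
```

Because the function accepts any iterable of dicts, it can be unit-tested with fixtures like `sample` without calling an LLM.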


Persistence & resume

Autopilot writes an atomic JSON checkpoint after every iteration to ~/.shipit_agent/checkpoints/<run_id>.json. The next call with the same run_id refuses to overwrite unless resume=True is passed:

python
# First run — tiny budget trips, halts partial.
first = Autopilot(..., budget=BudgetPolicy(max_iterations=1)).run(run_id="r1")

# Restart with a bigger budget — picks up at iteration 2.
second = Autopilot(..., budget=BudgetPolicy(max_iterations=6)).resume("r1")
assert second.iterations > first.iterations

A crash during save cannot corrupt the checkpoint — we write to a .tmp sibling and replace() atomically.
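
The write-then-replace pattern, together with the quarantine-on-load behaviour described earlier, can be sketched with the standard library alone. This is an illustration under assumptions, not the shipit_agent source: `save_checkpoint` and `load_checkpoint` are hypothetical names, and the integer timestamp format is a guess at what `<timestamp>` looks like.

```python
import json
import os
import tempfile
import time
from pathlib import Path
from typing import Optional


def save_checkpoint(path: Path, state: dict) -> None:
    """Write to a .tmp sibling, then swap it into place in one step."""
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(state))
    os.replace(tmp, path)  # atomic when both paths are on the same filesystem


def load_checkpoint(path: Path) -> Optional[dict]:
    """Return the saved state, or None after quarantining an unreadable file."""
    try:
        return json.loads(path.read_text())
    except json.JSONDecodeError:
        stamp = int(time.time())  # assumed timestamp format
        quarantined = path.with_name(f"{path.stem}.corrupted.{stamp}.json")
        os.replace(path, quarantined)  # keep the bad file for inspection
        return None


with tempfile.TemporaryDirectory() as root:
    ckpt = Path(root) / "r1.json"

    save_checkpoint(ckpt, {"iteration": 3})
    restored = load_checkpoint(ckpt)

    ckpt.write_text("{not valid json")  # simulate on-disk corruption
    after_corruption = load_checkpoint(ckpt)
    leftovers = sorted(p.name for p in Path(root).iterdir())

print(restored, after_corruption, leftovers)
```

A reader of the checkpoint sees either the old complete file or the new complete file, never a half-written one, because the rename is the only step that touches the final path. A production version would likely also flush and fsync the temporary file before the rename.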


What's in this section

  • Critic — reflection loop; terminate the moment criteria are confidently met.
  • Artifacts — structured deliverables (code, markdown, files) collected during the run.
  • Fan-out — dispatch N children in parallel with a budget-scaled slice each.
  • Scheduler daemon — persistent goal queue for 24h operation.
  • CLI: shipit autopilot / shipit queue / shipit daemon reference.
  • Specialists — the 47 prebuilt role agents you can hand to any Autopilot.

Full runnable demo — notebook 37

Every code block on this page is exercised in notebooks/37_autopilot_quickstart.ipynb against Bedrock Llama 4. Open it when you want to see output, not just snippets.