Computer use (desktop)
Drive the local desktop — screenshots, clicks, typing, key chords. Platform-aware backends for macOS (cliclick / osascript), Linux (xdotool / scrot), and Windows (PowerShell).
Claude-Desktop-style desktop automation. Capture screenshots, move the
cursor, click, type, press key chords — on macOS, Linux, or Windows.
When the target has no API, CLI, or web path, computer_use is the
tool of last resort.
TL;DR: ComputerUseTool().run(ctx, action="screenshot") writes a PNG to ~/.shipit_agent/computer_use/. Other actions (click, type, drag, key, scroll, wait) drive the active desktop.
Backends
| Platform | Primitives | Install |
|---|---|---|
| macOS | screencapture + cliclick (with AppleScript fallback) | brew install cliclick |
| Linux | scrot or import (ImageMagick) + xdotool | apt install scrot xdotool |
| Windows | PowerShell (System.Windows.Forms) | built-in |
Missing dependency? The tool returns an explicit install hint instead
of silently failing — Error: cliclick required for drag (install: brew install cliclick).
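A pre-flight check can surface the same kind of hint before an agent run starts. The following is a minimal sketch, assuming the binaries named in the Backends table are discovered on PATH; `missing_tools` is a hypothetical helper for illustration, not part of shipit_agent, and the tool's own detection logic may differ.

```python
import shutil
import sys


def missing_tools(platform=None, which=shutil.which):
    """Return {binary: install_hint} for backend binaries not found on PATH.

    Hypothetical helper: binary names and hints come from the Backends table.
    """
    platform = platform or sys.platform
    missing = {}
    if platform.startswith("darwin"):
        # screencapture ships with macOS; only cliclick needs installing.
        if which("cliclick") is None:
            missing["cliclick"] = "brew install cliclick"
    elif platform.startswith("linux"):
        if which("xdotool") is None:
            missing["xdotool"] = "apt install scrot xdotool"
        # Either scrot or ImageMagick's import can take the screenshot.
        if which("scrot") is None and which("import") is None:
            missing["scrot"] = "apt install scrot xdotool"
    # Windows relies on built-in PowerShell, so nothing is reported there.
    return missing


for name, hint in missing_tools().items():
    print(f"{name} not found (install: {hint})")
```

The `which` parameter is injectable so the check can be exercised without touching the real PATH.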
Actions
| Action | Required params | What it does |
|---|---|---|
| screenshot / capture | — | PNG to ~/.shipit_agent/computer_use/shot-<ts>.png |
| mouse_move / move | x, y | Move cursor |
| click | x, y | Click at a position; optional button (left/right/middle), double (bool) |
| drag | x, y, to_x, to_y | Press-move-release |
| type | text | Type a string |
| key | keys (e.g. "cmd+shift+4") | Press a key chord |
| scroll | x, y, dx, dy | Scroll at a position |
| wait | seconds | Sleep (max 60s) between actions for animations |
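When an LLM generates the action arguments, it can help to validate them against the table before calling the tool. This is a minimal sketch: `REQUIRED_PARAMS` and `check_action` are hypothetical names, the mapping is copied from the table above, and whether the real tool rejects or clamps a wait above 60 seconds is an assumption.

```python
# Required parameters per action, copied from the Actions table.
REQUIRED_PARAMS = {
    "screenshot": (),
    "capture": (),
    "mouse_move": ("x", "y"),
    "move": ("x", "y"),
    "click": ("x", "y"),
    "drag": ("x", "y", "to_x", "to_y"),
    "type": ("text",),
    "key": ("keys",),
    "scroll": ("x", "y", "dx", "dy"),
    "wait": ("seconds",),
}


def check_action(action, **params):
    """Return a list of problems; an empty list means the call looks well-formed."""
    if action not in REQUIRED_PARAMS:
        return [f"unknown action: {action}"]
    problems = [
        f"missing parameter: {name}"
        for name in REQUIRED_PARAMS[action]
        if name not in params
    ]
    # The table caps wait at 60s; flagging (rather than clamping) is an assumption.
    if action == "wait" and params.get("seconds", 0) > 60:
        problems.append("wait is capped at 60 seconds")
    return problems


print(check_action("drag", x=10, y=20))
# ['missing parameter: to_x', 'missing parameter: to_y']
```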
Mouse coordinates are absolute pixels measured from the top-left corner of the screen. On a Retina display, pass logical pixels (half the raw resolution).
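If a position is read off a full-resolution screenshot, it has to be scaled down before being passed as x and y. A minimal sketch, assuming the screenshot is captured at raw resolution and the display scale factor is 2; `to_logical` is a hypothetical helper, and other scale factors are possible on other displays.

```python
def to_logical(px_x, px_y, scale=2.0):
    """Map raw screenshot pixels to logical pixels by dividing by the display scale."""
    return round(px_x / scale), round(px_y / scale)


# A point at (2560, 1440) in a 2x screenshot is (1280, 720) in logical pixels.
print(to_logical(2560, 1440))  # (1280, 720)
```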
Quickstart
from shipit_agent.tools.computer_use import ComputerUseTool
from shipit_agent.tools.base import ToolContext
computer = ComputerUseTool()
ctx = ToolContext(prompt="demo")
# 1. Take a screenshot
out = computer.run(ctx, action="screenshot")
print(out.metadata["path"]) # /Users/you/.shipit_agent/computer_use/shot-…png
# 2. Click at coordinates
computer.run(ctx, action="click", x=640, y=480)
# 3. Type text
computer.run(ctx, action="type", text="hello from shipit_agent")
# 4. Press a chord
computer.run(ctx, action="key", keys="cmd+shift+4")
# 5. Scroll down a page
computer.run(ctx, action="key", keys="pagedown")
Recommended workflow
- Screenshot first. The agent doesn't know where to click until it sees the screen. Re-send the PNG through your LLM's vision API so the model can reason over pixels.
- Prefer keyboard shortcuts. Layout-independent, faster, more reliable than mouse clicks.
- Use computer_use as the LAST RESORT. For web tasks, reach for playwright_browser. For files, use read_file / write_file.
- Wrap long flows in an Autopilot so budgets and checkpoints protect against stuck GUI states.
Auto-screenshot mode
If you want every action to be followed by a fresh screenshot — handy
for verification chains — pass auto_screenshot_after_action=True:
computer = ComputerUseTool(auto_screenshot_after_action=True)
out = computer.run(ctx, action="click", x=100, y=200)
print(out.metadata["screenshot"]) # path to the fresh PNG
Using it with a specialist
The design-reviewer specialist pairs well with computer_use for UI
verification runs:
from shipit_agent import Agent
from shipit_agent.agents import AgentRegistry
from shipit_agent.tools.computer_use import ComputerUseTool
registry = AgentRegistry()
def_ = registry.get("design-reviewer")
ui_agent = Agent(
    llm=llm,  # an LLM client you have already configured
    prompt=def_.prompt,
    tools=[ComputerUseTool()],
    max_iterations=8,
    name=def_.name,
)
ui_agent.run("Open the Preferences pane, enable dark mode, screenshot the final state.")
Vision feedback: the screenshot's PNG bytes travel with the result
Every screenshot action now embeds the base64-encoded PNG in the
tool output's metadata so a vision-capable LLM can actually see what
got captured — not just read a file path.
out = computer.run(ctx, action="screenshot")
out.metadata["vision"] # True
out.metadata["media_type"] # "image/png"
out.metadata["image_base64"] # "iVBORw0KGgoAAAANS..." ← feed this into the next LLM call
out.metadata["path"] # "/Users/you/.shipit_agent/computer_use/shot-…png"
Opt out for large captures (saves tokens when you only need the path):
computer.run(ctx, action="screenshot", vision=False)
Guardrails:
- Images over 4 MB auto-skip the embed: metadata["vision"] == False, with a vision_skip_reason telling you why.
- Failures to read the PNG from disk (very rare) don't raise; they set vision=False and include the error in vision_skip_reason.
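The two guardrails can be pictured as a single read-and-encode step that reports problems in the metadata rather than raising. This is a minimal sketch of that pattern, not the tool's actual code: `embed_png` and `MAX_EMBED_BYTES` are hypothetical names, and reading "4 MB" as 4 * 1024 * 1024 bytes is an assumption.

```python
import base64
import os
import tempfile

MAX_EMBED_BYTES = 4 * 1024 * 1024  # assumed reading of "4 MB"


def embed_png(path, max_bytes=MAX_EMBED_BYTES):
    """Build vision metadata for a PNG, skipping the embed instead of raising."""
    meta = {"path": path, "media_type": "image/png", "vision": False}
    try:
        size = os.path.getsize(path)
        if size > max_bytes:
            meta["vision_skip_reason"] = (
                f"image is {size} bytes, over the {max_bytes}-byte limit"
            )
            return meta
        with open(path, "rb") as fh:
            meta["image_base64"] = base64.b64encode(fh.read()).decode("ascii")
        meta["vision"] = True
    except OSError as exc:
        # Unreadable or missing file: record the error, never raise.
        meta["vision_skip_reason"] = f"could not read image: {exc}"
    return meta


# Demo with a tiny stand-in file (just the 8-byte PNG signature).
with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as tmp:
    tmp.write(b"\x89PNG\r\n\x1a\n")
small = embed_png(tmp.name)
too_big = embed_png(tmp.name, max_bytes=4)
missing = embed_png(tmp.name + ".missing")
os.unlink(tmp.name)
print(small["vision"], too_big["vision"], missing["vision"])  # True False False
```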
This lands the missing piece for real GUI automation: before 1.0.6, an agent captured screens it couldn't interpret. Now it can.
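To hand the capture to a vision model, the metadata fields can be wrapped in whatever image format your LLM provider expects. The sketch below assumes the base64 image content block used by the Anthropic Messages API; `to_image_block` is a hypothetical helper, and other providers use different shapes.

```python
def to_image_block(metadata):
    """Turn screenshot metadata into a base64 image content block.

    Returns None when no image was embedded (vision is False or absent).
    """
    if not metadata.get("vision"):
        return None
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": metadata["media_type"],
            "data": metadata["image_base64"],
        },
    }


# Stand-in metadata with the same keys the screenshot action documents.
example = {"vision": True, "media_type": "image/png", "image_base64": "iVBORw0KGgo="}
print(to_image_block(example)["source"]["media_type"])  # image/png
```

Checking the vision flag first means a skipped embed falls back to the file path.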
Output directory
- Default: ~/.shipit_agent/computer_use/
- Override: ComputerUseTool(output_dir="/tmp/shots")
- One PNG per screenshot action; never deletes.
The directory is created on first use.
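Because screenshots are never deleted, long-running agents may want their own cleanup. A minimal sketch, assuming files follow the shot-<ts>.png naming shown in the Actions table; `prune_shots` is a hypothetical helper and is not provided by shipit_agent.

```python
from pathlib import Path


def prune_shots(directory, keep=50):
    """Delete all but the `keep` most recently modified shot-*.png files.

    Returns the list of removed paths.
    """
    shots = sorted(
        Path(directory).expanduser().glob("shot-*.png"),
        key=lambda p: p.stat().st_mtime,
        reverse=True,  # newest first
    )
    removed = shots[keep:]
    for path in removed:
        path.unlink()
    return removed
```

For example, prune_shots("~/.shipit_agent/computer_use", keep=100) would keep the 100 newest captures in the default directory. Ordering by modification time avoids guessing the timestamp format in the file name.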
Notebook
- notebooks/42_power_tools_computer_use_hubspot_research.ipynb: computer_use alongside the other power tools.
- notebooks/44_complete_tour.ipynb: cross-tool example with Autopilot.