PDF extraction

Extract text, per-page content, or metadata from a PDF (local path or URL). Lazy pypdf import, page-range syntax, char budget for extract_text.

Pull structured text out of PDFs — contracts, research papers, reports, invoices — without shelling out to an external service. Works on local paths and http(s):// URLs; the URL branch fetches into memory with a 30 s default timeout. pypdf is imported lazily, so the base install stays slim and the tool reports pypdf_missing if the extra wasn't installed.

TL;DR: PDFTool().run(ctx, action="extract_text", source="./contract.pdf", pages="1-5") returns up to 50 000 characters of joined page text in output.text, with metadata.truncated set when the cap was hit.

When to use

  • Contract / legal review: extract_text on a range of pages, then feed the result into a summariser.
  • Research / Q&A with citations: extract_pages keeps each page's text addressable by page number.
  • Invoice / receipt ingestion: metadata for title / author / dates, extract_pages for line items.
  • Cheap preflight before committing tokens to a 400-page PDF: page_count answers "how long is this?" in one call.

Setup

PDFTool is not a connector — no credentials. The only prerequisite is pypdf, which is gated behind an extra:

bash
pip install 'shipit-agent[pdf]'        # installs pypdf

python
from shipit_agent.tools.pdf import PDFTool

pdf = PDFTool(
    max_chars_default=50_000,   # cap for extract_text
    timeout_seconds=30,         # URL fetch timeout
)

If you call pdf.run() before installing the extra, the tool returns a pypdf_missing error with the install hint — it never imports the dependency at module load time.
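
The lazy-import behaviour can be pictured with a small sketch. This is not the tool's actual code: the helper name load_optional and the shape of the returned error dict are assumptions made for illustration. Only the pypdf_missing code and the install hint come from this page.

```python
import importlib


def load_optional(module_name, install_hint):
    """Import a module on first use and return (module, error) instead of raising.

    Hypothetical helper: the real tool's internals and error shape may differ.
    """
    try:
        return importlib.import_module(module_name), None
    except ImportError:
        error = {
            "error": f"{module_name}_missing",
            "message": f"Install it with: {install_hint}",
        }
        return None, error


# Nothing is imported until this call runs, so importing the enclosing
# module cannot fail just because the optional extra is absent.
module, error = load_optional("pypdf", "pip install 'shipit-agent[pdf]'")
if error is not None:
    print(error["error"], "-", error["message"])
```

Deferring the import to call time is what lets the base install stay slim: the cost of the missing dependency is a structured error at the point of use rather than an ImportError at startup.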

Quick example

python
from shipit_agent.tools.base import ToolContext
from shipit_agent.tools.pdf import PDFTool

pdf = PDFTool()
ctx = ToolContext(prompt="demo")

# Extract the first 5 pages, capped at 10k chars
out = pdf.run(ctx,
              action="extract_text",
              source="./paper.pdf",
              pages="1-5",
              max_chars=10_000)
print(out.text[:500])
print(out.metadata["truncated"], out.metadata["char_count"])

Page range syntax (pages parameter)

text
"1-3,5,7-9"      → pages 1, 2, 3, 5, 7, 8, 9
"1,3,5"          → pages 1, 3, 5
"2-4"            → pages 2, 3, 4
""   or omitted  → all pages

Out-of-range pages are silently dropped once pypdf has reported the true page count. Inverted ranges ("9-5") are auto-swapped. Whitespace is tolerated around numbers and commas.
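
The rules above are compact enough to sketch as a standalone parser. The function name parse_page_spec is hypothetical, and returning a sorted, de-duplicated list is an assumption: this page does not say how the tool orders or de-duplicates overlapping ranges, nor how it reacts to non-numeric input (this sketch lets int() raise ValueError).

```python
def parse_page_spec(spec, page_count):
    """Expand a spec such as "1-3,5" into sorted, 1-indexed page numbers."""
    if not spec or not spec.strip():
        return list(range(1, page_count + 1))  # empty spec means all pages

    selected = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)  # int() ignores spaces
            if start > end:
                start, end = end, start  # auto-swap inverted ranges
        else:
            start = end = int(part)
        # Clamp to the real page count so out-of-range pages are dropped.
        for page in range(max(start, 1), min(end, page_count) + 1):
            selected.add(page)
    return sorted(selected)


print(parse_page_spec("1-3,5,7-9", 10))  # [1, 2, 3, 5, 7, 8, 9]
print(parse_page_spec("9-5", 6))         # [5, 6]
print(parse_page_spec(" 2 , 4 ", 3))     # [2]
```

Note that clamping needs the true page count, which is why out-of-range pages can only be dropped after the PDF has been parsed.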

As a tool handed to an agent

python
from shipit_agent import Agent
from shipit_agent.llms import OpenAIChatLLM
from shipit_agent.tools.pdf import PDFTool

agent = Agent(
    llm=OpenAIChatLLM(model="gpt-4o-mini"),
    tools=[PDFTool()],
    prompt=(
        "You are a contract reviewer. Use the pdf tool to extract the "
        "relevant pages, then summarise obligations, termination "
        "clauses, and payment terms."
    ),
)
agent.run("Summarise the termination clauses in ./nda.pdf (pages 3-7).")

Actions

The action parameter is a single enum with four values, one per action described below.

extract_text

Required: source. Optional: pages (page range spec), max_chars (defaults to PDFTool.max_chars_default, i.e. 50 000).

Returns joined text ("\n\n" between pages) in output.text. Truncation appends "\n…(truncated)" and sets metadata.truncated = True. Set max_chars=0 to disable the cap.

python
pdf.run(ctx, action="extract_text",
        source="https://example.com/whitepaper.pdf",
        pages="1-10", max_chars=20_000)
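
The joining and truncation rules for extract_text can be sketched in a few lines. The function name join_and_truncate is hypothetical, and one detail is an assumption: here the marker is appended after cutting to max_chars, so a truncated result is slightly longer than the cap. The tool may count the marker differently.

```python
def join_and_truncate(page_texts, max_chars=50_000):
    """Join page texts with blank lines and apply an optional character cap.

    Returns (text, truncated). A max_chars of 0 disables the cap.
    """
    text = "\n\n".join(page_texts)
    truncated = max_chars > 0 and len(text) > max_chars
    if truncated:
        text = text[:max_chars] + "\n…(truncated)"
    return text, truncated


text, truncated = join_and_truncate(["page one", "page two"], max_chars=10)
print(repr(text), truncated)

text, truncated = join_and_truncate(["page one", "page two"], max_chars=0)
print(repr(text), truncated)
```

Because the cap applies to the joined string, a small max_chars can cut a page mid-sentence. When page boundaries matter, extract_pages is the better fit.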

extract_pages

Required: source. Optional: pages. Returns a structured metadata.pages list of {"page": <1-indexed>, "text": "..."} dicts — use this when citations need to point at a specific page number. Per-page extraction errors don't fail the action; the failing page gets an "error" key in its entry and an empty "text".
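
As an illustration of why the per-page shape helps with citations, the sketch below searches a pages list for a keyword and reports the page numbers. The sample data and the function name pages_mentioning are made up for the example; only the {"page", "text", "error"} entry shape comes from the description above.

```python
def pages_mentioning(pages, keyword):
    """Return the 1-indexed page numbers whose text contains the keyword."""
    keyword = keyword.lower()
    hits = []
    for entry in pages:
        if "error" in entry:
            continue  # failed pages carry an "error" key and empty text
        if keyword in entry["text"].lower():
            hits.append(entry["page"])
    return hits


# Hypothetical stand-in for the metadata.pages list of a real call.
sample_pages = [
    {"page": 3, "text": "Either party may terminate with 30 days notice."},
    {"page": 4, "text": "", "error": "could not decode page"},
    {"page": 5, "text": "Termination for cause takes effect immediately."},
]

print(pages_mentioning(sample_pages, "terminat"))  # [3, 5]
```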

metadata

Required: source. Returns title / author / subject / creator / producer / creation_date / modification_date / page_count. Keys are null when the PDF doesn't set them. The tool reads both raw PDF keys ("/Title") and their attribute aliases.

python
pdf.run(ctx, action="metadata", source="./report.pdf")

page_count

Required: source. Cheapest possible action — parses the PDF and returns "N pages." plus metadata.page_count.
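
One way to use the page count as a preflight is to split a long document into page-range specs and extract them one batch at a time. The helper below is a hypothetical sketch that only builds the spec strings; the batch size is an arbitrary choice, not a recommendation from the tool.

```python
def batch_page_specs(page_count, batch_size):
    """Split 1..page_count into page-range specs of at most batch_size pages."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    specs = []
    for start in range(1, page_count + 1, batch_size):
        end = min(start + batch_size - 1, page_count)
        specs.append(f"{start}-{end}" if end > start else str(start))
    return specs


print(batch_page_specs(400, 150))  # ['1-150', '151-300', '301-400']
print(batch_page_specs(5, 2))      # ['1-2', '3-4', '5']
```

Each returned string can be passed as the pages argument of extract_text or extract_pages.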

source resolution

  • http://... or https://... URL: fetched via urllib.request with timeout_seconds (default 30); OSError, URLError, and TimeoutError are surfaced as url_fetch_failed.
  • Local path: ~ is expanded; the path must exist and be a regular file.
  • Anything else is rejected as file_not_found.

Downloads happen into memory — don't use this tool to process a 500 MB PDF; route those through an object-store pipeline instead.
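
The resolution order can be sketched as a small classifier. This is an approximation under stated assumptions, not the tool's implementation: the function name classify_source and the "url" / "local_file" return labels are invented here, and the real tool goes on to fetch or read the bytes rather than stopping at classification.

```python
import tempfile
from pathlib import Path


def classify_source(source):
    """Classify a source string the way the resolution rules describe."""
    if not source or not source.strip():
        return "missing_source"
    source = source.strip()
    if source.lower().startswith(("http://", "https://")):
        return "url"
    path = Path(source).expanduser()  # expands a leading ~
    if path.is_file():                # must exist and be a regular file
        return "local_file"
    return "file_not_found"


print(classify_source("https://example.com/whitepaper.pdf"))  # url
print(classify_source("/no/such/dir/report.pdf"))             # file_not_found

with tempfile.NamedTemporaryFile(suffix=".pdf") as handle:
    print(classify_source(handle.name))                       # local_file
```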

Error shapes

| error= | Meaning | What to do |
|---|---|---|
| unknown_action | action= not in the enum | Use extract_text / extract_pages / metadata / page_count. |
| missing_source | source= was empty | Provide a path or URL. |
| pypdf_missing | pypdf isn't installed | pip install 'shipit-agent[pdf]'. |
| url_fetch_failed | HTTP / URL / timeout error fetching a remote PDF | metadata.message has the underlying error. |
| file_not_found | Local path didn't exist or wasn't a file | Check the path. |
| file_read_failed | OSError reading local bytes | Permissions / disk issue. |
| pdf_parse_error | pypdf.PdfReader(stream) raised | The file is corrupt or not actually a PDF. metadata.message has the underlying error. |
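
A caller can turn these codes into user-facing hints with a lookup table. The sketch below assumes the code is exposed as metadata["error"]. That key is a guess, since this page only confirms metadata.message, so check it against the tool's actual output before relying on it. The function name explain_error is hypothetical.

```python
# Remediation hints, taken from the error descriptions above.
REMEDIES = {
    "unknown_action": "Use extract_text, extract_pages, metadata, or page_count.",
    "missing_source": "Provide a path or URL.",
    "pypdf_missing": "Run: pip install 'shipit-agent[pdf]'",
    "url_fetch_failed": "Check the URL and network, then retry.",
    "file_not_found": "Check the path.",
    "file_read_failed": "Check permissions and disk health.",
    "pdf_parse_error": "The file is corrupt or not actually a PDF.",
}


def explain_error(metadata):
    """Return a one-line explanation, or None when no error is present.

    Assumes the error code lives under metadata["error"] (unverified).
    """
    code = metadata.get("error")
    if code is None:
        return None
    hint = REMEDIES.get(code, "Unrecognised error code.")
    detail = metadata.get("message")
    return f"{code}: {hint}" + (f" ({detail})" if detail else "")


print(explain_error({"error": "url_fetch_failed", "message": "timed out"}))
print(explain_error({"page_count": 12}))
```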

See also

  • Tool catalog: every built-in tool.
  • vision: for PDFs that are scanned images rather than text; render pages as images and route through vision.
  • open_url: for fetching HTML; pdf is the right tool only when you want parsed PDF text.
  • Specialists: researcher / legal review roles that chain PDF extraction with summarisation.