Punk Docs - Onboarding Guide

This is the extended guide for taking Punk from evaluation to first production traffic. If you only need a fast local walkthrough, start with Punk in 30 Minutes. Use this guide when you are planning a real pilot, a customer proof of value, or a team rollout.

Punk works best when the work is operational: repeated, evidence-bearing, policy-sensitive, or expensive enough that routing decisions matter. It is not primarily a generic prompt benchmark for one-off creative drafts or trivial questions.

Hosted reference: cheaperfastersafer.com. Local default: http://localhost:4100.

Who This Is For

Reader	What this guide helps you finish
Evaluator	Pick the right workflow, run a credible local proof, and know what evidence to inspect.
App developer	Route an existing OpenAI- or Anthropic-compatible app through Punk without rewriting the app.
Workflow builder	Convert repeated agent work into a workflow with inputs, gates, receipts, and replayable outputs.
Operator	Set up auth, storage, workers, provider keys, health checks, retention, and billing posture.
Security/GRC reviewer	Understand identity, side effects, approval gates, redaction, audit, and private-network controls.
Team admin	Invite users, issue scoped keys, and make the dashboard useful for a pilot team.

Outcomes

By the end of onboarding, you should have:

A local Punk gateway and dashboard.
One real workflow candidate selected for evaluation.
At least one observed agent/app routed through Punk.
Traceable runs with app, agent, and subject identity.
A governance posture for read-only, reversible, user-visible, and high-impact actions.
An optimization-evidence view that shows whether Punk found a stable pattern.
A decision on whether the pilot remains observe-only or starts optimized routing.
A production readiness checklist for storage, auth, workers, provider keys, and operations.

Phase 1: Choose The Right Work

Start with a workflow, not a random prompt. Punk proves value when a repeated job has structure that can be observed, governed, measured, and approved for optimization.

Good first candidates:

Candidate	Why it is a good fit
Support triage	Repeated classification, structured output, low-risk reads, clear evaluation criteria.
Vendor review	Web evidence, invoice or profile data, thresholds, approval gates, reusable scorecards.
Pricing monitor	Web reads, structured extraction, snapshots, diffs, repeatable schedule.
Lead enrichment	Web and CRM reads, field normalization, policy-controlled writes.
Compliance precheck	Evidence collection, deterministic gates, receipts, human approval before action.
Internal research brief	Repeatable source policy, citation/evidence burden, reusable templates.

Poor first candidates:

Candidate	Why to avoid it for the first proof
One-off creative writing	Subjective quality dominates; repeatability and route proof are weak.
Tiny factual questions	The baseline is already cheap and fast; savings will be uninteresting.
Unbounded brainstorming	Hard to define correctness, side effects, or replay evidence.
Fully manual workflows	Punk needs agent/app traffic to observe and improve.
High-impact writes on day one	Start in observe mode until policy and approval paths are clear.

Use the public Workflow Diagnostic before a pilot call or scoping session. It compares a standard serial agent loop with Punk workflow mode across repeatability, evidence burden, side-effect risk, governance gates, receipts, review value, cost, and latency. Treat it as a workflow-fit diagnostic, not a generic leaderboard.

Phase 2: Define The Pilot Contract

Before running traffic, write down the pilot contract in plain language.

Question	Example answer
What job are we evaluating?	"Review a new vendor and invoice, check the vendor site, flag spend over $5,000, and prepare a scorecard."
What input shape repeats?	Vendor URL, invoice PDF or extracted invoice fields, requester, department, spend amount.
What output shape matters?	JSON scorecard plus human-readable rationale and approval recommendation.
What evidence is required?	Vendor website snapshot, invoice fields, policy threshold, risk flags.
What actions are risky?	Emailing finance, creating a ticket, approving spend, storing vendor records.
What can be cached or promoted?	Stable extraction, policy threshold logic, scorecard shell, known vendor profile.
Who approves promotion?	Pilot operator or workflow owner.
What success metric matters?	Lower cost after proof, fewer hidden side effects, receipts for every action, less manual review.

Use consistent identifiers from the first run:

Identifier	Recommendation
Tenant	One tenant per company, team, or pilot customer.
App	Product surface or integration name, for example `finance-review-app`.
Agent	Stable actor name, for example `vendor-review-agent`.
Subject	The end user, account, customer, vendor, ticket, or workflow instance being acted on.

Punk uses these identifiers for trust, audit, policy, cost, pattern discovery, and routing. Missing identity makes the pilot harder to interpret.
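One way to keep identity consistent is to build the headers in a single place. The sketch below is illustrative only: `PunkIdentity` and `identityHeaders` are hypothetical names, not part of a Punk SDK, and the header names are the ones used by the curl examples in this guide.

```typescript
// Identity fields from the table above. Tenant is omitted because this guide
// shows no request header for it.
interface PunkIdentity {
  app: string; // product surface, e.g. "finance-review-app"
  agent: string; // stable actor name, e.g. "vendor-review-agent"
  subject: string; // entity being acted on, e.g. "vendor:acme-123"
}

// Hypothetical helper: returns the three identity headers and fails fast
// when any value is blank, so no request goes out without identity.
function identityHeaders(id: PunkIdentity): Record<string, string> {
  const pairs: Array<[string, string]> = [
    ["x-punk-app", id.app],
    ["x-punk-agent", id.agent],
    ["x-punk-subject", id.subject],
  ];
  const headers: Record<string, string> = {};
  for (const [name, value] of pairs) {
    const trimmed = value.trim();
    if (trimmed === "") {
      throw new Error(`Missing identity value for ${name}`);
    }
    headers[name] = trimmed;
  }
  return headers;
}

const vendorReviewHeaders = identityHeaders({
  app: "finance-review-app",
  agent: "vendor-review-agent",
  subject: "vendor:acme-123",
});
```

Failing on a blank value is a design choice for pilots: a rejected request is easier to notice than a run that cannot be attributed later.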

Phase 3: Run Punk Locally

Install dependencies and start the gateway:

bun install
bun run dev

Open http://localhost:4100.

Default local behavior:

Area	Default
Port	`4100`
Database	`data/punk.db`
Provider	offline mock when no matching live provider key is configured
Auth	open dev mode when `PUNK_API_KEY` is unset
Worker	embedded in the API process
Learning	background tick enabled
Dashboard	served at `/`
Docs	served at `/docs`

If the dashboard is blank, use the Getting started panel to seed demo data. For repeatable optimization traffic, keep the gateway running and use another terminal:

bun run demo

Inspect the dashboard after the first demo run:

Dashboard area	What to verify
Overview	Recent activity, route mix, spend, savings, health.
Runs	Every model request has a route, cost, latency, trace, and explanation.
Patterns	Repeated request shapes are grouped.
Artifacts	Candidate optimized routes show evidence, promotion, and rollback state.
Learning	Evidence notes explain what is eligible, blocked, or waiting for more samples.
Web	Compact page snapshots show structured page state and token savings.
Governance	Policies, users, keys, credentials, MCP servers, audit, approvals.
Workflows	Templates, graph editor, run panel, node timelines.

Local success criteria:

You can load /, /docs, and /health.
A chat, workflow, or demo run appears in Runs.
A repeated request can be recognized as a pattern.
The run detail explains route choice and cost.
The learning page explains whether a candidate is eligible or why it needs more evidence.
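The first criterion can be scripted. This is a minimal sketch, assuming only that the three paths respond with a 2xx status when healthy; `smokeCheckUrls`, `runSmokeCheck`, and `FetchLike` are hypothetical names introduced here, and the fetch function is passed in so the sketch does not depend on a particular runtime.

```typescript
// Minimal shape of the fetch result this sketch relies on.
type FetchLike = (url: string) => Promise<{ ok: boolean; status: number }>;

// Paths named in the local success criteria.
const SMOKE_PATHS = ["/", "/docs", "/health"];

// Joins the base URL and each path, dropping trailing slashes on the base.
function smokeCheckUrls(baseUrl: string): string[] {
  const base = baseUrl.replace(/\/+$/, "");
  return SMOKE_PATHS.map((path) => base + path);
}

// Requests each URL in turn and returns a description of every failure.
// An empty array means all three pages loaded.
async function runSmokeCheck(baseUrl: string, fetchFn: FetchLike): Promise<string[]> {
  const failures: string[] = [];
  for (const url of smokeCheckUrls(baseUrl)) {
    try {
      const res = await fetchFn(url);
      if (!res.ok) {
        failures.push(`${url} -> HTTP ${res.status}`);
      }
    } catch (err) {
      failures.push(`${url} -> ${String(err)}`);
    }
  }
  return failures;
}

// Example call in a runtime with a global fetch (such as Bun):
//   runSmokeCheck("http://localhost:4100", fetch).then((f) => console.log(f));
```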

Phase 4: Connect One Real App In Observe Mode

The simplest integration is a base URL swap. Keep your existing OpenAI-compatible request shape, point it at Punk, and send identity headers.

curl http://localhost:4100/v1/chat/completions \
  -H 'content-type: application/json' \
  -H 'x-punk-app: finance-review-app' \
  -H 'x-punk-agent: vendor-review-agent' \
  -H 'x-punk-subject: vendor:acme-123' \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [
      {
        "role": "user",
        "content": "Review this vendor profile and return risk, rationale, and next action."
      }
    ]
  }'

For Anthropic-compatible apps, use the native Messages endpoint:

curl http://localhost:4100/v1/messages \
  -H 'content-type: application/json' \
  -H 'x-punk-app: finance-review-app' \
  -H 'x-punk-agent: vendor-review-agent' \
  -H 'x-punk-subject: vendor:acme-123' \
  -d '{
    "model": "claude-haiku-4-5",
    "max_tokens": 512,
    "messages": [
      {
        "role": "user",
        "content": "Review this vendor profile and return risk, rationale, and next action."
      }
    ]
  }'

If you set PUNK_API_KEY, include bearer auth:

-H 'authorization: Bearer <token>'
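The same OpenAI-compatible request can be assembled in application code. The sketch below mirrors the curl example and adds the bearer header only when a token is supplied; `buildChatRequest` and `PlainRequest` are hypothetical names, and the function returns plain data rather than sending anything.

```typescript
// Plain description of an HTTP request, independent of any HTTP client.
interface PlainRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Hypothetical helper: builds the chat completions request from the curl
// example. The token is optional because open dev mode needs no auth.
function buildChatRequest(baseUrl: string, prompt: string, token?: string): PlainRequest {
  const headers: Record<string, string> = {
    "content-type": "application/json",
    "x-punk-app": "finance-review-app",
    "x-punk-agent": "vendor-review-agent",
    "x-punk-subject": "vendor:acme-123",
  };
  if (token) {
    headers["authorization"] = `Bearer ${token}`;
  }
  return {
    url: `${baseUrl.replace(/\/+$/, "")}/v1/chat/completions`,
    method: "POST",
    headers,
    body: JSON.stringify({
      model: "gpt-4o-mini",
      messages: [{ role: "user", content: prompt }],
    }),
  };
}

// To send it with a global fetch:
//   const req = buildChatRequest("http://localhost:4100", "Review this vendor profile.");
//   fetch(req.url, { method: req.method, headers: req.headers, body: req.body });
```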

Start in observe mode for consequential work. Observe mode records what policy would do without blocking live work. Use optimize mode only after the trace, governance, and promotion paths are understood.

What to inspect after the first real app run:

Evidence	Where
Model provider and key source	Run detail trace and route explanation.
Prompt shape and token use	Run detail.
App, agent, subject	Run detail and governance records.
Policy verdict	Run detail and Governance audit.
Cost and latency	Runs table and Overview.
Repeated pattern	Patterns and Learning after several similar requests.

Phase 5: Add Provider Keys Deliberately

With no live provider key, Punk uses the deterministic mock provider for local work. For live calls, configure platform keys or tenant BYOK.

Platform env keys:

Provider	Env vars
OpenAI	`OPENAI_API_KEY`, optional `OPENAI_BASE_URL`
Anthropic	`ANTHROPIC_API_KEY`, optional `ANTHROPIC_BASE_URL`
OpenRouter	`OPENROUTER_API_KEY`, optional `OPENROUTER_BASE_URL`
DeepSeek	`DEEPSEEK_API_KEY`, optional `DEEPSEEK_BASE_URL`
Kimi/Moonshot	`MOONSHOT_API_KEY` or `KIMI_API_KEY`, optional base URL

Tenant BYOK stores a tenant-owned provider key in the encrypted credentials vault. Set PUNK_ENCRYPTION_KEY before relying on stored credentials outside local dev.

Example tenant key:

curl -X POST http://localhost:4100/api/v1/credentials \
  -H 'content-type: application/json' \
  -H 'authorization: Bearer <token>' \
  -d '{
    "name": "openai",
    "provider": "openai",
    "secret": { "value": "sk-..." }
  }'

Read Configuration before mixing platform keys and tenant keys in production.

Phase 6: Classify Tools And Side Effects

Punk can govern direct model calls, SDK tool traces, workflow tool nodes, web sessions, and webhook effects. The important first step is classifying side effects.

Level	Meaning	Pilot posture
0	Pure computation	Safe to observe and optimize early.
1	Read-only external	Good first pilot scope.
2	Reversible or idempotent write	Require identity, idempotency, and audit.
3	User-visible write	Start observe-only; add approval rules before optimize.
4	High-impact write	Require explicit policy, approval, and rollback plan.

Undeclared SDK tools default to side-effect level 3. That is intentional: unknown tools are treated like user-visible writes.
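The classification and its default can be modeled in a few lines. This is an illustration of the table above, not the SDK's actual types: `SideEffectLevel`, `declaredTools`, `levelFor`, and the tool names are all hypothetical.

```typescript
// The five side-effect levels from the table above.
type SideEffectLevel = 0 | 1 | 2 | 3 | 4;

// Pilot posture per level, copied from the table.
const PILOT_POSTURE: Record<SideEffectLevel, string> = {
  0: "Safe to observe and optimize early.",
  1: "Good first pilot scope.",
  2: "Require identity, idempotency, and audit.",
  3: "Start observe-only; add approval rules before optimize.",
  4: "Require explicit policy, approval, and rollback plan.",
};

// Undeclared tools are treated like user-visible writes.
const UNDECLARED_TOOL_LEVEL: SideEffectLevel = 3;

// Hypothetical tool registry for a vendor-review pilot.
const declaredTools = new Map<string, SideEffectLevel>([
  ["lookup_vendor_site", 1],
  ["upsert_vendor_record", 2],
  ["email_finance", 3],
]);

// Returns the declared level, or level 3 when the tool was never declared.
function levelFor(toolName: string): SideEffectLevel {
  return declaredTools.get(toolName) ?? UNDECLARED_TOOL_LEVEL;
}
```

Here `levelFor("approve_spend")` returns 3 because that tool is missing from the registry, which is the conservative default the paragraph above describes.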

For app code, use the TypeScript SDK when you need tool tracing, feedback, web fetch, or web sessions. Keep the first integration small:

Route model calls through Punk.
Add identity headers.
Add tool tracing around one or two important tools.
Mark tool side-effect levels.
Verify trace and governance events.
Expand only after the first path is observable.

Read SDK, API, and Governance for the exact client and HTTP surfaces.

Phase 7: Build The Workflow Version

Once the repeated job is visible, decide whether it should remain a chat/agent flow or become a workflow.

Surface	Use when
Gateway	You need a low-friction base URL swap for an existing agent.
Chat	A human is actively testing prompts and route behavior.
Agent	One scheduled or on-demand task can be represented as `start -> llm -> output`.
Workflow	The job has multiple steps, branches, tools, web reads, gates, or structured outputs.
Chorus	The job needs governed multi-model answers with evidence receipts.

Workflow design checklist:

Inputs are explicit JSON, not hidden in prose.
Every web or external read has a named step.
Risky actions are separate from reasoning steps.
Side effects have declared levels.
Gates are stated as policy or workflow conditions.
Outputs have a stable schema.
Receipts and evidence are preserved.
The workflow can be reviewed without firing real side effects.
The owner can explain what would be promoted and what must remain live.
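To make "explicit JSON inputs" and "stable output schema" concrete, here is a sketch for the vendor-review example from Phase 2. The field names, the `spend-over-threshold` flag, and both functions are hypothetical; only the $5,000 threshold comes from the pilot contract example.

```typescript
// Explicit workflow input: every field is named, nothing hides in prose.
interface VendorReviewInput {
  vendorUrl: string;
  requester: string;
  department: string;
  spendAmountUsd: number;
}

// Threshold taken from the pilot contract example.
const SPEND_THRESHOLD_USD = 5000;

// Validates untrusted JSON before the workflow runs.
function parseVendorReviewInput(raw: unknown): VendorReviewInput {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("input must be a JSON object");
  }
  const record = raw as Record<string, unknown>;
  const requireString = (key: string): string => {
    const value = record[key];
    if (typeof value !== "string" || value === "") {
      throw new Error(`${key} must be a non-empty string`);
    }
    return value;
  };
  const spend = record["spendAmountUsd"];
  if (typeof spend !== "number" || !isFinite(spend) || spend < 0) {
    throw new Error("spendAmountUsd must be a non-negative number");
  }
  return {
    vendorUrl: requireString("vendorUrl"),
    requester: requireString("requester"),
    department: requireString("department"),
    spendAmountUsd: spend,
  };
}

// Deterministic gate kept separate from model reasoning:
// flags spend strictly over the threshold.
function spendFlags(input: VendorReviewInput): string[] {
  return input.spendAmountUsd > SPEND_THRESHOLD_USD ? ["spend-over-threshold"] : [];
}
```

Keeping the threshold check as its own function means a reviewer can test the gate without calling a model or firing a side effect.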

Start from the dashboard templates:

Template	First use
`support-triage`	Ticket classification and conditional notification.
`web-research`	Web fetch plus model summary.
`pricing-monitor`	Scheduled web reads and structured extraction.

Run the workflow several times with similar inputs. Then inspect Runs, Patterns, Learning, and Artifacts to see whether Punk found a stable route.

Phase 8: Governance And Security Review

Do this before optimize mode or production exposure.

Access and identity:

Set PUNK_API_KEY for protected API and gateway routes.
Bootstrap dashboard users with PUNK_ADMIN_EMAIL, PUNK_ADMIN_PASSWORD, and optionally PUNK_REQUIRE_LOGIN=true.
Use tenant API keys for apps rather than sharing the bootstrap admin token.
Pin keys to app ids when possible.
Send X-Punk-App, X-Punk-Agent, and X-Punk-Subject.

Secrets and credentials:

Set PUNK_ENCRYPTION_KEY before storing provider keys, workflow credentials, or MCP credentials.
Store provider BYOK keys under Governance -> Provider keys or /api/v1/credentials.
Do not put secrets in prompts, workflow inputs, or trace-visible metadata.

Network controls:

Leave PUNK_ALLOW_PRIVATE_WEB_FETCH=false in authenticated deployments unless private fetches are intended.
Leave PUNK_ALLOW_PRIVATE_WEBHOOKS=false unless private webhook targets are intended.
Review web session and webhook destinations before enabling writes.

Policy and approvals:

Keep policies in PUNK_POLICIES_DIR.
Declare allow, deny, and approval-required rules for the pilot app and agent.
Require approval for side-effect levels 3 and 4 unless the workflow owner explicitly accepts the risk.
Use observe mode first to see what would have been blocked.
Review audit events and pending approvals in the dashboard.
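The interaction between the approval rule and observe mode can be sketched as follows. This models the behavior described in this guide, not Punk's policy engine; `evaluateGate` is hypothetical, and `enforce` is a placeholder name for whatever mode applies policy verdicts to live work.

```typescript
type Mode = "observe" | "enforce"; // "enforce" is a placeholder mode name
type Verdict = "allow" | "approval-required";

interface GateResult {
  verdict: Verdict; // what policy decided, always recorded
  blocked: boolean; // whether the live action is actually held
}

// Levels 3 and 4 need approval unless the workflow owner has explicitly
// accepted the risk. In observe mode the verdict is recorded but live
// work is never blocked.
function evaluateGate(level: number, mode: Mode, ownerAcceptedRisk = false): GateResult {
  const verdict: Verdict = level >= 3 && !ownerAcceptedRisk ? "approval-required" : "allow";
  return {
    verdict,
    blocked: mode === "enforce" && verdict === "approval-required",
  };
}
```

With this model, `evaluateGate(3, "observe")` yields an `approval-required` verdict with `blocked` set to false, which is the "see what would have been blocked" behavior.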

Data controls:

Choose a retention_days value.
Enable the tenant setting redaction=true when tool payloads may include sensitive fields.
Consider streaming_dlp=true if secrets or regulated identifiers could leave the gateway.
Use tripwires for sensitive decoy values that should never appear in prompts or outputs.

The security review is done when:

A non-admin app key exists.
The app key is scoped or pinned where possible.
Provider and tool credentials are encrypted.
Risky actions are gated.
Private-network escape hatches are intentionally set.
Retention and redaction are chosen.
The operator can find audit records for a run, approval, and policy decision.

Phase 9: Evidence And Promotion

Punk does not promote a cheaper route just because it was cheaper once. The promotion loop is evidence-driven:

Observe repeated request shapes.
Group stable traffic into patterns.
Prepare candidate optimized routes only when the task is stable enough.
Check candidates against relevant history.
Compare candidates against live traffic without firing side effects.
Require policy or human approval when configured.
Route matching future traffic through the cheapest safe proven path.

Force a learning pass during a pilot:

curl -X POST http://localhost:4100/api/v1/learning/tick \
  -H 'authorization: Bearer <token>'

Evidence to look for:

Evidence	Why it matters
Pattern confidence	Shows whether Punk sees stable repeated work.
Evidence notes	Explain why a candidate is eligible, blocked, or waiting for more samples.
History check	Shows how the candidate performed against relevant prior work.
Live comparison	Compares candidate behavior against live traffic without firing effects.
Artifact receipt	Records what was promoted, by whom, and with what evidence.
Route explanation	Shows why a future request used live, cache, semantic cache, or artifact.

Promotion readiness checklist:

The pattern represents real repeated work, not test noise.
Outputs have an objective or reviewable contract.
History checks passed against enough relevant work.
Live comparison did not create hidden side effects.
Governance allows the promoted route.
A human owner understands rollback.
Canary mode is enabled if the first production rollout should be gradual.
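A pilot team may want to record the checklist as data so that a promotion review produces an explicit list of blockers. The sketch below assumes a human fills in each field after reviewing the dashboard; the `PromotionEvidence` shape and `promotionBlockers` are hypothetical, and canary mode is left out because the checklist treats it as optional.

```typescript
// One boolean per mandatory checklist item, filled in by the reviewer.
interface PromotionEvidence {
  realRepeatedWork: boolean;
  reviewableOutputContract: boolean;
  historyChecksPassed: boolean;
  liveComparisonSideEffectFree: boolean;
  governanceAllows: boolean;
  ownerUnderstandsRollback: boolean;
}

// Message to report when the matching item is not satisfied.
const PROMOTION_CHECKS: Array<[keyof PromotionEvidence, string]> = [
  ["realRepeatedWork", "Pattern is not confirmed as real repeated work."],
  ["reviewableOutputContract", "Outputs lack an objective or reviewable contract."],
  ["historyChecksPassed", "History checks have not passed against enough relevant work."],
  ["liveComparisonSideEffectFree", "Live comparison may have created hidden side effects."],
  ["governanceAllows", "Governance does not allow the promoted route."],
  ["ownerUnderstandsRollback", "No human owner has confirmed the rollback path."],
];

// Returns every unmet item; an empty array means the candidate is ready
// for promotion review.
function promotionBlockers(evidence: PromotionEvidence): string[] {
  return PROMOTION_CHECKS
    .filter(([field]) => !evidence[field])
    .map(([, message]) => message);
}
```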

Enable canaries:

curl -X PUT http://localhost:4100/api/v1/settings \
  -H 'content-type: application/json' \
  -H 'authorization: Bearer <token>' \
  -d '{ "key": "canary_enabled", "value": true }'

Use the dashboard Learning view for the full evidence trail.

Phase 10: Production Deployment

Production posture depends on whether you run a long-lived server or a serverless deployment.

Core production environment:

Area	Variables
Auth	`PUNK_API_KEY`, `PUNK_ADMIN_EMAIL`, `PUNK_ADMIN_PASSWORD`, `PUNK_REQUIRE_LOGIN=true`
Storage	`PUNK_DATABASE_URL` or `DATABASE_URL`
Secrets	`PUNK_ENCRYPTION_KEY`
Providers	`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `OPENROUTER_API_KEY`, or tenant BYOK
App URLs	`PUNK_APP_BASE_URL`, `PUNK_APP_HOST`, `PUNK_MARKETING_HOST`, `PUNK_MEET_HOST`
Docs	`PUNK_DOCS_DIR` if docs are not in the default repo location
Workers	`PUNK_WORKER_POLL_MS`, `PUNK_WORKER_CONCURRENCY`
Serverless cron	`PUNK_CRON_SECRET`, `CRON_SECRET`
Retention	`PUNK_RETENTION_DAYS`
Email	`RESEND_API_KEY`, `PUNK_EMAIL_FROM`
Billing	`PUNK_BILLING_DISABLED`, `STRIPE_SECRET_KEY`, `STRIPE_WEBHOOK_SECRET`, `STRIPE_PRICE_*`
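A preflight check can catch missing variables before deploy. Which variables are mandatory depends on your deployment; the selection below (auth, encryption, one database URL, login required) is an assumption for illustration, and `missingProductionEnv` is a hypothetical helper, not a Punk command.

```typescript
type Env = Record<string, string | undefined>;

// Assumed mandatory for this sketch; adjust to your deployment.
const REQUIRED_VARS: string[] = [
  "PUNK_API_KEY",
  "PUNK_ADMIN_EMAIL",
  "PUNK_ADMIN_PASSWORD",
  "PUNK_ENCRYPTION_KEY",
];

// Either variable satisfies the storage row of the table above.
const DATABASE_VARS: string[] = ["PUNK_DATABASE_URL", "DATABASE_URL"];

// True when the variable is present and not blank.
function isSet(env: Env, name: string): boolean {
  const value = env[name];
  return value !== undefined && value.trim() !== "";
}

// Returns a description of everything that is missing; empty means OK.
function missingProductionEnv(env: Env): string[] {
  const missing = REQUIRED_VARS.filter((name) => !isSet(env, name));
  if (!DATABASE_VARS.some((name) => isSet(env, name))) {
    missing.push(DATABASE_VARS.join(" or "));
  }
  if (env["PUNK_REQUIRE_LOGIN"] !== "true") {
    missing.push("PUNK_REQUIRE_LOGIN=true");
  }
  return missing;
}

// In Bun or Node, pass the real environment: missingProductionEnv(process.env)
```

Provider keys are not checked here because tenant BYOK is a valid alternative to platform keys.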

Long-lived server:

bun run dev

For a separate worker process:

bun run worker

Serverless or Vercel-style deployment:

Configure Postgres or Neon-compatible storage.
Configure PUNK_CRON_SECRET and CRON_SECRET.
Schedule /api/v1/internal/tick once per minute.
Verify the tick endpoint drains learning, workflow, webhook, and retention jobs.
Confirm that PUNK_FAILOVER_TO_MOCK is not serving simulated content to live customers unless that is explicitly intended.

Production readiness:

/health returns healthy.
Dashboard login is required.
Docs and health remain public as intended.
Gateway routes require the intended auth.
Database migrations or schema initialization have run.
Worker or cron tick is draining queues.
Provider calls use intended platform or tenant keys.
Retention sweep is configured.
Backups exist for the production database.
Billing and quota behavior matches the commercial plan.

Use the Production readiness panel and Configuration for deployment settings.

Phase 11: Team Rollout

Once the first workflow is observable and governed, bring in the pilot team.

Team setup:

Create user accounts directly, or enable public signup only if that is intended.
Invite workflow owners, operators, security reviewers, and app developers.
Issue tenant API keys per app or integration.
Avoid sharing admin tokens.
Give each pilot workflow a named owner.

Working agreements:

Agreement	Why it matters
Every app sends app, agent, and subject identity	Makes trust, audit, and routing explainable.
New tools declare side-effect level	Prevents silent unsafe writes.
Risky actions start observe-only	Lets policy review happen before blocking or optimizing.
Promotions require evidence review	Keeps cheaper routes from becoming uncontrolled shortcuts.
Rollbacks are exercised	Operators know how to recover before real incidents.

Dashboard rituals:

Cadence	Review
Daily during pilot	Failed runs, blocked actions, pending approvals, top spend, learning attempts.
Twice weekly	Patterns, optimized routes, evidence, route mix, canary behavior.
Weekly	Policy changes, retention/redaction settings, provider key usage, savings report.
Before production expansion	Security checklist, rollback drill, customer-facing impact review.

Anti-Patterns

Avoid these common onboarding mistakes:

Judging Punk with a one-off creative prompt.
Using a prompt that is already too cheap to optimize meaningfully.
Sending all traffic without app, agent, and subject identity.
Enabling optimize mode before observing policy and route behavior.
Caching or promoting workflows that perform writes without idempotency or approvals.
Treating the workflow diagnostic as a model benchmark.
Promoting an optimization without enough evidence.
Storing provider keys without PUNK_ENCRYPTION_KEY.
Exposing open dev mode publicly.
Mixing pilot test noise with production-like traffic and then trusting the pattern.
Hiding all workflow structure in one huge prompt instead of naming inputs, steps, gates, and outputs.

Definition Of Done

Local evaluation is done when:

Punk runs locally.
Demo or chat traffic appears in Runs.
Repeated work appears in Patterns, or an abstention is clearly explained in Learning.
The evaluator can explain the route, cost, latency, and policy verdict for a run.

Pilot integration is done when:

One real app routes through /v1/chat/completions or /v1/messages.
App, agent, and subject identity are present.
At least one workflow candidate is documented.
Tool side effects are classified.
Observe-mode policy results are reviewed.
Learning evidence is visible.

Production onboarding is done when:

Auth, login, storage, encryption, provider keys, workers, and retention are configured.
Governance policy covers the first workflow.
Risky actions require approval or are denied.
Promotion has enough evidence.
Rollback is understood.
The team has an operating cadence.
The owner can explain what Punk is allowed to optimize and what must remain live.

Troubleshooting During Onboarding

Symptom	First checks
The response is simulated	Confirm provider env vars or tenant BYOK; check `PUNK_PROVIDER` and `PUNK_FAILOVER_TO_MOCK`.
Gateway returns 401	Add `Authorization: Bearer <token>` or confirm the tenant API key is valid.
Dashboard requires login	Use the bootstrap admin from `PUNK_ADMIN_EMAIL` and `PUNK_ADMIN_PASSWORD`, or create a user in open dev mode.
No patterns appear	Send several similar requests with stable identity and input shape.
No optimization promotes	Check Learning evidence notes, approval settings, and side-effect level.
Web fetch fails	Check `PUNK_ALLOW_PRIVATE_WEB_FETCH`, URL safety, and network access.
Webhook or MCP tool is blocked	Check policy, credentials, private-network controls, and side-effect level.
Scheduled agents do not run	Confirm worker process or serverless tick is active.
Costs do not drop	Confirm the workflow is repeated, stable, and eligible for cache, semantic cache, model substitution, or artifact routing.
Route stays live	Review route explanation; Punk may be correctly avoiding an unproven or unsafe shortcut.
