Onboarding Guide
This is the extended guide for taking Punk from evaluation to first production traffic. If you only need a fast local walkthrough, start with Punk in 30 Minutes. Use this guide when you are planning a real pilot, a customer proof of value, or a team rollout.
Punk works best when the work is operational: repeated, evidence-bearing, policy-sensitive, or expensive enough that routing decisions matter. It is not primarily a generic prompt benchmark for one-off creative drafts or trivial questions.
Hosted reference: cheaperfastersafer.com. Local default: http://localhost:4100.
Who This Is For
| Reader | What this guide helps you finish |
|---|---|
| Evaluator | Pick the right workflow, run a credible local proof, and know what evidence to inspect. |
| App developer | Route an existing OpenAI- or Anthropic-compatible app through Punk without rewriting the app. |
| Workflow builder | Convert repeated agent work into a workflow with inputs, gates, receipts, and replayable outputs. |
| Operator | Set up auth, storage, workers, provider keys, health checks, retention, and billing posture. |
| Security/GRC reviewer | Understand identity, side effects, approval gates, redaction, audit, and private-network controls. |
| Team admin | Invite users, issue scoped keys, and make the dashboard useful for a pilot team. |
Outcomes
By the end of onboarding, you should have:
- A local Punk gateway and dashboard.
- One real workflow candidate selected for evaluation.
- At least one observed agent/app routed through Punk.
- Traceable runs with app, agent, and subject identity.
- A governance posture for read-only, reversible, user-visible, and high-impact actions.
- An optimization-evidence view that shows whether Punk found a stable pattern.
- A decision on whether the pilot remains observe-only or starts optimized routing.
- A production readiness checklist for storage, auth, workers, provider keys, and operations.
Phase 1: Choose The Right Work
Start with a workflow, not a random prompt. Punk proves value when a repeated job has structure that can be observed, governed, measured, and approved for optimization.
Good first candidates:
| Candidate | Why it is a good fit |
|---|---|
| Support triage | Repeated classification, structured output, low-risk reads, clear evaluation criteria. |
| Vendor review | Web evidence, invoice or profile data, thresholds, approval gates, reusable scorecards. |
| Pricing monitor | Web reads, structured extraction, snapshots, diffs, repeatable schedule. |
| Lead enrichment | Web and CRM reads, field normalization, policy-controlled writes. |
| Compliance precheck | Evidence collection, deterministic gates, receipts, human approval before action. |
| Internal research brief | Repeatable source policy, citation/evidence burden, reusable templates. |
Poor first candidates:
| Candidate | Why to avoid it for the first proof |
|---|---|
| One-off creative writing | Subjective quality dominates; repeatability and route proof are weak. |
| Tiny factual questions | The baseline is already cheap and fast; savings will be uninteresting. |
| Unbounded brainstorming | Hard to define correctness, side effects, or replay evidence. |
| Fully manual workflows | Punk needs agent/app traffic to observe and improve. |
| High-impact writes on day one | Start in observe mode until policy and approval paths are clear. |
Use the public Workflow Diagnostic before a pilot call or scoping session. It compares a standard serial agent loop with Punk workflow mode across repeatability, evidence burden, side-effect risk, governance gates, receipts, review value, cost, and latency. Treat it as a workflow-fit diagnostic, not a generic leaderboard.
Phase 2: Define The Pilot Contract
Before running traffic, write down the pilot contract in plain language.
| Question | Example answer |
|---|---|
| What job are we evaluating? | "Review a new vendor and invoice, check the vendor site, flag spend over $5,000, and prepare a scorecard." |
| What input shape repeats? | Vendor URL, invoice PDF or extracted invoice fields, requester, department, spend amount. |
| What output shape matters? | JSON scorecard plus human-readable rationale and approval recommendation. |
| What evidence is required? | Vendor website snapshot, invoice fields, policy threshold, risk flags. |
| What actions are risky? | Emailing finance, creating a ticket, approving spend, storing vendor records. |
| What can be cached or promoted? | Stable extraction, policy threshold logic, scorecard shell, known vendor profile. |
| Who approves promotion? | Pilot operator or workflow owner. |
| What success metric matters? | Lower cost after proof, fewer hidden side effects, receipts for every action, less manual review. |
Use consistent identifiers from the first run:
| Identifier | Recommendation |
|---|---|
| Tenant | One tenant per company, team, or pilot customer. |
| App | Product surface or integration name, for example finance-review-app. |
| Agent | Stable actor name, for example vendor-review-agent. |
| Subject | The end user, account, customer, vendor, ticket, or workflow instance being acted on. |
Punk uses these identifiers for trust, audit, policy, cost, pattern discovery, and routing. Missing identity makes the pilot harder to interpret.
Phase 3: Run Punk Locally
Install dependencies and start the gateway:
bun install
bun run dev
Open http://localhost:4100.
Default local behavior:
| Area | Default |
|---|---|
| Port | 4100 |
| Database | data/punk.db |
| Provider | offline mock when no matching live provider key is configured |
| Auth | open dev mode when PUNK_API_KEY is unset |
| Worker | embedded in the API process |
| Learning | background tick enabled |
| Dashboard | served at / |
| Docs | served at /docs |
If the dashboard is blank, use the Getting started panel to seed demo data. For repeatable optimization traffic, keep the gateway running and use another terminal:
bun run demo
Inspect the dashboard after the first demo run:
| Dashboard area | What to verify |
|---|---|
| Overview | Recent activity, route mix, spend, savings, health. |
| Runs | Every model request has route, cost, latency, trace, and explanation. |
| Patterns | Repeated request shapes are grouped. |
| Artifacts | Candidate optimized routes show evidence, promotion, and rollback state. |
| Learning | Evidence notes explain what is eligible, blocked, or waiting for more samples. |
| Web | Compact page snapshots show structured page state and token savings. |
| Governance | Policies, users, keys, credentials, MCP servers, audit, approvals. |
| Workflows | Templates, graph editor, run panel, node timelines. |
Local success criteria:
- You can load /, /docs, and /health.
- A chat, workflow, or demo run appears in Runs.
- A repeated request can be recognized as a pattern.
- The run detail explains route choice and cost.
- The learning page explains whether a candidate is eligible or why it needs more evidence.
Phase 4: Connect One Real App In Observe Mode
The simplest integration is a base URL swap. Keep your existing OpenAI-compatible request shape, point it at Punk, and send identity headers.
curl http://localhost:4100/v1/chat/completions \
-H 'content-type: application/json' \
-H 'x-punk-app: finance-review-app' \
-H 'x-punk-agent: vendor-review-agent' \
-H 'x-punk-subject: vendor:acme-123' \
-d '{
"model": "gpt-4o-mini",
"messages": [
{
"role": "user",
"content": "Review this vendor profile and return risk, rationale, and next action."
}
]
}'
For Anthropic-compatible apps, use the native Messages endpoint:
curl http://localhost:4100/v1/messages \
-H 'content-type: application/json' \
-H 'x-punk-app: finance-review-app' \
-H 'x-punk-agent: vendor-review-agent' \
-H 'x-punk-subject: vendor:acme-123' \
-d '{
"model": "claude-haiku-4-5",
"max_tokens": 512,
"messages": [
{
"role": "user",
"content": "Review this vendor profile and return risk, rationale, and next action."
}
]
}'
If you set PUNK_API_KEY, include bearer auth:
-H 'authorization: Bearer <token>'
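In application code, the same base URL swap amounts to changing the request URL and adding headers. The TypeScript sketch below assembles a request equivalent to the curl examples above. `buildPunkRequest` and its option names are hypothetical helpers written for this guide, not functions from the Punk SDK; only the endpoints and header names come from the examples above.

```typescript
// Hypothetical helper for illustration; not part of the Punk SDK.
interface PunkRequestOptions {
  baseUrl: string; // e.g. "http://localhost:4100"
  endpoint: "/v1/chat/completions" | "/v1/messages";
  app: string;
  agent: string;
  subject: string;
  token?: string; // only needed when PUNK_API_KEY is set
  body: unknown; // the app's existing OpenAI- or Anthropic-style payload
}

interface BuiltPunkRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

function buildPunkRequest(opts: PunkRequestOptions): BuiltPunkRequest {
  const headers: Record<string, string> = {
    "content-type": "application/json",
    "x-punk-app": opts.app,
    "x-punk-agent": opts.agent,
    "x-punk-subject": opts.subject,
  };
  if (opts.token) {
    headers["authorization"] = `Bearer ${opts.token}`;
  }
  return {
    // Strip trailing slashes so the joined URL has exactly one separator.
    url: opts.baseUrl.replace(/\/+$/, "") + opts.endpoint,
    init: { method: "POST", headers, body: JSON.stringify(opts.body) },
  };
}

const chatRequest = buildPunkRequest({
  baseUrl: "http://localhost:4100",
  endpoint: "/v1/chat/completions",
  app: "finance-review-app",
  agent: "vendor-review-agent",
  subject: "vendor:acme-123",
  body: {
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: "Review this vendor profile." }],
  },
});
```

Passing `chatRequest.url` and `chatRequest.init` to `fetch` sends the request. The payload itself is unchanged, which is the point of the swap.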
Start in observe mode for consequential work. Observe mode records what policy would do without blocking live work. Use optimize mode only after the trace, governance, and promotion paths are understood.
What to inspect after the first real app run:
| Evidence | Where |
|---|---|
| Model provider and key source | Run detail trace and route explanation. |
| Prompt shape and token use | Run detail. |
| App, agent, subject | Run detail and governance records. |
| Policy verdict | Run detail and Governance audit. |
| Cost and latency | Runs table and Overview. |
| Repeated pattern | Patterns and Learning after several similar requests. |
Phase 5: Add Provider Keys Deliberately
With no live provider key, Punk uses the deterministic mock provider for local work. For live calls, configure platform keys or tenant BYOK.
Platform env keys:
| Provider | Env vars |
|---|---|
| OpenAI | OPENAI_API_KEY, optional OPENAI_BASE_URL |
| Anthropic | ANTHROPIC_API_KEY, optional ANTHROPIC_BASE_URL |
| OpenRouter | OPENROUTER_API_KEY, optional OPENROUTER_BASE_URL |
| DeepSeek | DEEPSEEK_API_KEY, optional DEEPSEEK_BASE_URL |
| Kimi/Moonshot | MOONSHOT_API_KEY or KIMI_API_KEY, optional base URL |
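Before sending live traffic, it can help to confirm which platform keys a process actually sees. The sketch below turns the table above into a lookup and reports which providers have a non-empty key variable. `detectProviders` is a hypothetical helper, and it does not reproduce Punk's own provider selection order, which this guide does not describe.

```typescript
// Env var names come from the table above; the helper itself is hypothetical.
const PROVIDER_KEY_VARS: Record<string, string[]> = {
  openai: ["OPENAI_API_KEY"],
  anthropic: ["ANTHROPIC_API_KEY"],
  openrouter: ["OPENROUTER_API_KEY"],
  deepseek: ["DEEPSEEK_API_KEY"],
  moonshot: ["MOONSHOT_API_KEY", "KIMI_API_KEY"],
};

type ProviderEnv = Record<string, string | undefined>;

// Returns the providers that have at least one non-empty key variable set.
function detectProviders(env: ProviderEnv): string[] {
  return Object.keys(PROVIDER_KEY_VARS).filter((provider) =>
    PROVIDER_KEY_VARS[provider].some((name) => (env[name] ?? "").trim() !== "")
  );
}

// Example with a literal map; an empty KIMI_API_KEY does not count as configured.
const configured = detectProviders({ OPENAI_API_KEY: "sk-example", KIMI_API_KEY: "" });
```

In a Bun or Node process the same function can be called with `process.env`. An empty result suggests local requests would be served by the mock provider.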
Tenant BYOK stores a tenant-owned provider key in the encrypted credentials vault. Set PUNK_ENCRYPTION_KEY before relying on stored credentials outside local dev.
Example tenant key:
curl -X POST http://localhost:4100/api/v1/credentials \
-H 'content-type: application/json' \
-H 'authorization: Bearer <token>' \
-d '{
"name": "openai",
"provider": "openai",
"secret": { "value": "sk-..." }
}'
Read Configuration before mixing platform keys and tenant keys in production.
Phase 6: Classify Tools And Side Effects
Punk can govern direct model calls, SDK tool traces, workflow tool nodes, web sessions, and webhook effects. The important first step is classifying side effects.
| Level | Meaning | Pilot posture |
|---|---|---|
| 0 | Pure computation | Safe to observe and optimize early. |
| 1 | Read-only external | Good first pilot scope. |
| 2 | Reversible or idempotent write | Require identity, idempotency, and audit. |
| 3 | User-visible write | Start observe-only; add approval rules before optimize. |
| 4 | High-impact write | Require explicit policy, approval, and rollback plan. |
Undeclared SDK tools default to side-effect level 3. That is intentional: unknown tools are treated like user-visible writes.
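The classification above can be captured as a small lookup that an integration keeps next to its tool definitions. This is a minimal sketch: the tool names and the `levelFor` helper are hypothetical, and only the levels, postures, and the default of 3 for undeclared tools come from this guide.

```typescript
// Levels and postures mirror the table above; all names are hypothetical.
type SideEffectLevel = 0 | 1 | 2 | 3 | 4;

const PILOT_POSTURE: Record<SideEffectLevel, string> = {
  0: "Safe to observe and optimize early.",
  1: "Good first pilot scope.",
  2: "Require identity, idempotency, and audit.",
  3: "Start observe-only; add approval rules before optimize.",
  4: "Require explicit policy, approval, and rollback plan.",
};

// Unknown tools are treated like user-visible writes.
const DEFAULT_LEVEL: SideEffectLevel = 3;

// Hypothetical registry of the tools one pilot app declares.
const declaredTools: Record<string, SideEffectLevel> = {
  "extract-invoice-fields": 0,
  "fetch-vendor-site": 1,
  "upsert-vendor-record": 2,
  "email-finance": 3,
  "approve-spend": 4,
};

function levelFor(tool: string): SideEffectLevel {
  // hasOwnProperty avoids matching inherited keys such as "toString".
  return Object.prototype.hasOwnProperty.call(declaredTools, tool)
    ? declaredTools[tool]
    : DEFAULT_LEVEL;
}

const posture = PILOT_POSTURE[levelFor("create-ticket")]; // undeclared, so level 3 applies
```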
For app code, use the TypeScript SDK when you need tool tracing, feedback, web fetch, or web sessions. Keep the first integration small:
- Route model calls through Punk.
- Add identity headers.
- Add tool tracing around one or two important tools.
- Mark tool side-effect levels.
- Verify trace and governance events.
- Expand only after the first path is observable.
Read SDK, API, and Governance for the exact client and HTTP surfaces.
Phase 7: Build The Workflow Version
Once the repeated job is visible, decide whether it should remain a chat/agent flow or become a workflow.
| Surface | Use when |
|---|---|
| Gateway | You need a low-friction base URL swap for an existing agent. |
| Chat | A human is actively testing prompts and route behavior. |
| Agent | One scheduled or on-demand task can be represented as start -> llm -> output. |
| Workflow | The job has multiple steps, branches, tools, web reads, gates, or structured outputs. |
| Chorus | The job needs governed multi-model answers with evidence receipts. |
Workflow design checklist:
- Inputs are explicit JSON, not hidden in prose.
- Every web or external read has a named step.
- Risky actions are separate from reasoning steps.
- Side effects have declared levels.
- Gates are stated as policy or workflow conditions.
- Outputs have a stable schema.
- Receipts and evidence are preserved.
- The workflow can be reviewed without firing real side effects.
- The owner can explain what would be promoted and what must remain live.
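Several items in the checklist above are mechanical enough to check in code before a design review. The sketch below does that against a deliberately simplified workflow shape. The `WorkflowSketch` and `StepSketch` types are invented for this example and are not Punk's workflow format; see Workflows for the real one.

```typescript
// Hypothetical shapes invented for this example; not Punk's workflow format.
type StepKind = "llm" | "web_read" | "tool" | "gate" | "output";

interface StepSketch {
  name: string;
  kind: StepKind;
  sideEffectLevel?: 0 | 1 | 2 | 3 | 4; // expected on every tool step
}

interface WorkflowSketch {
  inputs: Record<string, string>; // input name -> JSON type
  steps: StepSketch[];
  outputSchema?: Record<string, string>;
}

// Returns one message per checklist item the draft fails.
function lintWorkflow(wf: WorkflowSketch): string[] {
  const problems: string[] = [];
  if (Object.keys(wf.inputs).length === 0) {
    problems.push("Inputs are not declared as explicit JSON fields.");
  }
  for (const step of wf.steps) {
    if (step.name.trim() === "") {
      problems.push(`A ${step.kind} step has no name.`);
    }
    if (step.kind === "tool" && step.sideEffectLevel === undefined) {
      problems.push(`Tool step "${step.name}" does not declare a side-effect level.`);
    }
    if (step.kind === "llm" && step.sideEffectLevel !== undefined && step.sideEffectLevel >= 2) {
      problems.push(`Reasoning step "${step.name}" also performs a write; split it.`);
    }
  }
  if (!wf.outputSchema || Object.keys(wf.outputSchema).length === 0) {
    problems.push("Output has no stable schema.");
  }
  return problems;
}

const draft: WorkflowSketch = {
  inputs: { vendor_url: "string", spend_amount: "number" },
  steps: [
    { name: "fetch-vendor-site", kind: "web_read" },
    { name: "score-vendor", kind: "llm" },
    { name: "email-finance", kind: "tool" }, // no level declared, so it is flagged
    { name: "scorecard", kind: "output" },
  ],
  outputSchema: { risk: "string", rationale: "string", next_action: "string" },
};

const draftProblems = lintWorkflow(draft);
```

The remaining checklist items, such as whether the owner can explain what would be promoted, still need a human review.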
Start from the dashboard templates:
| Template | First use |
|---|---|
| support-triage | Ticket classification and conditional notification. |
| web-research | Web fetch plus model summary. |
| pricing-monitor | Scheduled web reads and structured extraction. |
Run the workflow several times with similar inputs. Then inspect Runs, Patterns, Learning, and Artifacts to see whether Punk found a stable route.
Phase 8: Governance And Security Review
Do this before optimize mode or production exposure.
Access and identity:
- Set PUNK_API_KEY for protected API and gateway routes.
- Bootstrap dashboard users with PUNK_ADMIN_EMAIL, PUNK_ADMIN_PASSWORD, and optionally PUNK_REQUIRE_LOGIN=true.
- Use tenant API keys for apps rather than sharing the bootstrap admin token.
- Pin keys to app ids when possible.
- Send X-Punk-App, X-Punk-Agent, and X-Punk-Subject.
Secrets and credentials:
- Set PUNK_ENCRYPTION_KEY before storing provider keys, workflow credentials, or MCP credentials.
- Store provider BYOK keys under Governance -> Provider keys or /api/v1/credentials.
- Do not put secrets in prompts, workflow inputs, or trace-visible metadata.
Network controls:
- Leave PUNK_ALLOW_PRIVATE_WEB_FETCH=false in authenticated deployments unless private fetches are intended.
- Leave PUNK_ALLOW_PRIVATE_WEBHOOKS=false unless private webhook targets are intended.
- Review web session and webhook destinations before enabling writes.
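When reviewing destinations, it helps to be concrete about what counts as a private target. The sketch below flags loopback, RFC 1918, and link-local IPv4 hosts. It is an illustration for reviewers, not Punk's URL safety check, which this guide does not document; it also ignores IPv6 and public DNS names that resolve to private addresses.

```typescript
// Illustrative only; Punk's actual URL safety checks are not documented here.
function isPrivateHost(hostname: string): boolean {
  const host = hostname.toLowerCase();
  if (host === "localhost") {
    return true;
  }
  const match = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(host);
  if (!match) {
    return false; // not an IPv4 literal; DNS names and IPv6 are out of scope
  }
  const octets = match.slice(1).map(Number);
  if (octets.some((octet) => octet > 255)) {
    return false; // not a valid IPv4 address
  }
  const a = octets[0];
  const b = octets[1];
  return (
    a === 10 || // 10.0.0.0/8
    a === 127 || // 127.0.0.0/8 loopback
    (a === 172 && b >= 16 && b <= 31) || // 172.16.0.0/12
    (a === 192 && b === 168) || // 192.168.0.0/16
    (a === 169 && b === 254) // 169.254.0.0/16 link-local
  );
}

// Hypothetical destinations collected from a pilot workflow's web and webhook steps.
const destinations = ["example.com", "10.0.4.12", "169.254.169.254", "localhost"];
const privateTargets = destinations.filter(isPrivateHost);
```

Any host that lands in `privateTargets` deserves an explicit decision before the corresponding escape hatch is enabled.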
Policy and approvals:
- Keep policies in PUNK_POLICIES_DIR.
- Declare allow, deny, and approval-required rules for the pilot app and agent.
- Require approval for side-effect levels 3 and 4 unless the workflow owner explicitly accepts the risk.
- Use observe mode first to see what would have been blocked.
- Review audit events and pending approvals in the dashboard.
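The difference between observing and enforcing a rule is easiest to see as a function. The sketch below models the approval rule for levels 3 and 4 from the list above. The mode name `enforce`, the verdict strings, and the `decide` helper are assumptions made for illustration; Punk's policy format and verdict names are covered in Governance.

```typescript
// Hypothetical model of the approval rule; not Punk's policy engine.
type PolicyMode = "observe" | "enforce";
type PolicyVerdict = "allow" | "approval_required";

interface PolicyDecision {
  verdict: PolicyVerdict; // what the rule says, recorded for audit in both modes
  blocksRequest: boolean; // whether live work actually waits on the verdict
}

function decide(
  sideEffectLevel: 0 | 1 | 2 | 3 | 4,
  mode: PolicyMode,
  ownerAcceptedRisk: boolean = false
): PolicyDecision {
  const verdict: PolicyVerdict =
    sideEffectLevel >= 3 && !ownerAcceptedRisk ? "approval_required" : "allow";
  // Observe mode records the verdict without blocking live work.
  return { verdict, blocksRequest: mode === "enforce" && verdict !== "allow" };
}

const observed = decide(3, "observe"); // approval_required is recorded, nothing blocks
const enforced = decide(3, "enforce"); // the same verdict now holds the action
```

Comparing the two results for the same action shows what a switch out of observe mode would start blocking.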
Data controls:
- Decide retention_days.
- Enable tenant setting redaction=true when tool payloads may include sensitive fields.
- Review streaming_dlp=true if secrets or regulated identifiers could leave the gateway.
- Use tripwires for sensitive decoy values that should never appear in prompts or outputs.
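The tripwire idea in the last item can be sketched in a few lines: plant decoy values where only a leak would surface them, then scan prompts and outputs for them. The helper and decoy values below are hypothetical and only illustrate the concept; they are not Punk's implementation.

```typescript
// Illustration of the tripwire idea only; not Punk's implementation.
// Returns every decoy value that appears in the text, ignoring case.
function findTripwires(text: string, decoys: string[]): string[] {
  const haystack = text.toLowerCase();
  return decoys.filter(
    (decoy) => decoy.length > 0 && haystack.indexOf(decoy.toLowerCase()) !== -1
  );
}

// Hypothetical decoy values that no legitimate prompt or output should contain.
const decoyValues = ["DECOY-ACCT-000-111", "canary@decoy.example"];

const tripped = findTripwires(
  "Forward the invoice to Canary@Decoy.example today.",
  decoyValues
);
```

A non-empty result means a decoy escaped its intended location and the run should be investigated.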
Security review done means:
- A non-admin app key exists.
- The app key is scoped or pinned where possible.
- Provider and tool credentials are encrypted.
- Risky actions are gated.
- Private-network escape hatches are intentionally set.
- Retention and redaction are chosen.
- The operator can find audit records for a run, approval, and policy decision.
Phase 9: Evidence And Promotion
Punk does not promote a cheaper route just because it was cheaper once. The promotion loop is evidence-driven:
- Observe repeated request shapes.
- Group stable traffic into patterns.
- Prepare candidate optimized routes only when the task is stable enough.
- Check candidates against relevant history.
- Compare candidates against live traffic without firing side effects.
- Require policy or human approval when configured.
- Route matching future traffic through the cheapest safe proven path.
Force a learning pass during a pilot:
curl -X POST http://localhost:4100/api/v1/learning/tick \
-H 'authorization: Bearer <token>'
Evidence to look for:
| Evidence | Why it matters |
|---|---|
| Pattern confidence | Shows whether Punk sees stable repeated work. |
| Evidence notes | Explain why a candidate is eligible, blocked, or waiting for more samples. |
| History check | Shows how the candidate performed against relevant prior work. |
| Live comparison | Compares candidate behavior against live traffic without firing effects. |
| Artifact receipt | Records what was promoted, by whom, and with what evidence. |
| Route explanation | Shows why a future request used live, cache, semantic cache, or artifact. |
Promotion readiness checklist:
- The pattern represents real repeated work, not test noise.
- Outputs have an objective or reviewable contract.
- History checks passed against enough relevant work.
- Live comparison did not create hidden side effects.
- Governance allows the promoted route.
- A human owner understands rollback.
- Canary mode is enabled if the first production rollout should be gradual.
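A pilot team may want to record the checklist above as data so that a promotion review leaves a trail. The sketch below is one way to do that. The `PromotionEvidence` fields are hypothetical names mapped one-to-one to the checklist items, and canary mode is left out because the checklist treats it as optional.

```typescript
// Hypothetical review record; field names map to the checklist above.
interface PromotionEvidence {
  realRepeatedWork: boolean;
  reviewableOutputContract: boolean;
  historyChecksPassed: boolean;
  liveComparisonClean: boolean;
  governanceAllows: boolean;
  ownerUnderstandsRollback: boolean;
}

const PROMOTION_CHECKS: Array<[keyof PromotionEvidence, string]> = [
  ["realRepeatedWork", "Pattern may be test noise rather than real repeated work."],
  ["reviewableOutputContract", "Outputs lack an objective or reviewable contract."],
  ["historyChecksPassed", "History checks have not passed against enough relevant work."],
  ["liveComparisonClean", "Live comparison created hidden side effects."],
  ["governanceAllows", "Governance does not allow the promoted route."],
  ["ownerUnderstandsRollback", "No human owner has confirmed the rollback path."],
];

// Returns the unmet items; an empty list means the review found no blockers.
function promotionBlockers(evidence: PromotionEvidence): string[] {
  return PROMOTION_CHECKS.filter(([field]) => !evidence[field]).map(
    ([, message]) => message
  );
}

const reviewBlockers = promotionBlockers({
  realRepeatedWork: true,
  reviewableOutputContract: true,
  historyChecksPassed: false,
  liveComparisonClean: true,
  governanceAllows: true,
  ownerUnderstandsRollback: false,
});
```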
Enable canaries:
curl -X PUT http://localhost:4100/api/v1/settings \
-H 'content-type: application/json' \
-H 'authorization: Bearer <token>' \
-d '{ "key": "canary_enabled", "value": true }'
Use the dashboard Learning view for the full evidence trail.
Phase 10: Production Deployment
Production posture depends on whether you run a long-lived server or a serverless deployment.
Core production environment:
| Area | Variables |
|---|---|
| Auth | PUNK_API_KEY, PUNK_ADMIN_EMAIL, PUNK_ADMIN_PASSWORD, PUNK_REQUIRE_LOGIN=true |
| Storage | PUNK_DATABASE_URL or DATABASE_URL |
| Secrets | PUNK_ENCRYPTION_KEY |
| Providers | OPENAI_API_KEY, ANTHROPIC_API_KEY, OPENROUTER_API_KEY, or tenant BYOK |
| App URLs | PUNK_APP_BASE_URL, PUNK_APP_HOST, PUNK_MARKETING_HOST, PUNK_MEET_HOST |
| Docs | PUNK_DOCS_DIR if docs are not in the default repo location |
| Workers | PUNK_WORKER_POLL_MS, PUNK_WORKER_CONCURRENCY |
| Serverless cron | PUNK_CRON_SECRET, CRON_SECRET |
| Retention | PUNK_RETENTION_DAYS |
| Email | RESEND_API_KEY, PUNK_EMAIL_FROM |
| Billing | PUNK_BILLING_DISABLED, STRIPE_SECRET_KEY, STRIPE_WEBHOOK_SECRET, STRIPE_PRICE_* |
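A preflight check over the variables in the table above can catch a half-configured deployment before it takes traffic. The sketch below treats the Auth, Storage, and Secrets rows as mandatory, which is an assumption made for illustration; Configuration is the authority on what a given deployment requires. `missingProductionEnv` is a hypothetical helper.

```typescript
// Hypothetical preflight; which variables are mandatory is an assumption.
type DeployEnv = Record<string, string | undefined>;

function missingProductionEnv(env: DeployEnv): string[] {
  const isSet = (name: string): boolean => (env[name] ?? "").trim() !== "";
  const missing: string[] = [];

  const always = [
    "PUNK_API_KEY",
    "PUNK_ADMIN_EMAIL",
    "PUNK_ADMIN_PASSWORD",
    "PUNK_ENCRYPTION_KEY",
  ];
  for (const name of always) {
    if (!isSet(name)) {
      missing.push(name);
    }
  }
  // Either storage variable satisfies the Storage row.
  if (!isSet("PUNK_DATABASE_URL") && !isSet("DATABASE_URL")) {
    missing.push("PUNK_DATABASE_URL or DATABASE_URL");
  }
  if (env["PUNK_REQUIRE_LOGIN"] !== "true") {
    missing.push("PUNK_REQUIRE_LOGIN=true");
  }
  return missing;
}

// Example with a literal map standing in for the deployment environment.
const gaps = missingProductionEnv({
  PUNK_API_KEY: "example-key",
  PUNK_ADMIN_EMAIL: "admin@example.com",
  PUNK_ADMIN_PASSWORD: "example-password",
  DATABASE_URL: "postgres://example",
});
```

Here `gaps` lists the encryption key and the login requirement, the two settings the example map leaves out.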
Long-lived server:
bun run dev
For a separate worker process:
bun run worker
Serverless or Vercel-style deployment:
- Configure Postgres or Neon-compatible storage.
- Configure PUNK_CRON_SECRET and CRON_SECRET.
- Schedule /api/v1/internal/tick once per minute.
- Verify the tick endpoint drains learning, workflow, webhook, and retention jobs.
- Confirm PUNK_FAILOVER_TO_MOCK is not silently serving simulated content for live customers unless explicitly intended.
Production readiness:
- /health returns healthy.
- Dashboard login is required.
- Docs and health remain public as intended.
- Gateway routes require the intended auth.
- Database migrations or schema initialization have run.
- Worker or cron tick is draining queues.
- Provider calls use intended platform or tenant keys.
- Retention sweep is configured.
- Backups exist for the production database.
- Billing and quota behavior matches the commercial plan.
Use the Production readiness panel and Configuration for deployment settings.
Phase 11: Team Rollout
Once the first workflow is observable and governed, bring in the pilot team.
Team setup:
- Create user accounts or enable public signup only if intended.
- Invite workflow owners, operators, security reviewers, and app developers.
- Issue tenant API keys per app or integration.
- Avoid sharing admin tokens.
- Give each pilot workflow a named owner.
Working agreements:
| Agreement | Why it matters |
|---|---|
| Every app sends app, agent, and subject identity | Makes trust, audit, and routing explainable. |
| New tools declare side-effect level | Prevents silent unsafe writes. |
| Risky actions start observe-only | Lets policy review happen before blocking or optimizing. |
| Promotions require evidence review | Keeps cheaper routes from becoming uncontrolled shortcuts. |
| Rollbacks are exercised | Operators know how to recover before real incidents. |
Dashboard rituals:
| Cadence | Review |
|---|---|
| Daily during pilot | Failed runs, blocked actions, pending approvals, top spend, learning attempts. |
| Twice weekly | Patterns, optimized routes, evidence, route mix, canary behavior. |
| Weekly | Policy changes, retention/redaction settings, provider key usage, savings report. |
| Before production expansion | Security checklist, rollback drill, customer-facing impact review. |
Anti-Patterns
Avoid these common onboarding mistakes:
- Judging Punk with a one-off creative prompt.
- Using a prompt that is already too cheap to optimize meaningfully.
- Sending all traffic without app, agent, and subject identity.
- Enabling optimize mode before observing policy and route behavior.
- Caching or promoting workflows that perform writes without idempotency or approvals.
- Treating the workflow diagnostic as a model benchmark.
- Promoting an optimization without enough evidence.
- Storing provider keys without PUNK_ENCRYPTION_KEY.
- Exposing open dev mode publicly.
- Mixing pilot test noise with production-like traffic and then trusting the pattern.
- Hiding all workflow structure in one huge prompt instead of naming inputs, steps, gates, and outputs.
Definition Of Done
Local evaluation is done when:
- Punk runs locally.
- Demo or chat traffic appears in Runs.
- Repeated work appears in Patterns or an abstention is clearly explained in Learning.
- The evaluator can explain the route, cost, latency, and policy verdict for a run.
Pilot integration is done when:
- One real app routes through /v1/chat/completions or /v1/messages.
- App, agent, and subject identity are present.
- At least one workflow candidate is documented.
- Tool side effects are classified.
- Observe-mode policy results are reviewed.
- Learning evidence is visible.
Production onboarding is done when:
- Auth, login, storage, encryption, provider keys, workers, and retention are configured.
- Governance policy covers the first workflow.
- Risky actions require approval or are denied.
- Promotion has enough evidence.
- Rollback is understood.
- The team has an operating cadence.
- The owner can explain what Punk is allowed to optimize and what must remain live.
Troubleshooting During Onboarding
| Symptom | First checks |
|---|---|
| The response is simulated | Confirm provider env vars or tenant BYOK; check PUNK_PROVIDER and PUNK_FAILOVER_TO_MOCK. |
| Gateway returns 401 | Add Authorization: Bearer <token> or confirm the tenant API key is valid. |
| Dashboard requires login | Use the bootstrap admin from PUNK_ADMIN_EMAIL and PUNK_ADMIN_PASSWORD, or create a user in open dev mode. |
| No patterns appear | Send several similar requests with stable identity and input shape. |
| No optimization promotes | Check Learning evidence notes, approval settings, and side-effect level. |
| Web fetch fails | Check PUNK_ALLOW_PRIVATE_WEB_FETCH, URL safety, and network access. |
| Webhook or MCP tool is blocked | Check policy, credentials, private-network controls, and side-effect level. |
| Scheduled agents do not run | Confirm worker process or serverless tick is active. |
| Costs do not drop | Confirm the workflow is repeated, stable, and eligible for cache, semantic cache, model substitution, or artifact routing. |
| Route stays live | Review route explanation; Punk may be correctly avoiding an unproven or unsafe shortcut. |
Read Next
- Punk in 30 Minutes: the fast local walkthrough.
- Workflow Diagnostic: public workflow-fit diagnostic.
- Workflows: workflow templates, scheduling, credentials, MCP tools.
- Chat & Agents: chat economics, save-as-agent, scheduled task agents.
- SDK: TypeScript client and tracing helpers.
- API: HTTP endpoints, auth, identity headers, response conventions.
- Governance: policies, trust tiers, approvals, audit, observe mode.
- Configuration: env vars, provider modes, auth, databases, tenant settings.
- Billing & Usage: plans, quotas, usage metering, Stripe.
- Troubleshooting: common symptoms and fixes.