PMAI PM Playbook

AI PM Playbook

The full operating model, evidence hierarchy, readiness scoring, and decision framework for PMs building production AI products.

13 sections~ 45 min cover to coverMIT licensed

Practical open source playbook for product managers building production AI products.

AI PM Playbook helps product managers move AI ideas from vague opportunity to production-ready product work. It gives PMs the artifacts, decision gates, eval language, launch criteria, and operating rituals needed to build AI products that users can trust.

This is not a prompt pack. It is not a model leaderboard. It is not a collection of AI hot takes.

It is a working system for PMs building AI products.

Why This Exists

AI product work has changed.

PMs building with AI are not only writing requirements for screens and workflows. They are defining:

  • What job the AI performs.
  • What the AI must never do.
  • What data the system can use.
  • What quality means.
  • How output is evaluated.
  • When humans must review or approve.
  • What happens when confidence is low.
  • How cost scales with usage.
  • How launch risk is managed.
  • How quality is monitored after release.

Many AI products fail in the space between a promising demo and a production workflow. The demo works. The product does not.

Common failure modes

  • The AI feature solves a weak or unclear problem.
  • The PRD says "assistant" but does not define the AI job.
  • Quality is judged by a few hand-picked examples.
  • The team has no eval set or regression process.
  • Users cannot inspect, correct, or override the AI.
  • The system has no fallback for low-confidence outputs.
  • Cost per workflow is discovered too late.
  • Risk review happens after the product is built.
  • Launch happens without observability.

AI PM Playbook exists to make those gaps visible before they become production failures.

Who This Is For

Use this playbook if you are

  • A PM building AI features in SaaS, consumer, fintech, healthcare, education, devtools, marketplaces, or internal tools.
  • A founder deciding which AI workflows are worth building.
  • A Staff, Lead, or Group PM creating an AI product operating model.
  • A product leader reviewing whether an AI roadmap is credible.
  • A design, engineering, data, security, or legal partner who needs clearer AI product artifacts.

This playbook is especially useful when the product uses:

  • LLMs.
  • Agents.
  • Retrieval augmented generation.
  • AI copilots.
  • AI workflow automation.
  • AI summarization, extraction, ranking, scoring, classification, or generation.
  • Human-in-the-loop review flows.
  • AI systems that touch customer data, user decisions, business logic, or regulated workflows.

This playbook is not for

  • Prompt-only collections.
  • Model rankings without product context.
  • Pure research benchmarking.
  • Full-stack AI app templates.
  • Generic innovation theater.

What This Playbook Helps You Do

AI PM Playbook helps you make better product decisions before, during, and after AI implementation.

PM jobPlaybook support
Decide whether an AI idea is worth buildingAI Opportunity Brief
Define the AI's job clearlyAI PRD
Convert PM intent into build-ready workOptional AI Build Brief
Define qualityAI Eval Plan
Make failure modes visibleAI PRD risk table and Launch Gate Checklist
Design human reviewHuman Review Workflow
Decide if the product can launchLaunch Gate Checklist
Track production behaviorObservability Plan
Understand unit economicsAI Cost Model
Run the product after launchObservability Plan weekly review

The goal is to replace

"This AI demo feels good."

with:

"This AI workflow is ready for a limited pilot because problem fit, data readiness,
eval coverage, human review, fallback behavior, cost, and observability meet the
pilot gate. These three gaps must close before production launch."

Quick Start

1. Start With The Opportunity

If you are about to build a demo, start with docs/before-you-vibe-code.md. It is the 10-minute preflight check before writing a PRD or opening an AI builder.

Use templates/ai-opportunity-brief.md.

Answer

  • What user problem are we solving?
  • Why is AI appropriate?
  • What is the smallest useful version?
  • What happens if we do nothing?
  • What existing workflow or product path could solve this without AI?

2. Write The AI PRD

Use templates/ai-prd.md.

Define

  • AI job.
  • Inputs and outputs.
  • User workflow.
  • Autonomy level.
  • Human review rules.
  • Quality bar.
  • Failure behavior.
  • Observability requirements.

3. Create The Eval Plan

Use templates/ai-eval-plan.md.

Define

  • Golden examples.
  • Quality rubric.
  • Edge cases.
  • Unacceptable outputs.
  • Regression checks.
  • Human review process.
  • Launch threshold.

4. Complete The Launch Gate

Use templates/launch-gate-checklist.md.

Decide

  • Prototype only.
  • Pilot candidate.
  • Pilot ready with conditions.
  • Limited production ready.
  • Scale ready.
  • Do not launch.

5. Review The Readiness Assessment

Read one of the example readiness-assessment.yaml files and compare the dimension scores, evidence, risks, next actions, and recommendation. The schema in schema/ defines the structure if you want to validate or automate this later.

AI PM Workflow

Every serious AI product moves through this loop.

1. Pick the right opportunity
2. Prototype to validate (time-boxed, disposable)
3. Define the AI job
4. Design the user workflow
5. Specify data, model, and system requirements
6. Define evals and quality bars
7. Identify risks and mitigations
8. Model cost and business value
9. Set launch gates
10. Plan staged rollout (shadow, canary, cohort)
11. Instrument production behavior
12. Operate and improve after launch

Step 2 is deliberate: use a time-boxed prototype to learn what is possible before over-specifying the PRD. Prototypes reveal behavior. PRDs capture why the work matters, what quality means, and what the team has agreed to ship. You need both.

Each step creates or updates one artifact. If a meeting, document, or checklist does not help the team make a better product decision, skip it.

Core Artifacts

AI Opportunity Brief

Use this before writing a PRD.

Purpose

Decide whether an AI idea is worth pursuing.

Must answer

  • What user problem does this solve?
  • Who has this problem?
  • How do they solve it today?
  • Why is AI appropriate?
  • Why is deterministic software insufficient?
  • What is the smallest useful version?
  • What existing product path, workflow, or code already solves this?
  • What happens if we do nothing?
  • What business outcome improves?
  • What risks are already obvious?

Decision options

  • Pursue.
  • Prototype.
  • Defer.
  • Reject.

AI PRD

Use this once the opportunity is worth exploring.

Purpose

Define the AI product clearly enough for engineering, design, data, and risk partners to build and review.

Required sections

  • Problem.
  • Goals.
  • Non-goals.
  • Target users.
  • Current workflow.
  • Proposed workflow.
  • AI job.
  • Model requirements (model selection, token budgets, streaming, multi-model routing).
  • System persona (tone, constraints, persona boundaries).
  • Data provenance (sources, permissions, retention, regulatory constraints).
  • Input contract.
  • Output contract.
  • Autonomy level.
  • Agent tool boundaries (for agentic products: allowed tools, constraints, escalation).
  • Example inputs and outputs (including rejection cases).
  • Human review rules.
  • Quality bar.
  • Latency target.
  • Cost constraint.
  • Failure and fallback behavior.
  • Observability requirements.
  • Launch gates.
  • Open questions.

The most important line in the PRD is the AI job:

The AI is responsible for [specific task] using [allowed inputs] to produce
[specific output] for [user] inside [workflow], subject to [review or safety rule].

AI Eval Plan

Use this before trusting model behavior.

Purpose

Define what "good" means and how the team will measure it.

Required sections

  • Task definition.
  • Eval scope (node, session, and system level for agentic products).
  • Evaluation dataset.
  • Golden examples.
  • Quality rubric.
  • Grading tiers (code-based, LLM-as-judge, human review).
  • Non-deterministic eval strategy (multiple runs, variance tracking).
  • Edge cases.
  • Unacceptable failures.
  • Automated checks.
  • Regression plan.
  • Online metrics.
  • Data flywheel (production failures feed back into eval set).
  • Launch threshold.
  • Review cadence.

Minimum eval standard

  • One representative happy path.
  • One common edge case.
  • One unacceptable output.
  • One safety or policy boundary.
  • One regression case from a real failure once available.

Eval anti-patterns to watch

  • Evaluating on a golden set built from one model's outputs (reinforces that model's biases).
  • Treating 100% pass rate as success (usually means the eval set is too easy).
  • Grading agents on exact tool-call sequences instead of outcomes.
  • Relying on a single eval run for non-deterministic outputs.
  • Skipping integration tests at agent boundaries.

Risks And Mitigations

Use the PRD risk table and launch gate before pilot and production launch.

Purpose

Make AI product risk concrete, owned, and reviewable without creating extra artifact sprawl.

Useful risk categories

  • Incorrect output.
  • Over-trust.
  • Data leakage.
  • Permission failure.
  • Prompt injection.
  • Unsafe autonomy.
  • Regulatory exposure.
  • Bias or unfair treatment.
  • Cost spike.
  • Silent degradation.

For agentic products, also consider agent-specific failure modes:

  • Goal hijacking.
  • Tool misuse.
  • Identity abuse.
  • Memory poisoning.
  • Error cascading across steps.

Human Review Workflow

Use this when the AI affects user decisions, customer communication, money, health, legal, identity, permissions, or irreversible actions.

Purpose

Define when humans review, approve, correct, or override AI behavior.

Required sections

  • Actions AI can take alone.
  • Actions AI can suggest only.
  • Actions AI must never take.
  • Required review points.
  • Review UI requirements.
  • Escalation path.
  • Audit trail.
  • Feedback captured from review.

Launch Gate Checklist

Use this before pilot, production, and scale-up.

Purpose

Decide whether the AI product can move to the next release stage.

Gate areas

  • Product value.
  • AI job definition.
  • Data readiness.
  • Eval readiness.
  • UX and trust.
  • Human review.
  • Risk and compliance.
  • Cost.
  • Observability.
  • Support.
  • Post-launch owner.

Observability Plan

Use this before any customer-facing launch.

Purpose

Define what the team will monitor after launch.

Track

  • Adoption.
  • Task completion.
  • Accept rate.
  • Edit rate.
  • Reject rate.
  • Retry rate.
  • Escalation rate.
  • User-reported issue rate.
  • Latency.
  • Cost per task.
  • Cost per customer.
  • Quality score.
  • Regression failures.

Technical logs alone are not enough. The PM needs product-quality signals.

The observability plan also includes the weekly post-launch review: what changed, usage, quality, cost, latency, user feedback, incidents, top failure modes, product decisions, and next actions.

AI Cost Model

Use this before scale.

Purpose

Make AI unit economics visible before usage grows.

Inputs

  • Model or provider cost.
  • Average input size.
  • Average output size.
  • Retrieval or tool cost.
  • Cache hit rate.
  • Expected tasks per user.
  • Expected users per customer.
  • Gross margin target.
  • Pricing assumption.

Outputs

  • Cost per task.
  • Cost per user.
  • Cost per customer.
  • Break-even usage.
  • Sensitivity to heavy users.
  • Agentic cost multiplier (for agent workflows, measure the added cost from tool calls, retries, and context accumulation).
  • Multi-model routing savings (route simple tasks to cheaper models).

Optional Templates

Use these when the team needs more ceremony or a tighter handoff:

  • AI build brief: translate PM intent into scoped engineering work.
  • AI PM review checklist: pressure-test roadmap, design, engineering, legal, or launch meetings.
  • Prompt change record: review, test, roll out, and roll back production prompt changes.

Evidence Hierarchy

Every readiness score, risk assessment, and launch gate decision depends on evidence. Not all evidence is equal. When claims conflict, higher-tier evidence wins.

TierSourceExampleWeight
1Measured dataProduction metrics, A/B test results, eval scores from a labeled datasetHighest
2Direct user evidenceUser interview transcripts, observed behavior in usability tests, support ticketsHigh
3Structured analysisCost models with documented assumptions, competitive analysis with named sourcesMedium
4Stakeholder inputVerbal direction from leadership, sales team anecdotes, partner feedbackMedium-low
5PM judgmentTeam intuition, pattern matching from prior products, industry conventionLowest

Use this hierarchy when filling out any template that asks for evidence: readiness assessments, risk registers, eval plans, opportunity briefs, and decision records. Tag your evidence honestly. A readiness score backed by Tier 1 evidence is worth more than the same score backed by Tier 5.

When a Tier 4 or 5 claim drives a launch decision, flag it. The goal is not to eliminate judgment — it is to make clear when the team is relying on it so the decision can be revisited when harder evidence arrives.

When filling in readiness assessments, tag each piece of evidence with its tier using the shorthand [T1] through [T5]. This makes it immediately visible which claims are backed by data and which are assumptions. See the example readiness assessments for this convention in practice.

Readiness Model

Readiness is scored across 11 dimensions.

Each dimension requires

  • Score from 0 to 5.
  • Evidence (with source tier from the evidence hierarchy above).
  • Risk.
  • Owner.
  • Next action.

Score Scale

ScoreMeaning
0Missing
1Mentioned but undefined
2Drafted but not validated
3Defined with partial evidence
4Launch-ready with strong evidence
5Live, measured, owned, and improving

Readiness Levels

ScoreLevelMeaning
Below 2.0Not readyDo not build or launch yet
2.0 - 2.69Prototype onlyExplore internally
2.7 - 3.29Pilot candidateWorth preparing for pilot, but close blockers first
3.3 - 3.69Pilot ready with conditionsLimited rollout with close review
3.7 - 4.49Limited production readyControlled customer launch
4.5 - 5.0Scale readyReady to expand with monitoring

Dimensions

DimensionPM question
Problem fitDoes AI solve a real user problem better than the current workflow?
Workflow fitDoes the AI fit into a workflow with review, correction, and fallback?
AI job definitionIs the AI task specific enough to evaluate and ship?
Data readinessIs the required data available, permissioned, fresh, and reliable?
Eval readinessCan the team measure quality before and after launch?
System behaviorAre model, retrieval, tools, latency, and fallback behavior specified?
Risk and safetyAre likely harms, misuse cases, and mitigations owned? For agents: are tool boundaries, goal hijacking, and error cascading addressed?
Regulatory readinessIs the risk classification determined? Are data provenance, transparency, and compliance requirements met?
Cost and business caseDoes the workflow have credible unit economics? For agents: is the agentic cost multiplier modeled?
ObservabilityCan the team see whether the AI is working in production?
Launch and operationsIs there a staged rollout and post-launch operating cadence?

Suggested Weights

DimensionWeight
Problem fit1.1
Workflow fit1.1
AI job definition1.2
Data readiness1.2
Eval readiness1.5
System behavior1.0
Risk and safety1.4
Regulatory readiness1.3
Cost and business case1.0
Observability1.3
Launch and operations1.2

Eval readiness is weighted highest because it is the most common gap between demo and production. Risk, regulatory readiness, observability, data, and launch operations are also heavily weighted because AI products degrade in real workflows. Regulatory readiness is weighted at 1.3 because AI-specific regulation, privacy law, and sector rules can materially change what is safe or legal to launch.

The readiness levels have narrow score ranges by design, which means a small scoring difference can shift the recommendation by a level. This is intentional — the score is a guide for structured discussion, not a bright line. When a score falls near a boundary, use the individual dimension scores and hard blockers to make the call. The healthcare intake example demonstrates this: a borderline score of 1.95 fell just below the "prototype only" threshold, but the team had enough problem definition to learn from a synthetic-data prototype, so the recommendation was elevated to "prototype only" with explicit conditions.

Launch Gates

The readiness score is not a simple average. Some gaps block launch.

Block Customer-facing Production If

  • Eval readiness is below 3.
  • No human review exists for high-impact or irreversible actions.
  • No fallback path exists for low-confidence outputs.
  • Observability is below 3.
  • Data permissions are unclear.
  • Cost per task is unknown for expected usage.
  • User harm or compliance risks have no owner.
  • There is no post-launch quality review cadence.
  • Regulatory classification or compliance path is undetermined.
  • No staged rollout plan exists (shadow mode, canary, or cohort-based ramp).
  • No rollback mechanism is tested.

Block Pilot If

  • The AI job is not clearly defined.
  • There is no minimum eval set or rubric for the pilot scope.
  • The team cannot identify unacceptable outputs.
  • No pilot success metric exists.
  • No owner is assigned to review pilot failures.
  • Low-confidence fallback behavior is missing for customer-facing or high-risk workflows.
  • The team cannot observe accept, edit, reject, escalation, cost, and failure signals during the pilot.
  • For agentic products: tool boundaries and escalation paths are not defined.

Block Scale-up If

  • Pilot quality metrics are not reviewed.
  • Cost is materially above plan without a business decision.
  • Users are bypassing, ignoring, or heavily correcting outputs.
  • Support or incident volume is rising.
  • Regression checks are not in place for major changes.
  • No incident response process has been tested (at least one simulated incident).
  • No near-miss capture process exists.

Engineering Delivery With GRIT

GRIT is the recommended companion framework for AI-assisted engineering delivery.

AI PM Playbook and GRIT solve different parts of the same problem.

FrameworkOwns
AI PM PlaybookWhat gets built, why, what quality means, and when it is ready
GRITHow AI-assisted engineering specifies, tests, implements, reviews, hardens, and ships

GRIT is useful for PMs building with AI because AI product failure often happens in the gap between product intent and AI-generated implementation. A strong PRD can still fail if implementation happens from vague context, without failing tests, without adversarial review, or without a hardening pass.

What AI PMs Should Borrow From GRIT

1. Challenge The Premise

Before writing a PRD or build brief, answer:

PROBLEM: What problem does this solve, for whom, and how do they solve it today?
SMALLEST: What is the smallest version that proves value?
EXISTS: Does this already exist in the product, workflow, or codebase?
NULL CASE: What happens if we do nothing?

2. Use A Behavioral Contract

Before engineering starts, define:

  • What the AI feature does.
  • Inputs and outputs.
  • Good, bad, and unacceptable examples.
  • Edge cases.
  • Non-goals.
  • Human review rules.
  • Eval requirements.
  • Launch gates.
  • Open questions.

3. Require Tests And Evals Before Trust

For AI products, verification includes both software tests and product evals:

  • Unit tests for deterministic logic.
  • Integration tests for workflow behavior.
  • Eval sets for model output quality.
  • Safety tests for unacceptable outputs.
  • Regression tests for previous failures.

If a bug or quality failure is not reproduced, the fix is not yet credible.

4. Use Adversarial Review

Before launch, ask a fresh reviewer to attack:

  • Spec ambiguity.
  • Missing edge cases.
  • Weak evals.
  • Unsafe autonomy.
  • Missing fallback behavior.
  • Product risks not covered by implementation.
  • Scope that quietly expanded beyond the PRD.

5. Route Findings Back To The Right Artifact

FindingUpdate
Problem is unclearOpportunity brief or PRD
Behavior is underspecifiedAI PRD
Quality is unmeasuredEval plan
Risk is unmanagedRisk register
Implementation violates intentCode
Launch criteria are weakLaunch gate checklist

6. Add A Hardening Pass

Before production launch, review:

  • Rate limits.
  • Timeouts.
  • Fallback behavior.
  • PII handling.
  • Audit logging.
  • Permission checks.
  • Cost guardrails.
  • Graceful degradation.
  • Incident paths.
  • Regression checks.

Link to GRIT when:

  • AI agents are writing production code.
  • The feature touches auth, payments, user data, agents, integrations, or business logic.
  • The team uses Claude Code, Codex, Cursor, Windsurf, or similar tools for implementation.
  • Multiple agents or workstreams are working on one feature.
  • The team needs a repeatable review process for AI-generated pull requests.

Do not force GRIT for

  • Copy changes.
  • Small documentation edits.
  • Disposable prototypes.
  • Low-risk UI polish with no behavior change.

Example Case Studies

Examples are where the playbook becomes useful. They show concrete tradeoffs, not generic optimism.

The examples in this repo are synthetic but realistic. Names, dates, companies, metrics, and reviewers are illustrative. They are included to demonstrate how the artifacts should reason through product tradeoffs, not to claim real customer results.

Customer Support Copilot

Why it belongs

  • Common AI workflow.
  • Clear human review model.
  • Strong need for evals and observability.
  • Good example of grounding, citations, and cost.

What it demonstrates

  • AI drafts, human sends.
  • Sources are visible.
  • Low-confidence answers escalate.
  • Agent edits become quality feedback.
  • Launch recommendation is pilot candidate after eval, fallback, and observability blockers close.
  • Week-2 post-launch review shows how pilot metrics turn into operating decisions.

Sales Call CRM Assistant

Why it belongs

  • Shows workflow automation and structured output.
  • Requires user trust and editability.
  • Has privacy, consent, and data quality concerns.
  • Easy to connect to business outcomes.

What it demonstrates

  • AI summarizes calls and suggests CRM updates.
  • High-risk fields require approval.
  • Commitments and next steps are grounded in transcript evidence.
  • Success is measured by rep time saved and CRM accuracy.

Healthcare Intake Assistant

Why it belongs

  • Shows high-risk product judgment.
  • Forces strong boundaries.
  • Demonstrates when not to launch.

What it demonstrates

  • AI collects structured intake information.
  • It does not diagnose or recommend treatment.
  • Escalation rules are explicit.
  • Audit and consent requirements are clear.
  • Recommendation may be "prototype only" until safety and compliance gates are met.

Current Repo Structure

ai-pm-playbook/
  README.md
  LICENSE
  ai-pm-playbook.md

  docs/
    before-you-vibe-code.md
    00-walkthrough.md
    01-eval-design.md
    02-agentic-products.md
    03-operating-ai-products.md
    04-launch-gates.md
    05-prompt-craft.md
    06-bad-to-good-ai-prd.md
    07-error-analysis.md
    08-artifact-flow-map.md
    09-agent-pm-starter-pack.md
    10-ai-native-pm-loop.md

  templates/
    ai-opportunity-brief.md
    ai-prd.md
    ai-eval-plan.md
    human-review-workflow.md
    launch-gate-checklist.md
    ai-observability-plan.md
    ai-cost-model.md
    optional/
      ai-build-brief.md
      ai-pm-review-checklist.md
      prompt-change-record.md

  examples/
    customer-support-copilot/
      README.md
      opportunity-brief.md
      prd.md
      eval-plan.md
      cost-model.md
      launch-gate.md
      post-launch-review-week-2.md
      readiness-assessment.yaml

    sales-call-crm-assistant/
      README.md
      opportunity-brief.md
      prd.md
      eval-plan.md
      launch-gate.md
      readiness-assessment.yaml

    healthcare-intake-assistant/
      README.md
      opportunity-brief.md
      prd.md
      eval-plan.md
      launch-gate.md
      readiness-assessment.yaml

  schema/
    readiness-assessment.schema.json

License

MIT.

Link copied