AI PM Playbook

AI observability plan

Core template

Use this to define what you monitor after launch. Set this up before shipping, not after the first incident.

Upstream: metric targets come from the AI PRD and cost model. Downstream: production signals trigger updates to the eval plan, launch gates, and roadmap decisions.

Product signals

| Metric | Definition | Target | Alert threshold | How measured |
|---|---|---|---|---|
| Adoption | % of eligible users who triggered the AI feature this week | e.g., 40% by week 4 | e.g., drops below 20% | e.g., event tracking in product analytics |
| Task completion | % of AI-initiated tasks where the user reached their goal | e.g., >= 75% | e.g., < 60% for 3 consecutive days | |
| Accept rate | % of AI outputs accepted without edits | e.g., >= 65% | e.g., < 50% | |
| Edit rate | % of AI outputs accepted after user edits | e.g., < 25% | e.g., > 35% | |
| Reject rate | % of AI outputs rejected entirely | e.g., < 15% | e.g., > 25% | |
| Retry rate | % of tasks where the user retried after the initial output | e.g., < 10% | e.g., > 20% | |
| Escalation rate | % of tasks escalated to a human or support | e.g., < 5% | e.g., > 10% | |
| User-reported issues | Count of bug reports or complaints about AI output | e.g., < 3/week | e.g., > 5 in a day | |
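
The accept, edit, and reject rates above can be derived from logged outcomes. The following is a minimal sketch, assuming each AI output is logged with one of three hypothetical labels (`accepted`, `edited`, `rejected`); the thresholds are the example values from the table, not recommendations.

```python
from collections import Counter

# Example alert thresholds copied from the table above (illustrative only).
# Each entry maps a metric to (direction that triggers the alert, limit as a fraction).
ALERTS = {
    "accept_rate": ("below", 0.50),
    "edit_rate": ("above", 0.35),
    "reject_rate": ("above", 0.25),
}

def outcome_rates(outcomes):
    """Return accept/edit/reject rates from a list of outcome labels."""
    if not outcomes:
        raise ValueError("no outcomes logged for this period")
    counts = Counter(outcomes)
    total = len(outcomes)
    return {
        "accept_rate": counts["accepted"] / total,
        "edit_rate": counts["edited"] / total,
        "reject_rate": counts["rejected"] / total,
    }

def breached(rates, alerts=ALERTS):
    """Return the sorted names of metrics that crossed their alert threshold."""
    hits = []
    for name, (direction, limit) in alerts.items():
        value = rates[name]
        if direction == "below" and value < limit:
            hits.append(name)
        elif direction == "above" and value > limit:
            hits.append(name)
    return sorted(hits)

# Hypothetical sample: 9 accepted, 8 edited, 3 rejected outputs.
outcomes = ["accepted"] * 9 + ["edited"] * 8 + ["rejected"] * 3
rates = outcome_rates(outcomes)
print(rates, breached(rates))
```

Raising on an empty period is a deliberate choice here: a week with no logged outcomes is more likely a tracking gap than a 0% accept rate.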

System signals

| Metric | Definition | Target | Alert threshold | How measured |
|---|---|---|---|---|
| Latency (p50) | Median response time for AI task | e.g., < 2s | e.g., > 4s | |
| Latency (p95) | 95th percentile response time | e.g., < 6s | e.g., > 12s | |
| Cost per task | Average model + retrieval cost per AI invocation | e.g., $0.04 | e.g., > $0.08 | |
| Cost per customer/month | Total AI cost attributed per paying customer | e.g., < $1.50 | e.g., > $3.00 | |
| Token usage per request | Average input + output tokens per AI call | e.g., 2,000 input, 500 output | e.g., > 2x baseline | |
| Error rate | % of requests returning errors, timeouts, or malformed responses | e.g., < 1% | e.g., > 2% sustained 1 hour | |
| Quality score | Automated eval score on sampled production outputs | e.g., >= 90% | e.g., < 85% | |
| Regression failures | Count of outputs that fail regression suite on production data | e.g., 0 | e.g., > 2 in a week | |
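
To make the latency and cost rows concrete, here is a rough sketch that computes p50 and p95 with the nearest-rank method and checks them against the example alert thresholds above. The function names and the nearest-rank choice are assumptions for illustration; a metrics backend may compute percentiles differently.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def system_alerts(latencies_s, costs_usd):
    """Return the names of system signals that exceed the example alert thresholds."""
    alerts = []
    if percentile(latencies_s, 50) > 4.0:    # e.g., p50 > 4s
        alerts.append("latency_p50")
    if percentile(latencies_s, 95) > 12.0:   # e.g., p95 > 12s
        alerts.append("latency_p95")
    if sum(costs_usd) / len(costs_usd) > 0.08:  # e.g., cost per task > $0.08
        alerts.append("cost_per_task")
    return alerts

# Hypothetical sample: most requests are fast, two are very slow.
latencies = [1.0] * 18 + [13.0, 15.0]
costs = [0.04] * 20
print(system_alerts(latencies, costs))
```

The sample shows why p95 is tracked separately from p50: the median stays at 1s while the tail crosses the alert threshold.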

Safety signals

| Signal | Definition | Target | Alert threshold | Response |
|---|---|---|---|---|
| PII leakage | Outputs containing detected PII patterns | 0 | Any single occurrence | e.g., immediate investigation, disable feature if confirmed |
| Prompt injection attempts | Inputs matching known injection patterns | Informational | e.g., > 10/day from single user | |
| Content policy violations | Outputs flagged by content safety filters | 0 | Any single occurrence | |
| Out-of-scope actions | Agent takes action outside defined boundaries | 0 | Any single occurrence | |
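
As an illustration of "detected PII patterns", the sketch below scans an output for two simple patterns. The patterns and the `detect_pii` name are assumptions chosen for brevity; they catch only obvious cases, and a production system would more likely rely on a dedicated PII detection tool.

```python
import re

# Deliberately simple, illustrative patterns. They will miss many real PII
# formats and may flag some non-PII strings.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def detect_pii(text):
    """Return the sorted names of PII patterns found in the text."""
    return sorted(name for name, pattern in PII_PATTERNS.items() if pattern.search(text))

print(detect_pii("Please contact jane@example.com about the refund."))
```

Because the alert threshold for PII leakage is any single occurrence, a non-empty result from a check like this would feed the critical alert path directly rather than a daily aggregate.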

Version tracking

| Component | Current version | Last changed | Change log |
|---|---|---|---|
| Model | e.g., Claude Sonnet 4.6 | date | link to change record |
| System prompt | e.g., v2.3 | date | link to diff |
| Retrieval pipeline | e.g., v1.1 | date | |
| Eval suite | e.g., 100 examples, last run date | date | |

Drift detection

  • Eval cadence: e.g., run golden set weekly against production, compare to baseline
  • Output sampling: e.g., log and store 10% of production outputs for manual review
  • Distribution monitoring: e.g., track output length, confidence score distribution, and vocabulary patterns week over week
  • Trigger for investigation: e.g., any sustained metric movement > 5% from baseline over 3+ days
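
The investigation trigger can be expressed as a small check over daily metric values. This sketch assumes "5% from baseline" means relative change and "3+ days" means consecutive days; both readings are assumptions, so adjust them to match how your team defines the trigger.

```python
def sustained_drift(daily_values, baseline, tolerance=0.05, min_days=3):
    """Return True if any run of min_days consecutive values deviates from
    the baseline by more than tolerance (relative change)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero for a relative comparison")
    run = 0
    for value in daily_values:
        if abs(value - baseline) / abs(baseline) > tolerance:
            run += 1
            if run >= min_days:
                return True
        else:
            run = 0  # a day back within tolerance resets the streak
    return False

# Hypothetical quality scores against a 0.90 baseline.
print(sustained_drift([0.90, 0.84, 0.85, 0.84], baseline=0.90))
print(sustained_drift([0.84, 0.90, 0.84, 0.84], baseline=0.90))
```

Requiring a consecutive streak keeps a single noisy day from opening an investigation, at the cost of reacting a few days later to a real shift.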

Tracing and debugging

  • Trace ID: e.g., every request gets a unique trace ID that links input, retrieval results, model call, output, and user action
  • Full request logging: e.g., log 100% of requests for first 2 weeks, then sample 10% at steady state
  • Retention period: e.g., 90 days for full traces, 1 year for aggregate metrics
  • Privacy constraints: e.g., PII scrubbed before storage, access restricted to on-call engineers
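
One way to picture the trace ID and logging policy is the sketch below. The `Trace` fields and the `should_log_full` helper are hypothetical names; the 14-day ramp and 10% steady-state rate mirror the example values above. Sampling is done by hashing the trace ID so the decision for a given request is repeatable.

```python
import hashlib
import uuid
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Trace:
    """One record linking input, retrieval results, model call, output, and user action."""
    user_input: str
    retrieval_ids: list
    model_version: str
    output: str
    user_action: str
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)

def should_log_full(trace_id, launch_date, today, ramp_days=14, steady_percent=10):
    """Log every request during the ramp period, then a fixed percentage afterwards."""
    if (today - launch_date).days < ramp_days:
        return True
    # Hash the trace ID into one of 100 buckets; keep the lowest steady_percent buckets.
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 100
    return bucket < steady_percent

trace = Trace("reset my password", ["doc-12"], "v2.3", "Here is how...", "accepted")
print(trace.trace_id, should_log_full(trace.trace_id, date(2025, 1, 1), date(2025, 1, 10)))
```

PII scrubbing is omitted from this sketch; per the privacy constraint above, it would need to run before a trace is written to storage.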

Trace review loop

| Question | Answer |
|---|---|
| Who reviews traces? | |
| Sampling cadence | e.g., daily first 2 weeks, weekly after |
| Sample size | e.g., 20-50 traces per review |
| Sampling method | random, high-risk only, rejected outputs, high-cost traces, escalations |
| What gets labeled? | output quality, retrieval quality, tool use, handoff, priority, cost, safety |
| Where labels go | eval set, PRD risk table, launch gate, support process |
| Eval update trigger | e.g., recurring failure appears 3+ times or any high-severity failure |

| Trace category | What to inspect | Action if recurring |
|---|---|---|
| Rejected or ignored outputs | Did the AI miss intent, evidence, tone, or policy? | Add eval case or change workflow |
| Edited outputs | What did the human correct? | Add correction pattern to golden set |
| Escalations | Should the AI have escalated earlier? | Update handoff rule or eval |
| High-cost traces | Did retries, context, or tools inflate cost? | Add cost guardrail |
| Safety flags | Did guardrails or review catch it? | Block launch or update risk mitigation |
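
The eval update trigger can be applied mechanically to the labels from a review session. The sketch below assumes each reviewed trace yields a hypothetical `(failure_mode, severity)` pair and uses the example rule: a failure mode qualifies if it appears 3 or more times or if any instance is high severity.

```python
from collections import Counter

def eval_update_candidates(labels, min_recurrence=3):
    """Return the sorted failure modes that should be added to the eval set.

    labels: list of (failure_mode, severity) tuples from one trace review.
    """
    counts = Counter(mode for mode, _ in labels)
    high_severity = {mode for mode, severity in labels if severity == "high"}
    return sorted(
        mode for mode in counts
        if counts[mode] >= min_recurrence or mode in high_severity
    )

# Hypothetical labels from one weekly review.
labels = (
    [("missed_intent", "low")] * 3
    + [("wrong_tone", "low")]
    + [("pii_in_output", "high")]
)
print(eval_update_candidates(labels))
```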

Alert routing

| Severity | Response time | Who gets paged | Example |
|---|---|---|---|
| Critical | e.g., 15 min | e.g., on-call engineer + PM | e.g., PII leakage, safety failure, data breach |
| High | e.g., 1 hour | e.g., on-call engineer | e.g., quality score below threshold, error rate spike |
| Medium | e.g., next business day | e.g., PM | e.g., accept rate declining, cost trending up |
| Low | e.g., next review meeting | e.g., PM | e.g., adoption below target, minor latency increase |
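
The routing table translates naturally into configuration. This is a sketch only: the signal names are hypothetical identifiers, the values mirror the example rows above, and treating unknown signals as high severity is an assumption made so that unmapped alerts are not silently dropped.

```python
# Example routing rules mirroring the table above (illustrative values).
ROUTING = {
    "critical": {"respond_within": "15 min", "notify": ["on-call engineer", "PM"]},
    "high": {"respond_within": "1 hour", "notify": ["on-call engineer"]},
    "medium": {"respond_within": "next business day", "notify": ["PM"]},
    "low": {"respond_within": "next review meeting", "notify": ["PM"]},
}

# Hypothetical signal identifiers mapped to a severity level.
SIGNAL_SEVERITY = {
    "pii_leakage": "critical",
    "quality_score_below_threshold": "high",
    "accept_rate_declining": "medium",
    "adoption_below_target": "low",
}

def route(signal):
    """Return severity, response window, and recipients for a signal.

    Unknown signals default to "high" (an assumption, see lead-in).
    """
    severity = SIGNAL_SEVERITY.get(signal, "high")
    return {"severity": severity, **ROUTING[severity]}

print(route("pii_leakage"))
```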

Review cadence

  • Daily (first 2 weeks post-launch): check product signals, review any alerts
  • Weekly: sample 20 outputs for manual review, review cost trends
  • Monthly: full eval re-run on production data, update quality baselines
  • On every model/prompt change: run regression suite before deploy
  • After any incident: add failure case to eval set, review related signals for missed warning signs
  • After any usage spike: review cost and latency impact, check for degradation under load

Weekly post-launch review

Feature: name
Week of: YYYY-MM-DD
Author: name

What changed

Metrics summary

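| Metric | This week | Last week | Target | Status |
|---|---|---|---|---|
| Active users | | | | |
| Tasks completed | | | | |
| Accept rate | | | | |
| Edit rate | | | | |
| Reject rate | | | | |
| Retry rate | | | | |
| Escalation rate | | | | |
| Cost per task | | | | |
| p95 latency | | | | |
| Automated eval score | | | | |
| Manual review score | | | | |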

Incidents and near misses

| Date | Description | Severity | Resolution | Follow-up |
|---|---|---|---|---|

Top failure modes

Decisions and next actions

| Decision or action | Owner | Due | Reversal or review trigger |
|---|---|---|---|