Readiness assessment (YAML)

Customer Support Copilot
readiness-assessment.yamlYAML
product:
  name: "Customer Support Copilot"
  stage: "prototype"
  owner: "Maya Patel, PM"
  target_users:
    - "Tier 1 support agents"

use_case:
  problem: "Agents spend 38% of handle time drafting replies from scattered knowledge base articles. Top 50 intents cover 62% of ticket volume."
  ai_job: "Draft support responses using approved knowledge base articles for Tier 1 agents to review and send, subject to human approval before any customer-facing message."
  non_ai_alternative: "Agents search knowledge base manually, read articles, and write responses from scratch. Templates were tried and failed (adoption dropped from 22% to 11%)."
  expected_outcome: "Reduce drafting time from 3.1 minutes to under 1.5 minutes for supported intents. 75% draft accept rate. Zero hallucinated policy claims."

dimensions:
  problem_fit:
    score: 4
    evidence:
      - "[T1] 4-week time study across 12 agents and 2,400 tickets confirms 38% of handle time on drafting"
      - "[T1] Top 50 intents cover 62% of volume; top 10 cover 31%"
      - "[T1] Template approach tried and failed: adoption dropped from 22% to 11% over 3 months"
      - "[T2] Agent surveys cite repetitive drafting as top frustration"
    risks:
      - "Benefit will be uneven across agents. Some draft faster than average."
      - "Less useful for complex, low-volume tickets outside top 50 intents"
    owner: "PM"
    next_action: "Segment expected impact by intent tier to set realistic pilot expectations"

  workflow_fit:
    score: 4
    evidence:
      - "[T3] Clear human-in-the-loop: AI drafts, agent reviews, agent sends"
      - "[T3] Integrates into existing support platform reply panel"
      - "[T3] Agent can accept, edit, reject, or escalate at every step"
    risks:
      - "Slow drafts (>4s) may cause agents to ignore the feature entirely"
      - "High-pressure shifts may lead to rubber-stamping without careful review"
    owner: "Design"
    next_action: "Add draft loading indicator and measure time-to-first-interaction"

  ai_job_definition:
    score: 4
    evidence:
      - "[T3] AI job statement is specific and testable"
      - "[T3] Input contract, output contract, and autonomy level defined"
      - "[T3] Clear boundary: AI suggests, human decides and sends"
      - "[T3] Scope limited to top 10 intents for v1"
    risks:
      - "Multi-intent tickets (18% of volume) handling not fully specified"
      - "Confidence threshold of 0.6 is assumed, not empirically validated"
    owner: "ML Lead"
    next_action: "Run prototype against 50 multi-intent tickets and decide handling strategy"

  data_readiness:
    score: 3
    evidence:
      - "[T1] Knowledge base has 400+ articles covering all supported intents"
      - "[T1] 6 months of historical ticket data available for eval set creation"
      - "[T1] Vector embedding pipeline for KB retrieval functional in staging"
    risks:
      - "15% of KB articles are outdated based on spot check"
      - "No automated freshness check or deprecation flag on articles"
      - "Article quality is inconsistent: some clear, some rambling"
    owner: "Support Lead + Engineering"
    next_action: "Audit and update KB articles for top 10 intents. Flag deprecated articles in retrieval index."

  eval_readiness:
    score: 2
    evidence:
      - "[T3] Quality rubric drafted with 4 dimensions"
      - "[T3] 5 golden examples defined covering required categories"
      - "[T3] Automated checks specified: citation validity, PII detection, length bounds"
    risks:
      - "No golden dataset built yet. 5 examples are illustrative, not a scored eval set."
      - "No labeled reference responses from graders"
      - "Regression suite has only 5 cases, target is 50"
      - "Inter-rater reliability untested"
    owner: "ML Lead + Support Lead"
    next_action: "Build 100-example golden eval set with labeled reference responses. Run inter-rater reliability test."

  system_behavior:
    score: 3
    evidence:
      - "[T3] Model selected (Claude Sonnet) with estimated token costs"
      - "[T1] Retrieval pipeline functional in staging"
      - "[T3] Latency targets defined: p50 <2s, p95 <4s, timeout 6s"
      - "[T3] Timeout fallback behavior specified"
    risks:
      - "Low-confidence fallback not implemented. Prototype shows draft regardless of score."
      - "Safety trigger detection not tested against adversarial inputs"
      - "No load testing done"
    owner: "Engineering"
    next_action: "Implement low-confidence fallback. Test safety triggers. Run load test at 2x expected usage."

  risk_and_safety:
    score: 3
    evidence:
      - "[T3] Main risks identified: hallucination, outdated citations, PII echo, safety-sensitive tickets"
      - "[T3] Mitigation strategy defined for each risk"
      - "[T3] Human review on every response is a strong backstop"
    risks:
      - "No formal risk register with severity ratings"
      - "PII filtering relies on regex, will miss some formats"
      - "No incident response process for hallucination reports"
    owner: "PM + Engineering"
    next_action: "Create formal risk register. Define incident response process. Test PII filter against 50 varied examples."

  regulatory_readiness:
    score: 2
    evidence:
      - "[T3] Customer-facing AI disclosure requirement identified"
      - "[T3] PII handling risks listed in PRD and risk notes"
      - "[T3] Human review required before any customer-facing message is sent"
    risks:
      - "Prompt and output retention policy not confirmed."
      - "Customer-facing AI disclosure language not reviewed by legal."
      - "Tenant data handling review not completed."
    owner: "Legal + Security"
    next_action: "Confirm disclosure, retention, and tenant data handling requirements before pilot."

  cost_and_business_case:
    score: 3
    evidence:
      - "[T3] Estimated cost per draft: $0.0094 baseline, $0.03 upper-bound target"
      - "[T3] Monthly cost for full team: $51 baseline for top-10-intent v1, about $103 if coverage expands to top 50 intents"
      - "[T5] Expected 20% handle time reduction on supported intents"
    risks:
      - "Cost estimate based on assumed token counts, not measured"
      - "Business case relies on handle time reduction not yet measured in practice"
      - "Expansion beyond top 10 intents depends on quality and KB coverage, not just cost"
    owner: "PM + Engineering"
    next_action: "Measure actual token usage on 200 prototype runs. Validate supported-intent volume during pilot and model cost at 50-intent scenario."

  observability:
    score: 2
    evidence:
      - "[T1] Latency logging live in staging"
      - "[T1] API cost tracking functional"
      - "[T3] Metric definitions exist for accept/edit/reject/retry/escalation"
    risks:
      - "Accept/edit/reject tracking not implemented"
      - "Citation click tracking not implemented"
      - "No dashboard exists"
      - "No alerting configured"
    owner: "Engineering"
    next_action: "Implement accept/edit/reject event logging. Build dashboard with 8 key metrics. Configure alerts."

  launch_and_operations:
    score: 2
    evidence:
      - "[T4] Pilot group identified: 8 agents on day shift"
      - "[T5] Pilot duration planned: 2 weeks"
      - "[T5] Post-launch review cadence mentioned in eval plan"
    risks:
      - "No written pilot plan with success criteria or rollback triggers"
      - "No agent training plan or materials"
      - "No go/no-go decision process defined"
      - "Post-launch review owner not assigned"
    owner: "PM + Support Lead"
    next_action: "Write pilot plan with success criteria and rollback triggers. Create agent training session. Assign review owner."

recommendation:
  level: "pilot_candidate"
  weighted_score: 2.86
  rationale: "The problem, workflow, and AI job are strong enough to justify a pilot, but the team should not start that pilot until eval, observability, regulatory review, and fallback blockers close."
  blockers_before_pilot:
    - "Build 100-example golden eval set with labeled reference responses"
    - "Implement accept/edit/reject event logging and build basic dashboard"
    - "Implement low-confidence fallback behavior"
    - "Confirm disclosure, retention, and tenant data handling requirements"
  blockers_before_production:
    - "Pilot runs 2+ weeks with online metrics meeting targets"
    - "No confirmed hallucination reports during pilot"
    - "Agent feedback is net positive"
    - "Formal risk register with incident response process"
    - "250-example eval set passes at pilot thresholds"
    - "Load testing at 2x expected concurrent usage"
    - "Safety triggers tested against adversarial inputs"
  alternative_recommendation: "Invest in knowledge base quality and search UX. Better KB search with snippet previews may capture 40-60% of the drafting time savings without model risk. Evaluate whether this closes enough of the gap before committing to an LLM-based copilot."
PreviousLaunch gate NextPost-launch review (week 2)