Replay Eval
Agent Replay Evaluation Harness
Built a reproducible agent regression evaluation from production traces, covering prompt, tool, and model changes.
Goal
Convert production task traces into offline replay evaluations so teams can understand risk when changing prompts, tool schemas, or model versions.
Design
Each sample includes user goal, context, available tools, expected intermediate behavior, final output assertions, and human fallback policy. Results are aggregated by task type, tool type, and failure reason.
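The sample contents and the three aggregation axes described above can be sketched as follows. This is a minimal illustration, not the harness's actual schema: the class names, field names, and the `aggregate` helper are all hypothetical, and the source only specifies which kinds of information a sample carries.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReplaySample:
    # Hypothetical field names mirroring the sample contents listed above.
    task_type: str
    user_goal: str
    context: str
    available_tools: list[str]
    expected_tool_calls: list[str]          # expected intermediate behavior
    final_output_must_contain: list[str]    # final output assertions
    human_fallback_policy: str


@dataclass
class ReplayResult:
    # Hypothetical result record for one replayed sample.
    task_type: str
    tool_type: str
    failure_reason: Optional[str] = None    # None means the replay passed


def aggregate(results):
    """Count failed replays by task type, tool type, and failure reason."""
    failures = [r for r in results if r.failure_reason is not None]
    return {
        "by_task_type": Counter(r.task_type for r in failures),
        "by_tool_type": Counter(r.tool_type for r in failures),
        "by_failure_reason": Counter(r.failure_reason for r in failures),
    }
```

Grouping only the failures keeps the report focused on where regressions cluster; a real harness would likely also track totals per group so that failure rates can be compared across runs.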
Release Gate
The replay evaluation should be part of the platform release gate, covering changes to prompts, tool schemas, model versions, permission policies, and human handoff rules.
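One way such a gate could be wired up is sketched below. The change-kind labels, the `gate_decision` function, and the pass-rate comparison with a `max_regression` threshold are assumptions made for illustration; the source states only which kinds of change the gate should cover, not how a pass or fail is decided.

```python
# Hypothetical labels for the change kinds the gate is meant to cover.
GATED_CHANGE_KINDS = {
    "prompt",
    "tool_schema",
    "model_version",
    "permission_policy",
    "human_handoff_rule",
}


def gate_decision(change_kinds, baseline_pass_rate, candidate_pass_rate,
                  max_regression=0.0):
    """Return (allowed, reason) for a release.

    candidate_pass_rate is None when no replay run exists for the change.
    max_regression is an assumed tolerance on the drop in pass rate.
    """
    gated = sorted(set(change_kinds) & GATED_CHANGE_KINDS)
    if not gated:
        return True, "no gated change kinds; replay not required"
    if candidate_pass_rate is None:
        return False, "replay evaluation required for: " + ", ".join(gated)
    drop = baseline_pass_rate - candidate_pass_rate
    if drop > max_regression:
        return False, f"pass rate dropped by {drop:.3f}"
    return True, "replay evaluation passed"
```

For example, `gate_decision(["prompt"], 0.9, None)` blocks the release because a gated change has no replay run, while a change that touches none of the gated kinds is allowed through without one.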
Lesson
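Final-answer checks alone miss important risks: an agent can produce the right answer while calling the wrong tool or causing an unintended side effect. Agent evaluation must also inspect the process, tool calls, side effects, and recovery strategy.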
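The difference between a final-answer check and a process-level check can be illustrated with a small sketch. The `Trace` structure, the `check_trace` function, and the failure labels are hypothetical and stand in for whatever the harness actually records.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    # Hypothetical record of one replayed agent run.
    tool_calls: list            # tool names, in call order
    side_effects: list          # e.g. "record_deleted"
    final_answer: str
    recovered_from_error: bool = True


def check_trace(trace, expected_answer_substring, allowed_tools,
                allowed_side_effects):
    """Return a list of failure labels; an empty list means the trace passed."""
    failures = []
    if expected_answer_substring not in trace.final_answer:
        failures.append("final_answer")
    if any(t not in allowed_tools for t in trace.tool_calls):
        failures.append("unexpected_tool")
    if any(s not in allowed_side_effects for s in trace.side_effects):
        failures.append("unexpected_side_effect")
    if not trace.recovered_from_error:
        failures.append("no_recovery")
    return failures


# The final answer is correct, but the run deleted a record along the way.
trace = Trace(
    tool_calls=["lookup_order"],
    side_effects=["record_deleted"],
    final_answer="Your order ships Monday.",
)
print(check_trace(trace, "ships Monday", {"lookup_order"}, set()))
```

A check on the final answer alone would pass this trace; the process-level check reports `['unexpected_side_effect']`.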