Agent Replay Evaluation Harness | DossierKit Demo Expert

Goal

Convert production task traces into offline replay evaluations so teams can understand risk when changing prompts, tool schemas, or model versions.

Design

Each replay sample records the user goal, context, available tools, expected intermediate behavior, final-output assertions, and human fallback policy. Results are aggregated by task type, tool type, and failure reason.
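
To make the sample shape and the aggregation concrete, here is a minimal sketch. The class names, field names, and the `aggregate` helper are illustrative assumptions, not the harness's actual schema; the only parts taken from the design are the six sample components and the three aggregation dimensions.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReplaySample:
    """One offline replay case built from a production trace (hypothetical schema)."""
    user_goal: str
    context: str
    available_tools: list[str]
    expected_steps: list[str]      # expected intermediate behavior, e.g. tool names in order
    final_assertions: list[str]    # checks the final output must satisfy
    human_fallback_policy: str     # e.g. "escalate_to_agent"
    task_type: str = "unknown"     # assumed label used for aggregation


@dataclass
class ReplayResult:
    sample: ReplaySample
    tools_used: list[str]
    failure_reason: Optional[str] = None  # None means the sample passed


def aggregate(results: list[ReplayResult]) -> dict[str, Counter]:
    """Count failed samples by task type, tool, and failure reason."""
    by_task, by_tool, by_reason = Counter(), Counter(), Counter()
    for r in results:
        if r.failure_reason is None:
            continue  # only failures are counted
        by_task[r.sample.task_type] += 1
        by_reason[r.failure_reason] += 1
        for tool in set(r.tools_used):  # count each tool once per failed sample
            by_tool[tool] += 1
    return {"task_type": by_task, "tool_type": by_tool, "failure_reason": by_reason}
```

Counting only failures keeps the report focused on where risk concentrates; a real harness would likely also track totals so that failure rates can be computed per group.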

Release Gate

The replay harness should be part of the platform release gate and run on any change to prompts, tool schemas, model versions, permission policy, or human handoff rules.
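
One way such a gate could be wired up is sketched below. The change-kind labels, the function names, and the 1% regression threshold are assumptions chosen for illustration; the source only specifies which kinds of change the gate covers.

```python
# Change kinds the release gate covers (labels are hypothetical).
GATED_CHANGE_KINDS = {
    "prompt",
    "tool_schema",
    "model_version",
    "permission_policy",
    "human_handoff_rule",
}


def requires_replay(changed_kinds: set[str]) -> bool:
    """True if any changed item falls under the release gate."""
    return bool(changed_kinds & GATED_CHANGE_KINDS)


def gate_decision(
    changed_kinds: set[str],
    baseline_pass_rate: float,
    candidate_pass_rate: float,
    max_regression: float = 0.01,  # assumed tolerance, not from the source
) -> str:
    """Return "skip", "block", or "allow" for a proposed release."""
    if not requires_replay(changed_kinds):
        return "skip"
    if baseline_pass_rate - candidate_pass_rate > max_regression:
        return "block"
    return "allow"
```

For example, `gate_decision({"prompt"}, 0.95, 0.80)` would block the release, while a change that touches only documentation would skip the replay run. A single pass-rate comparison is a simplification; a team might instead gate on per-task-type or per-failure-reason regressions.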

Lesson

Final-answer checks alone miss important risks. Agent evaluation must also inspect the process, tool use, side effects, and recovery strategy.
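
As a rough illustration of a process-level check, the sketch below inspects a replayed trace for disallowed tools, forbidden side effects, and failures with no follow-up step. The `Step` structure, the `check_process` name, and the rule that a failed final step counts as "no recovery" are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    """One step of a replayed trace (hypothetical structure)."""
    tool: str
    ok: bool = True
    side_effect: Optional[str] = None  # e.g. "payment_issued"


def check_process(
    steps: list[Step],
    allowed_tools: set[str],
    forbidden_side_effects: set[str],
) -> list[str]:
    """Return a list of process problems; an empty list means no problem found."""
    problems = []
    for i, step in enumerate(steps):
        if step.tool not in allowed_tools:
            problems.append(f"step {i}: tool '{step.tool}' not allowed")
        if step.side_effect is not None and step.side_effect in forbidden_side_effects:
            problems.append(f"step {i}: forbidden side effect '{step.side_effect}'")
        # Simplified recovery rule: a failure on the last step means
        # the agent never retried or fell back.
        if not step.ok and i == len(steps) - 1:
            problems.append(f"step {i}: failure with no recovery step")
    return problems
```

A trace can produce a plausible final answer and still fail every one of these checks, which is the gap that final-answer assertions leave open.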