DPO
DPO for Tool-use Preference
Goal
Validate whether production tool-use failures can be systematically converted into chosen/rejected preference pairs and used for post-training or tool-policy optimization.
Sample design
Each sample should record the task goal, context, available tools, business constraints, chosen behavior, rejected behavior, failure reason, and human review result. Sources can include production bad cases, hand-crafted boundary samples, and replays of high-frequency workflows.
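One way to make the sample fields concrete is a small record type. The sketch below is only an illustration under stated assumptions: the class names (ToolCall, PreferenceSample), the review label "approved", the source labels, and the example refund scenario are all hypothetical, and the prompt/chosen/rejected export layout is assumed because many DPO training libraries accept it, not because this project has settled on it.

```python
import json
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    """A single tool invocation (hypothetical shape)."""
    name: str
    arguments: dict[str, Any] = field(default_factory=dict)

    def render(self) -> str:
        # Deterministic JSON so identical calls serialize identically.
        return json.dumps({"tool": self.name, "arguments": self.arguments}, sort_keys=True)


@dataclass
class PreferenceSample:
    """One chosen/rejected pair with the fields listed in the sample design."""
    task_goal: str
    context: str
    available_tools: list[str]
    business_constraints: list[str]
    chosen: ToolCall
    rejected: ToolCall
    failure_reason: str
    human_review: str  # assumed labels: "approved", "rejected", "pending"
    source: str        # assumed labels: "production", "boundary", "replay"

    def is_trainable(self) -> bool:
        # Only export pairs a reviewer approved and whose two sides actually differ.
        return self.human_review == "approved" and self.chosen != self.rejected

    def to_dpo_record(self) -> dict[str, str]:
        prompt = "\n".join([
            f"Goal: {self.task_goal}",
            f"Context: {self.context}",
            f"Tools: {', '.join(self.available_tools)}",
            f"Constraints: {'; '.join(self.business_constraints)}",
        ])
        return {
            "prompt": prompt,
            "chosen": self.chosen.render(),
            "rejected": self.rejected.render(),
        }


# Made-up example: the rejected call refunds more than the stated limit allows.
sample = PreferenceSample(
    task_goal="Refund a duplicate charge",
    context="Customer was billed twice for the same order.",
    available_tools=["lookup_order", "issue_refund", "handoff_to_human"],
    business_constraints=["Refunds above 100 require human approval"],
    chosen=ToolCall("handoff_to_human", {"reason": "refund above limit"}),
    rejected=ToolCall("issue_refund", {"amount": 250}),
    failure_reason="Issued a refund above the approval limit",
    human_review="approved",
    source="production",
)
record = sample.to_dpo_record()
```

Keeping the failure reason, review result, and source on the record, even though they are not exported into the training pair, leaves room to filter or audit samples later without going back to the raw logs.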
Evaluation metrics
Do not rely on preference win rate alone. Evaluation of an enterprise agent should also cover tool-selection accuracy, parameter validity, task success, refusal quality, human-handoff behavior, and potential side effects.
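A minimal sketch of how several of these metrics could be reported side by side follows. The EvalResult fields, the summarize helper, and the four example rows are hypothetical and made up for illustration; refusal quality is left out because it usually needs a rubric or human scoring rather than a boolean flag.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Per-episode evaluation outcome (hypothetical fields)."""
    preferred_chosen: bool   # model ranked the chosen behavior above the rejected one
    expected_tool: str
    predicted_tool: str
    params_valid: bool       # arguments passed schema and business-rule checks
    task_succeeded: bool
    side_effect: bool        # an unintended write or action was observed
    handoff_expected: bool
    handoff_made: bool


def summarize(results: list[EvalResult]) -> dict[str, float]:
    """Aggregate per-episode flags into rates over all episodes."""
    n = len(results)
    if n == 0:
        raise ValueError("no evaluation results to summarize")

    def rate(flags) -> float:
        return sum(1 for f in flags if f) / n

    return {
        "preference_win_rate": rate(r.preferred_chosen for r in results),
        "tool_selection_accuracy": rate(r.predicted_tool == r.expected_tool for r in results),
        "parameter_validity": rate(r.params_valid for r in results),
        "task_success": rate(r.task_succeeded for r in results),
        "side_effect_rate": rate(r.side_effect for r in results),
        "handoff_accuracy": rate(r.handoff_made == r.handoff_expected for r in results),
    }


# Made-up rows: preference looks healthy while task success lags behind it.
results = [
    EvalResult(True, "issue_refund", "issue_refund", True, True, False, False, False),
    EvalResult(True, "issue_refund", "cancel_order", True, False, True, False, False),
    EvalResult(True, "lookup_order", "lookup_order", False, False, False, False, False),
    EvalResult(False, "handoff_to_human", "handoff_to_human", True, True, False, True, True),
]
summary = summarize(results)
```

In this made-up data the preference win rate is 0.75 while task success is only 0.5, which is the kind of gap a win-rate-only report would hide.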
To add later
When real project evidence is available, replace this section with model version, sample scale, training/evaluation setup, baseline, measured results, failure samples, and reusable lessons.