DPO for Tool-use Preference

A framework for turning tool-use failure cases ("badcases") into preference data and interfaces for post-training experiments.
  • DPO
  • Tool-use
  • Preference Data

Goal

Validate whether production tool-use failures can be systematically converted into chosen/rejected preference pairs and used for post-training or tool-policy optimization.
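
For orientation, here is a minimal sketch of how one chosen/rejected pair would enter a DPO objective. It uses the standard DPO loss, the negative log-sigmoid of the scaled difference between the policy's and the reference model's log-probability margins. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions, not settings taken from this project.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss from summed sequence log-probabilities.

    beta is a hypothetical default; tune it per experiment.
    """
    # How much more the policy prefers each response than the reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(z)) == log(1 + exp(-z)), written in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

When the policy matches the reference on both responses the loss equals log 2, and it decreases as the policy shifts probability toward the chosen tool-use behavior and away from the rejected one.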

Sample design

Each sample should record the task goal, context, available tools, business constraints, chosen behavior, rejected behavior, failure reason, and human review result. Samples can come from production badcases, hand-crafted boundary cases, and replays of high-frequency workflows.
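
The fields listed above can be sketched as a small record type. This is one possible layout, not the project's actual schema: the class name `ToolUsePreferenceSample`, the field names, the `source` and `review_result` values, and the prompt template are all assumptions made for illustration. The `prompt`/`chosen`/`rejected` keys follow a naming that preference-training tooling commonly uses.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ToolUsePreferenceSample:
    task_goal: str
    context: str
    available_tools: List[str]
    business_constraints: List[str]
    chosen: str            # preferred behavior, e.g. a serialized tool call
    rejected: str          # the observed or constructed failure
    failure_reason: str
    review_result: str     # hypothetical values: "approved", "rejected", "pending"
    source: str = "production_badcase"  # or "boundary_case", "workflow_replay"

    def is_trainable(self) -> bool:
        # Only human-approved samples with a real contrast are usable for training.
        return self.review_result == "approved" and self.chosen != self.rejected

    def to_preference_pair(self) -> Dict[str, str]:
        # Hypothetical prompt template; adapt it to the model's chat format.
        prompt = "\n".join([
            f"Goal: {self.task_goal}",
            f"Context: {self.context}",
            f"Tools: {', '.join(self.available_tools)}",
            f"Constraints: {'; '.join(self.business_constraints)}",
        ])
        return {"prompt": prompt, "chosen": self.chosen, "rejected": self.rejected}
```

Keeping `failure_reason`, `review_result`, and `source` on the record, even though they are not part of the exported pair, makes it possible to filter and audit the training set later.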

Evaluation metrics

Do not rely on preference win rate alone. Evaluation of an enterprise agent should also cover tool-selection accuracy, parameter validity, task success, refusal quality, human handoff behavior, and potential side effects.
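
A rough sketch of how these metrics could be aggregated from per-episode evaluation records follows. The `EpisodeResult` fields and the metric definitions are assumptions chosen for illustration. In particular, refusal quality and handoff are reduced here to simple agreement with a labeled expectation, which a real evaluation would likely refine.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class EpisodeResult:
    expected_tool: Optional[str]  # None means no tool call was expected
    called_tool: Optional[str]
    params_valid: bool            # only meaningful when a tool was called
    task_succeeded: bool
    should_refuse: bool
    refused: bool
    should_handoff: bool
    handed_off: bool
    had_side_effect: bool         # e.g. an unintended write or charge


def rate(flags: Iterable[bool]) -> float:
    values = list(flags)
    return sum(values) / len(values) if values else float("nan")


def summarize(results: List[EpisodeResult]) -> Dict[str, float]:
    called = [r for r in results if r.called_tool is not None]
    return {
        "tool_selection_accuracy": rate(r.called_tool == r.expected_tool for r in results),
        "parameter_validity": rate(r.params_valid for r in called),
        "task_success": rate(r.task_succeeded for r in results),
        "refusal_accuracy": rate(r.refused == r.should_refuse for r in results),
        "handoff_accuracy": rate(r.handed_off == r.should_handoff for r in results),
        "side_effect_rate": rate(r.had_side_effect for r in results),
    }
```

Running `summarize` on the same episodes before and after training gives a per-metric comparison alongside the preference win rate, so a gain in win rate that comes with a higher side-effect rate is visible.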

To add later

When real project evidence is available, replace this section with the model version, sample scale, training and evaluation setup, baseline, measured results, failure samples, and reusable lessons.