Framework

Agent Data Flywheel and Post-training Interface Framework

Organizing production badcases, evaluation samples, preference data, and training interfaces into a continuous improvement loop.
  • Preference Data
  • DPO
  • Evaluation

What this case should prove

This case should show that prompt tweaks alone do not improve Agent quality: production failures need to become reviewable, evaluable, labelable, and trainable data assets. It supports applications for roles in post-training data loops, evaluation systems, LLMOps, and Agent platform leadership.

The positioning is not "foundation model trainer." It is the bridge between application feedback and model-side improvement.

Badcase taxonomy framework

Useful categories include:

  • Tool selection errors: wrong tool, missing tool, repeated tool calls.
  • Parameter errors: field mapping, units, time range, or permission parameters.
  • Task decomposition errors: poor plan order, missing subtasks, or failure recovery gaps.
  • Knowledge and context errors: missing retrieval, polluted context, or long-context forgetting.
  • Business rule errors: workflow, approval, compliance, risk, or customer policy violations.
  • Output quality errors: hallucination, poor refusal, formatting errors, or unactionable advice.
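
As a rough illustration, the categories above could be encoded as a small label schema so that every badcase carries a validated category and subtype. This is a minimal sketch, not a prescribed design: the names `BadcaseCategory`, `SUBTYPES`, and `BadcaseLabel`, and all subtype strings, are hypothetical and simply mirror the bullets.

```python
from dataclasses import dataclass
from enum import Enum


class BadcaseCategory(Enum):
    """Top-level categories, mirroring the list above."""
    TOOL_SELECTION = "tool_selection"
    PARAMETER = "parameter"
    TASK_DECOMPOSITION = "task_decomposition"
    KNOWLEDGE_CONTEXT = "knowledge_context"
    BUSINESS_RULE = "business_rule"
    OUTPUT_QUALITY = "output_quality"


# Hypothetical subtype tags per category; the strings are illustrative only.
SUBTYPES = {
    BadcaseCategory.TOOL_SELECTION: {"wrong_tool", "missing_tool", "repeated_call"},
    BadcaseCategory.PARAMETER: {"field_mapping", "units", "time_range", "permission"},
    BadcaseCategory.TASK_DECOMPOSITION: {"plan_order", "missing_subtask", "no_recovery"},
    BadcaseCategory.KNOWLEDGE_CONTEXT: {"missing_retrieval", "polluted_context", "forgetting"},
    BadcaseCategory.BUSINESS_RULE: {"workflow", "approval", "compliance", "risk", "policy"},
    BadcaseCategory.OUTPUT_QUALITY: {"hallucination", "poor_refusal", "formatting", "unactionable"},
}


@dataclass(frozen=True)
class BadcaseLabel:
    """One label attached to a failed trace."""
    category: BadcaseCategory
    subtype: str

    def __post_init__(self):
        # Reject subtypes that do not belong to the chosen category.
        if self.subtype not in SUBTYPES[self.category]:
            raise ValueError(f"{self.subtype!r} is not a subtype of {self.category.value}")
```

Validating the subtype at construction time keeps labels consistent across reviewers, which matters once the labels are used for routing and reporting.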

Data loop framework

A mature loop can be described as:

  1. Trace capture: user goal, context, tools, parameters, intermediate states, and final result.
  2. Failure attribution: standardize badcase categories across model, tool, data, product, and workflow issues.
  3. Sample routing: send samples into replay evals, golden sets, preference pairs, or manual review pools.
  4. Labeling rules: define chosen/rejected, reason tags, risk levels, and business constraints.
  5. Training interface: feed labeled samples into SFT, DPO or other preference optimization, or prompt/tool-schema improvements.
  6. Release gates: use replay sets and online metrics to confirm actual improvement.
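
Steps 1 through 5 can be sketched as a trace record, a routing function, and a converter that emits the prompt/chosen/rejected layout commonly consumed by DPO-style trainers. Everything here is an assumption made for illustration: the `Trace` fields, the pool names, and the routing rules are hypothetical and would be replaced by the real project's schema and policies.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Trace:
    """Step 1: one captured Agent run (all field names are assumptions)."""
    trace_id: str
    user_goal: str
    context: str
    tool_calls: list = field(default_factory=list)  # e.g. [{"tool": ..., "params": ...}]
    final_result: str = ""
    failure_category: Optional[str] = None  # step 2; None if the run succeeded
    risk_level: str = "low"                 # step 4; "low" or "high"
    corrected_result: Optional[str] = None  # reviewer-approved answer, if any


def route(trace: Trace) -> str:
    """Step 3: pick a destination pool for a trace (rules are illustrative)."""
    if trace.failure_category is None:
        return "golden_set"        # successful run kept as a regression reference
    if trace.risk_level == "high":
        return "manual_review"     # risky failures go to a human first
    if trace.corrected_result is not None:
        return "preference_pairs"  # failure with a known better answer
    return "replay_eval"           # failure without a correction yet


def to_preference_pair(trace: Trace) -> dict:
    """Steps 4-5: emit a prompt/chosen/rejected record with reason tags."""
    if trace.corrected_result is None:
        raise ValueError("a corrected result is required to build a pair")
    return {
        "prompt": f"{trace.context}\n\nGoal: {trace.user_goal}",
        "chosen": trace.corrected_result,
        "rejected": trace.final_result,
        "reason_tag": trace.failure_category,
        "risk_level": trace.risk_level,
    }
```

For example, a low-risk failed trace that a reviewer has corrected would be routed to `"preference_pairs"` and converted into one training record, while a high-risk failure would wait in `"manual_review"` regardless of whether a correction exists.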

Evidence to add later

When real project details are ready, add:

  • Badcase volume, taxonomy size, labeling roles, and quality-control process.
  • Evaluation set size, task coverage, replay pass rate, and regressions caught.
  • Measured impact after the data fed into prompt, tool-schema, RAG, model-routing, or training experiments.
  • Collaboration workflow with ML, product, and engineering teams.
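
Two of the metrics listed above, replay pass rate and regressions caught, could be computed by comparing a baseline replay run with a candidate run. The sketch below assumes each run is a plain mapping from a case id to a pass/fail boolean; the function names and the `max_regressions` threshold are hypothetical, and a real release gate would also consider online metrics.

```python
def replay_pass_rate(results: dict) -> float:
    """Fraction of replay cases that passed; results maps case_id -> bool."""
    if not results:
        raise ValueError("replay set is empty")
    return sum(results.values()) / len(results)


def regressions(baseline: dict, candidate: dict) -> list:
    """Case ids that passed on the baseline but fail (or are missing) on the candidate."""
    return sorted(
        case_id
        for case_id, passed in baseline.items()
        if passed and not candidate.get(case_id, False)
    )


def gate(baseline: dict, candidate: dict, max_regressions: int = 0) -> bool:
    """Allow release only if pass rate does not drop and regressions stay within budget."""
    return (
        replay_pass_rate(candidate) >= replay_pass_rate(baseline)
        and len(regressions(baseline, candidate)) <= max_regressions
    )
```

Tracking regressions separately from the aggregate pass rate matters because a candidate can raise the overall rate while still breaking cases that previously worked.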

Interview angle

The core message: you can turn production Agent failures into a shared data language for product, engineering, and ML teams, making quality improvement sample-driven, measurable, and release-gated.