
On-Policy Distillation

Training Students on Their Own Mistakes

Author: pprp | Published: 2026-05-22 | Revised: 2026-05-22 | 1 min read

This slide deck introduces On-Policy Distillation (OPD): a family of distillation methods where the student learns from trajectories, states, prompts, or mistakes that arise from its own policy rather than only imitating static teacher data.
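To make the definition concrete, here is a minimal toy sketch of the on-policy idea: the student generates its own rollout, and the teacher is queried only on the prefixes the student actually visited. Everything in it is an illustrative assumption rather than the deck's method: the bigram lookup-table "models", the names `toy_policy`, `sample_rollout`, and `on_policy_distill_loss`, and the choice of per-token reverse KL as the divergence (one common choice; other OPD variants use different signals).

```python
import math
import random

VOCAB = ["a", "b", "c"]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_policy(logit_table):
    """Hypothetical stand-in for a language model: the next-token
    distribution depends only on the previous token (a bigram table)."""
    def next_token_probs(prefix):
        key = prefix[-1] if prefix else "<bos>"
        return softmax(logit_table[key])
    return next_token_probs

def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher) for a single next-token distribution."""
    return sum(p * math.log(p / q)
               for p, q in zip(student_probs, teacher_probs) if p > 0)

def sample_rollout(policy, length, rng):
    """Sample a token sequence from the given policy."""
    prefix = []
    for _ in range(length):
        probs = policy(prefix)
        prefix.append(rng.choices(VOCAB, weights=probs, k=1)[0])
    return prefix

def on_policy_distill_loss(student, teacher, length, rng):
    """Average per-token reverse KL, evaluated on a rollout sampled
    from the STUDENT (on-policy), not on static teacher data."""
    rollout = sample_rollout(student, length, rng)
    per_token = [reverse_kl(student(rollout[:t]), teacher(rollout[:t]))
                 for t in range(length)]
    return rollout, sum(per_token) / length

# Illustrative logit tables (made-up numbers, not from the deck).
teacher = toy_policy({"<bos>": [2.0, 0.0, 0.0], "a": [0.0, 2.0, 0.0],
                      "b": [0.0, 0.0, 2.0], "c": [2.0, 0.0, 0.0]})
student = toy_policy({"<bos>": [0.5, 0.5, 0.0], "a": [0.0, 0.5, 0.5],
                      "b": [0.5, 0.0, 0.5], "c": [0.0, 0.0, 0.0]})

if __name__ == "__main__":
    rollout, loss = on_policy_distill_loss(student, teacher, 8, random.Random(0))
    print("student rollout:", rollout)
    print("mean reverse KL on student-visited prefixes:", loss)
```

In a real training loop this scalar would be differentiated with respect to the student's parameters. The sketch only shows where the training states come from: the student's own samples, including prefixes the teacher would rarely produce.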

It walks through the motivation, definition, taxonomy, white-box and black-box settings, on-policy self-distillation, OPD-RL hybrids, speculative-decoding distillation, practical recipes, failure modes, cost tradeoffs, and open problems.


Reference

@misc{dong2026opd,
    author = {Dong, Peijie},
    title = {On-Policy Distillation: Training Students on Their Own Mistakes},
    year = {2026},
    month = may,
    day = {22},
    howpublished = {\url{https://pprp.github.io/tech/opd/}},
    url = {https://pprp.github.io/tech/opd/},
    urldate = {2026-05-22},
    note = {Blog post with PDF slides. Accessed: 2026-05-22},
    language = {English}
}