On-Policy Distillation

Training Students on Their Own Mistakes

Author: pprp | Published: May 22, 2026 | Revised: July 15, 2026 | 2 min read

Tags: model compression, distillation, on-policy, LLM systems

On-policy distillation (OPD) changes one crucial part of the usual teacher-student recipe: the student, not the teacher, generates the trajectories used for training. The teacher then evaluates those student-visited states with token-level distributions, sequence-level scores, verbal feedback, or another supervision signal.

That difference matters because conventional off-policy supervised fine-tuning teaches the student on clean teacher prefixes, while inference forces it to continue from its own imperfect outputs. Once the student makes an error, it may enter states that never appeared in training. OPD closes that train-inference gap by putting teacher supervision directly on the student’s own trajectories.

The core idea

An OPD method has two defining ingredients:

1. The student samples its own trajectory, fully or as part of a teacher-student mixture.
2. A teacher supplies supervision on that trajectory, from dense logits to a scalar or verbal judgment.

The compact mental model is: the student explores; the teacher scores. This makes OPD especially relevant for long-form reasoning, code generation, and agentic tasks, where small early errors can change every state that follows.
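The two ingredients above can be sketched in a few lines of plain Python. This is a toy illustration, not a training recipe: `TEACHER` and `STUDENT` are hypothetical bigram-style policies over a three-token vocabulary, and the per-token reverse KL is just one of the supervision signals the post mentions (token-level distributions). A real setup would use language models and backpropagate through the student.

```python
import math
import random

VOCAB = ["a", "b", "c"]  # hypothetical toy vocabulary


def softmax(logits):
    """Turn a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_policy(table):
    """Build a next-token distribution that depends only on the last token."""
    def next_token_probs(prefix):
        last = prefix[-1] if prefix else "<bos>"
        return softmax(table[last])
    return next_token_probs


# Assumed toy models: the teacher is confident, the student is not.
TEACHER = toy_policy({
    "<bos>": [2.0, 0.0, -1.0], "a": [0.0, 2.0, -1.0],
    "b": [-1.0, 0.0, 2.0], "c": [2.0, -1.0, 0.0],
})
STUDENT = toy_policy({
    "<bos>": [0.5, 0.5, 0.0], "a": [0.5, 0.5, 0.0],
    "b": [0.0, 0.5, 0.5], "c": [0.5, 0.0, 0.5],
})


def sample_rollout(policy, length, rng):
    """Ingredient 1: the student generates its own trajectory."""
    prefix = []
    for _ in range(length):
        probs = policy(prefix)
        prefix.append(rng.choices(VOCAB, weights=probs, k=1)[0])
    return prefix


def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher) for one next-token distribution."""
    return sum(s * math.log(s / t)
               for s, t in zip(student_probs, teacher_probs) if s > 0)


def opd_token_losses(student, teacher, rollout):
    """Ingredient 2: the teacher scores every state the student visited."""
    losses = []
    for t in range(len(rollout)):
        prefix = rollout[:t]  # the state before token t was emitted
        losses.append(reverse_kl(student(prefix), teacher(prefix)))
    return losses


rng = random.Random(0)
rollout = sample_rollout(STUDENT, length=5, rng=rng)
losses = opd_token_losses(STUDENT, TEACHER, rollout)
print(rollout, [round(x, 3) for x in losses])
```

The point of the sketch is that the prefixes passed to the teacher come from `STUDENT`, so the loss is measured exactly where the student actually goes, including states the teacher itself would rarely produce.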

How to place OPD among SFT and RL

Method	Who generates the training path?	Main supervision	Characteristic trade-off
Off-policy SFT	Teacher or static dataset	Target tokens	Simple and stable, but exposed to train-inference mismatch
On-policy distillation	Student or mixed policy	Teacher logits, scores, or feedback	Trains on realistic student states, but requires online teacher evaluation
Reinforcement learning	Student	Reward or verifier	Optimizes outcomes directly, but often provides a sparser learning signal

OPD is therefore not merely “smaller-model imitation.” It is an interactive training loop that can sit between supervised fine-tuning and reinforcement learning, or be combined with either.
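One way to make the comparison concrete is to write the three objectives side by side. The notation below is an assumption for illustration and does not come from the deck: $\pi_\theta$ is the student, $\pi_T$ the teacher, $\mathcal{D}$ a prompt or demonstration dataset, $R$ a reward or verifier score, and $\mathrm{Div}$ a token-level divergence such as forward or reverse KL. The OPD line covers only the dense, token-distribution variant; score-based or verbal-feedback variants would replace the divergence with another signal.

```latex
% Off-policy SFT: expectation over a fixed dataset of teacher or human outputs
\mathcal{L}_{\mathrm{SFT}}(\theta)
  = -\,\mathbb{E}_{(x,\,y)\sim\mathcal{D}}
    \sum_{t} \log \pi_\theta\!\left(y_t \mid x, y_{<t}\right)

% On-policy distillation: expectation over the student's own samples
\mathcal{L}_{\mathrm{OPD}}(\theta)
  = \mathbb{E}_{x\sim\mathcal{D},\; y\sim\pi_\theta(\cdot\mid x)}
    \sum_{t} \mathrm{Div}\!\left(
      \pi_\theta(\cdot \mid x, y_{<t}),\;
      \pi_T(\cdot \mid x, y_{<t})
    \right)

% Reinforcement learning: same sampling distribution, scalar outcome signal
J_{\mathrm{RL}}(\theta)
  = \mathbb{E}_{x\sim\mathcal{D},\; y\sim\pi_\theta(\cdot\mid x)}
    \left[ R(x, y) \right]
```

Read this way, OPD shares its sampling distribution with RL (the trajectories $y$ come from $\pi_\theta$) and its per-token supervision with SFT, which is why it can sit between the two.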

A reading map for the deck

The 37-slide deck is organized around four questions:

1. Why does on-policy data help? Motivation, exposure bias, the DAgger connection, and the information-density argument.
2. What counts as OPD? A strict definition and a six-family taxonomy covering white-box, black-box, self-distillation, iterative, OPD-RL, and speculative-decoding settings.
3. Which design choices matter? Forward versus reverse KL, rollout mixing, privileged context, token selection, reward shaping, and cross-tokenizer alignment.
4. How would you use it? Default pipelines, practical recipes, failure modes, compute and memory costs, and open research problems.
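Two of the design choices listed above, forward versus reverse KL and rollout mixing, are easy to illustrate with a small stdlib-only sketch. Everything here is an assumption made for illustration: the three-way distributions are made up, and `pick_rollout_source` with its `student_fraction` parameter is a hypothetical helper, not an API from the deck or any library.

```python
import math
import random


def kl(p, q):
    """KL(p || q) for discrete distributions; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def forward_kl(teacher, student):
    """KL(teacher || student): penalizes the student for missing teacher mass."""
    return kl(teacher, student)


def reverse_kl(teacher, student):
    """KL(student || teacher): penalizes student mass where the teacher has little."""
    return kl(student, teacher)


# Made-up next-token distributions over three tokens.
teacher = [0.495, 0.01, 0.495]   # two strong modes
covering = [0.34, 0.32, 0.34]    # student that spreads mass everywhere
seeking = [0.98, 0.01, 0.01]     # student that commits to one mode

for name, student in [("covering", covering), ("seeking", seeking)]:
    print(name,
          "forward:", round(forward_kl(teacher, student), 3),
          "reverse:", round(reverse_kl(teacher, student), 3))


def pick_rollout_source(student_fraction, rng):
    """Rollout mixing: choose whose trajectory supplies the next training prefix.

    student_fraction=1.0 is fully on-policy, 0.0 is fully off-policy.
    """
    return "student" if rng.random() < student_fraction else "teacher"


rng = random.Random(0)
sources = [pick_rollout_source(0.5, rng) for _ in range(10)]
print(sources)
```

On these numbers the forward KL is lower for the spread-out student and the reverse KL is lower for the single-mode student, which is the usual mode-covering versus mode-seeking contrast. The mixing helper shows that "on-policy" can be a dial rather than a switch.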

If you are new to the topic, start with slides 3-10 for the motivation and taxonomy, then slides 11-18 for the main training families. Slides 25-29 turn the survey into implementation guidance; slides 30-36 cover theory, failure modes, cost, and open problems.

Slides and full taxonomy

Open PDF Download PDF
