We study policy distillation under privileged information, where a student policy with only partial observations must learn from a teacher with full-state access. A key challenge is \(\textit{information asymmetry}\): the student cannot directly access the teacher’s state space, leading to distributional shifts and policy degradation. Existing approaches either modify the teacher to produce realizable but sub-optimal demonstrations or rely on the student to explore missing information independently, both of which are inefficient. Our key insight is that the student should strategically interact with the teacher, querying only when necessary and resetting from recovery states, so that it stays on a recoverable path within its own observation space. We introduce two methods: (i) an imitation learning approach that adaptively determines \(\textit{when}\) the student should query the teacher for corrections, and (ii) a reinforcement learning approach that selects \(\textit{where}\) to initialize training for efficient exploration. We validate our methods in both simulated and real-world robotic tasks, demonstrating significant improvements over standard teacher-student baselines in training efficiency and final performance.
\(\texttt{CritiQ}\) queries the teacher only at critical states, i.e., states where the student is about to take an action from which teacher supervision can no longer recover it (e.g., choosing an incorrect box). This ensures that the student receives the correction data it needs while avoiding excessive aliasing, i.e., conflicting teacher labels for states that look identical to the student.
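A minimal sketch of this query rule is given below, assuming a hypothetical \texttt{is\_recoverable} predicate and simplified policy/environment interfaces that are not specified in the paper:
\begin{verbatim}
# Illustrative sketch of a CritiQ-style query rule (not the authors' code).
# `student`, `teacher`, `env`, and `is_recoverable` are assumed interfaces.
def collect_with_critical_queries(student, teacher, env, is_recoverable, horizon):
    """Roll out the student; query the teacher only at critical states."""
    dataset = []              # (partial observation, teacher action) pairs
    obs, state = env.reset()  # obs: student's partial view; state: privileged full state
    for _ in range(horizon):
        action = student.act(obs)
        if not is_recoverable(state, action):
            # Critical state: the intended action would leave the recoverable
            # region (e.g., committing to the wrong box), so ask the teacher
            # for a corrective label and execute it instead.
            action = teacher.act(state)
            dataset.append((obs, action))
        obs, state = env.step(action)
    return dataset
\end{verbatim}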
\(\texttt{ReTRy}\) iteratively refines the reset distribution. Instead of resetting only to teacher-visited states, \(\texttt{ReTRy}\) rolls out the teacher from states visited by the student, augmenting the set of reset states.
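A minimal sketch of this refinement step, assuming simulator hooks (\texttt{env.set\_state}, \texttt{env.step\_state}) and a teacher policy with full-state access; the names are placeholders rather than the paper's API:
\begin{verbatim}
# Illustrative sketch of ReTRy-style reset-set refinement (not the authors' code).
# `teacher` and the `env.set_state` / `env.step_state` hooks are assumed interfaces.
def refine_reset_states(reset_states, student_visited_states, teacher, env, rollout_len):
    """Augment the reset distribution with teacher rollouts launched from
    states that the student actually visited."""
    augmented = list(reset_states)
    for start in student_visited_states:
        state = env.set_state(start)       # restore the simulator to a student-visited state
        for _ in range(rollout_len):
            action = teacher.act(state)    # teacher acts with privileged full-state access
            state = env.step_state(action)
            augmented.append(state)        # teacher-visited states become new reset candidates
    return augmented
\end{verbatim}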
We tested our algorithms on three policy distillation tasks under information asymmetry.
Our experiments demonstrate that the proposed methods outperform the baselines on all tasks. While both \(\texttt{CritiQ}\) and \(\texttt{ReTRy}\) achieved \(100\%\) task completion on most tasks, \(\texttt{CritiQ}\) suffers from compounding errors in long-horizon tasks where high precision is required (e.g., the drawer task), whereas \(\texttt{ReTRy}\) achieved \(100\%\) task completion on all tasks.
\(\texttt{BC}\): Commits to one of the demonstrations the teacher showed, so it succeeds only when the hidden case matches that demonstration, yielding a success rate of \(\frac{1}{C}\) (see the short derivation after this list), where \(C\) is the number of possible cases of the hidden information.
\(\texttt{DAgger}\): Ends up oscillating because it aggregates conflicting teacher actions for the same observed state.
\(\texttt{CritiQ}\): Searches over multiple possible cases.
\(\texttt{ReTRy}\): Searches over all possible cases.
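The \(\frac{1}{C}\) rate quoted for \(\texttt{BC}\) follows from a one-line argument, assuming the \(C\) hidden cases are equally likely (an assumption not stated explicitly above):
\[
P(\text{success}) \;=\; \sum_{c=1}^{C} P(\text{case } c)\,\mathbb{1}\!\left[\text{committed demonstration matches case } c\right]
\;=\; \frac{1}{C}\cdot 1 \;+\; \frac{C-1}{C}\cdot 0 \;=\; \frac{1}{C}.
\]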
For the drawer-finding task, data augmentation is performed every \(k\) epochs for \(\texttt{CritiQ}\) and twice over the course of training for \(\texttt{ReTRy}\). The dataset for \(\texttt{BC}\) remains fixed, while \(\texttt{DAgger}\) aggregates data at every step. The figure on the left shows the exploratory search level of each algorithm over training time. The results indicate that \(\texttt{CritiQ}\) reached a medium search level within 100 minutes of training, whereas \(\texttt{ReTRy}\) reached a medium search level around 400 minutes and a high search level around 500 minutes. While \(\texttt{BC}\) converges to low-level search behavior within 20 minutes, \(\texttt{DAgger}\)'s search level initially reaches a low level but then drops to none due to indiscriminate data augmentation. This result shows that strategic interaction with the teacher for data augmentation is important for achieving exploratory behavior.
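The contrast between these augmentation schedules can be summarized in a small sketch; the value of \(k\), the two \(\texttt{ReTRy}\) refinement epochs, and the function name are placeholders rather than values reported above:
\begin{verbatim}
# Illustrative comparison of the augmentation schedules (placeholder values).
def should_augment(method, epoch, step, k=5, retry_epochs=(10, 20)):
    """Return whether a method adds teacher data at this point in training."""
    if method == "BC":
        return False                                 # dataset fixed after the initial demos
    if method == "DAgger":
        return True                                  # aggregates teacher labels at every step
    if method == "CritiQ":
        return step == 0 and epoch % k == 0          # augments every k epochs
    if method == "ReTRy":
        return step == 0 and epoch in retry_epochs   # augments twice over training
    raise ValueError(f"unknown method: {method}")
\end{verbatim}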