Multi-Turn Code Generation Through Single Step Rewards

Mila-Quebec AI Institute, Université de Montréal, Cornell University

\( \mu \texttt{Code} \) is a simple and scalable method for multi-turn code generation that leverages learned verifiers.


Abstract

We address the problem of code generation from multi-turn execution feedback. Existing methods either generate code without feedback or use complex, hierarchical reinforcement learning to optimize multi-turn rewards. We propose a simple yet scalable approach, \( \mu \texttt{Code} \), that solves multi-turn code generation using only single-step rewards. Our key insight is that code generation is a one-step recoverable MDP, where the correct code can be recovered from any intermediate code state in a single turn. \( \mu \texttt{Code} \) iteratively trains both a generator to provide code solutions conditioned on multi-turn execution feedback and a verifier to score the newly generated code. Experimental evaluations show that our approach achieves significant improvements over state-of-the-art generator-verifier baselines. We analyze the design choices of the reward models and policy, and show the efficacy of \( \mu \texttt{Code} \) at utilizing execution feedback.


\( \mu \texttt{Code} \) Overview

\( \mu \texttt{Code} \) follows an expert iteration framework in which a learned verifier provides a local search expert. At each iteration, \( \mu \texttt{Code} \) trains two components: 1) a learned verifier to score responses, and 2) a generator that produces code solutions by imitating the local search. A minimal sketch of one training iteration is given below.
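The Python sketch below illustrates one such training iteration under stated assumptions: the helper names (ucode_iteration, generate, run_public_tests, verifier_score) are hypothetical placeholders rather than the paper's API, the verifier labels are assumed to come from public-test pass/fail outcomes, and the local search here selects the best candidate with the verifier score.

from typing import Callable, List, Tuple

def ucode_iteration(
    problems: List[dict],
    generate: Callable[[dict, List[str], int], List[str]],      # (problem, feedback history, N) -> candidate programs
    run_public_tests: Callable[[dict, str], Tuple[bool, str]],  # (problem, code) -> (passed, execution feedback)
    verifier_score: Callable[[dict, str], float],               # learned verifier: (problem, code) -> score
    num_candidates: int = 5,
    num_turns: int = 3,
):
    """One expert-iteration round: collect verifier labels and imitation targets."""
    verifier_data, generator_data = [], []

    for problem in problems:
        feedback_history: List[str] = []
        for _ in range(num_turns):
            # Sample N candidate solutions conditioned on prior execution feedback.
            candidates = generate(problem, feedback_history, num_candidates)

            # Execute on public tests; pass/fail gives a single-step reward label
            # used to train the verifier.
            labeled = []
            for code in candidates:
                passed, feedback = run_public_tests(problem, code)
                labeled.append((code, passed, feedback))
                verifier_data.append((problem, code, float(passed)))

            # Local search: keep the candidate the learned verifier ranks highest.
            best_code, best_passed, best_feedback = max(
                labeled, key=lambda item: verifier_score(problem, item[0])
            )

            # The generator is later fine-tuned to imitate this best candidate,
            # given the same multi-turn feedback context.
            generator_data.append((problem, list(feedback_history), best_code))

            if best_passed:
                break
            feedback_history.append(best_feedback)

    # Outside this sketch: fit the verifier on verifier_data (e.g., a binary
    # pass/fail objective) and fine-tune the generator on generator_data.
    return verifier_data, generator_data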


Multi-turn Best-of-N Search at Test-time

At inference time, \( \mu \texttt{Code} \) improves solutions over successive turns with multi-turn Best-of-N (BoN) search. At each turn, the generator produces N candidate solutions, which are ranked by the verifier (orange marks the chosen solution). The selected solution is run on the public tests, and the execution feedback is used to improve it in the next turn. Our learned verifier assigns high scores to solutions with fewer errors (highlighted in yellow). A minimal sketch of this procedure is shown below.
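The following Python sketch illustrates the multi-turn BoN loop, again with hypothetical helper names (generate, verifier_score, run_public_tests) standing in for the actual models and executor.

from typing import Callable, List, Optional, Tuple

def multi_turn_best_of_n(
    problem: dict,
    generate: Callable[[dict, List[str], int], List[str]],      # (problem, feedback history, N) -> candidates
    verifier_score: Callable[[dict, str], float],               # learned verifier score for a candidate
    run_public_tests: Callable[[dict, str], Tuple[bool, str]],  # (problem, code) -> (passed, feedback)
    num_candidates: int = 5,
    max_turns: int = 3,
) -> Optional[str]:
    feedback_history: List[str] = []
    best_so_far: Optional[str] = None

    for _ in range(max_turns):
        # Sample N candidates and rank them with the learned verifier.
        candidates = generate(problem, feedback_history, num_candidates)
        best_so_far = max(candidates, key=lambda code: verifier_score(problem, code))

        # Run the chosen candidate on the public tests; stop if it passes,
        # otherwise feed the execution feedback into the next turn.
        passed, feedback = run_public_tests(problem, best_so_far)
        if passed:
            return best_so_far
        feedback_history.append(feedback)

    return best_so_far  # best candidate from the final turn if none passed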


Results

Our experiments demonstrate that all multi-turn methods perform better with the proposed multi-turn BoN search. Best-of-N (BoN) accuracy is computed with N=5 solutions, where a combination of the public tests and the learned verifier is used for selection (one plausible combination is sketched below). Our proposed method \( \mu \texttt{Code} \) outperforms competing methods.
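One plausible way to combine the two selection signals, shown purely as an illustrative assumption (the paper's exact rule may differ), is to rank candidates by the number of public tests they pass and break ties with the verifier score. The helper names below are hypothetical.

from typing import Callable, List

def select_candidate(
    problem: dict,
    candidates: List[str],
    public_tests_passed: Callable[[dict, str], int],  # hypothetical: number of public tests passed
    verifier_score: Callable[[dict, str], float],     # learned verifier score
) -> str:
    # Prefer candidates that pass more public tests; use the verifier to break ties.
    return max(
        candidates,
        key=lambda code: (public_tests_passed(problem, code), verifier_score(problem, code)),
    )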




We conduct a component-wise ablation study where we:
1) Compare public tests (PT) and the learned verifier (LV) for multi-turn BoN search at test time (left)
2) Evaluate how well generators incorporate execution feedback, where \( \mu \texttt{Code} \) consistently improves performance (top right)
3) Assess inference-time scaling behavior with the number of candidate generations (N) at each turn (bottom right)


Paper

BibTeX


@article{jain2025multi,
    title={Multi-Turn Code Generation Through Single Step Rewards},
    author={Arnav Kumar Jain and Gonzalo Gonzalez-Pumariega and Wayne Chen and Alexander M Rush and Wenting Zhao and Sanjiban Choudhury},
    journal={CoRR},
    volume={abs/2502.20380},
    year={2025}
}