Qwen 2.5, RL, and Random Rewards
Best AI papers explained - A podcast by Enoch H. Kang

We investigate how various reward signals, even spurious and random ones, affect the performance of different language models fine-tuned for mathematical reasoning with Reinforcement Learning with Verifiable Rewards (RLVR). The research demonstrates that while Qwen models improve substantially even under weak or incorrect rewards, the benefit is not universal: Llama and OLMo models show little to no gain. The study links this disparity to pre-existing reasoning patterns, particularly the Qwen models' propensity for code reasoning, and suggests that RLVR primarily amplifies existing useful behaviors rather than teaching entirely new skills. Finally, it explores why random rewards are effective for Qwen models, finding that biases in the optimization algorithm, such as clipping, reinforce high-probability, pre-existing reasoning strategies.
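
To make the clipping intuition concrete, here is a minimal sketch of a PPO-style clipped surrogate objective, the kind of ratio clipping used in GRPO-family RLVR training. The episode itself contains no code; the function name, the `eps` value, and the toy tensors below are illustrative assumptions, not the paper's implementation. The hypothesized mechanism is that with a random reward the per-token advantage is effectively noise, yet the clip suppresses updates whenever the probability ratio strays outside [1 - eps, 1 + eps], which keeps the policy near its existing high-probability behaviors and tends to sharpen them.

```python
import torch

def clipped_surrogate(logp_new: torch.Tensor,
                      logp_old: torch.Tensor,
                      advantage: torch.Tensor,
                      eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped policy-gradient surrogate (a sketch).

    The clip caps the per-token probability ratio at [1 - eps, 1 + eps],
    so gradients vanish once the new policy drifts too far from the old
    one. Under a noisy (e.g. random) reward, this constrains updates to
    stay near, and tends to reinforce, the model's pre-existing
    high-probability behaviors.
    """
    ratio = torch.exp(logp_new - logp_old)                   # pi_new / pi_old per token
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return torch.min(unclipped, clipped).mean()              # surrogate to maximize

# Toy usage: a zero-mean random advantage stands in for a random reward.
logp_old = torch.log(torch.tensor([0.6, 0.3, 0.1]))
logp_new = (logp_old + 0.05).detach().requires_grad_()
loss = -clipped_surrogate(logp_new, logp_old, torch.randn(3))
loss.backward()                                              # gradients flow only inside the clip band
```

In this toy call, tokens whose ratio has already left the clip band contribute no gradient, which is the asymmetry the episode points to when explaining why random rewards can still move Qwen models toward their existing (code-flavored) reasoning strategies.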