L1: Length Controlled Reasoning with Reinforcement Learning
Best AI papers explained - A podcast by Enoch H. Kang - Tuesdays

This research paper introduces Length Controlled Policy Optimization (LCPO), a reinforcement learning technique that lets reasoning language models control the length of their generated chains of thought according to a user-specified constraint given in the prompt. By training a model called L1 with LCPO, the authors demonstrate precise control over reasoning length, enabling a smooth trade-off between inference cost and accuracy across a variety of tasks. Notably, L1 outperforms prior length-control methods and generalizes well to tasks unseen during training. The study further reveals a surprising finding: models trained for long reasoning also become strong short-chain-of-thought reasoners, even surpassing significantly larger models at comparable token budgets, suggesting a practical route to efficient, scalable inference-time reasoning.
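
To make the core idea concrete, here is a minimal sketch of a length-penalized reward in the spirit of LCPO: correctness minus a penalty on the gap between the generated length and the user-specified token budget. The function name, the exact penalty form, and the weight `alpha` are illustrative assumptions for this sketch, not the paper's implementation.

```python
def lcpo_style_reward(is_correct: bool, gen_tokens: int, target_tokens: int,
                      alpha: float = 0.0003) -> float:
    """Illustrative LCPO-style reward (assumed form, not the paper's code):
    reward correctness, and subtract a penalty proportional to how far the
    generated length strays from the user-specified token budget."""
    correctness = 1.0 if is_correct else 0.0
    length_penalty = alpha * abs(target_tokens - gen_tokens)
    return correctness - length_penalty

# Example: a correct answer that overshoots a 512-token budget by 200 tokens
# still scores well, but less than one that hits the budget exactly.
print(lcpo_style_reward(is_correct=True, gen_tokens=712, target_tokens=512))
print(lcpo_style_reward(is_correct=True, gen_tokens=512, target_tokens=512))
```

Optimizing such a reward with a policy-gradient method pushes the model to stay accurate while matching whatever length budget the prompt specifies, which is what enables the cost-accuracy trade-off described above.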