Active Preference Optimization for RLHF

Best AI papers explained - A podcast by Enoch H. Kang - Fridays

This document introduces Active Preference Optimization (APO), an algorithm designed to improve the sample efficiency of Reinforcement Learning from Human Feedback (RLHF) for Large Language Models (LLMs). The authors highlight the costly bottleneck of collecting high-quality human preference data in current RLHF pipelines, which often rely on uniform sampling of prompt-generation pairs and therefore achieve suboptimal alignment under a limited data budget; they show theoretically that uniform sampling can incur a constant suboptimality gap. APO addresses this by actively selecting the most informative samples to query for human feedback, formulating the problem as a contextual preference bandit. Theoretical analysis shows that APO achieves a suboptimality gap scaling as O(1/√T) with sample budget T, and empirical evaluations on practical datasets such as IMDb and Anthropic-HH validate its superior performance and sample efficiency over uniform-sampling baselines.
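
To make the active-selection idea concrete, here is a minimal sketch assuming a linear, Bradley-Terry-style reward model, where each candidate query (a prompt with two generations) is summarized by a feature-difference vector and the query with the largest uncertainty under the current design matrix is chosen. The variable names, the simulated feedback, and the exact acquisition rule are illustrative assumptions, not the paper's precise algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup (assumption): each candidate query is a prompt x with two
# generations a, a', summarized by the feature difference z = phi(x, a) - phi(x, a').
d, n_candidates, T, lam = 8, 200, 50, 1.0
candidates = rng.normal(size=(n_candidates, d))  # feature-difference vectors
theta_true = rng.normal(size=d)                  # hidden reward parameter (simulation only)

V = lam * np.eye(d)                              # regularized design matrix

for t in range(T):
    V_inv = np.linalg.inv(V)
    # Active selection: pick the candidate whose preference outcome the current
    # model is most uncertain about, measured by the norm z^T V^{-1} z.
    scores = np.einsum("id,dk,ik->i", candidates, V_inv, candidates)
    i = int(np.argmax(scores))
    z = candidates[i]

    # Simulated "human feedback": Bradley-Terry preference under the hidden reward.
    p_prefer = 1.0 / (1.0 + np.exp(-theta_true @ z))
    y = float(rng.random() < p_prefer)

    # Update the design matrix with the newly labeled pair.
    V += np.outer(z, z)

# A uniform-sampling baseline would instead draw i = rng.integers(n_candidates)
# at every step, spending the same budget T on less informative queries.
```

The uncertainty score z^T V^{-1} z is a standard design quantity in linear contextual bandits; maximizing it concentrates the limited feedback budget on directions the current reward estimate knows least about, which is the intuition behind APO's improvement over uniform sampling.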