Active Learning for Direct Preference Optimization
Best AI papers explained - A podcast by Enoch H. Kang - Fridays

This episode explores active learning strategies for Direct Preference Optimization (DPO), a method for aligning large language models (LLMs) with human preferences by optimizing the policy directly on preference feedback rather than through a separate reward model. The authors propose a framework and two algorithms, ADPO and ADPO+, that handle both online collection of new feedback and offline selection from an existing pool, with the goal of choosing the most informative preference pairs. Their approach linearizes the DPO objective at the final layer of the neural network and applies D-optimal design principles to decide which feedback to gather, and their theoretical analysis shows that logit estimation errors shrink as more feedback is collected. Empirical results on both simulated log-linear policies and real-world LLMs suggest that these active learning methods improve model performance by selecting better training data.
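For context, the objective being linearized is the standard DPO loss from the original DPO paper, where $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ is a frozen reference policy, $\beta$ is a temperature parameter, and $(x, y_w, y_l)$ is a prompt with its preferred and dispreferred responses:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```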
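To give a feel for how D-optimal design can guide feedback selection, here is a minimal sketch of a greedy log-determinant selector over per-example feature vectors (e.g., last-layer features). The function name `greedy_d_optimal`, the ridge regularizer, and the synthetic features are illustrative assumptions, not the authors' exact ADPO procedure:

```python
import numpy as np

def greedy_d_optimal(features: np.ndarray, budget: int, ridge: float = 1e-3) -> list[int]:
    """Greedily pick `budget` rows of `features` maximizing the log-determinant
    of the regularized information matrix ridge*I + sum_i x_i x_i^T.

    Illustrative sketch only: `features` stands in for per-example
    last-layer feature vectors, and the ridge term is an assumption.
    """
    n, d = features.shape
    A_inv = np.eye(d) / ridge          # inverse of the initial matrix ridge * I
    selected: list[int] = []
    remaining = set(range(n))
    for _ in range(budget):
        # By the matrix determinant lemma,
        # det(A + x x^T) = det(A) * (1 + x^T A^{-1} x),
        # so the best next point maximizes x^T A^{-1} x.
        best = max(remaining, key=lambda i: features[i] @ A_inv @ features[i])
        x = features[best]
        # Sherman-Morrison rank-one update of A^{-1} after adding x x^T.
        Ax = A_inv @ x
        A_inv -= np.outer(Ax, Ax) / (1.0 + x @ Ax)
        selected.append(best)
        remaining.remove(best)
    return selected

# Example: choose 16 informative examples from 1000 candidate feature vectors.
rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 32))
picked = greedy_d_optimal(feats, budget=16)
```

The greedy rule favors examples whose features are poorly covered by what has already been selected, which is the intuition behind using D-optimal design to choose informative preference queries.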