Theoretical guarantees on the best-of-n alignment policy

Best AI papers explained - A podcast by Enoch H. Kang

This paper critically examines the best-of-n policy, a common method for aligning generative language models in which $n$ samples are drawn from a reference policy and the one with the highest reward is returned. It disproves a widely used analytical formula for the KL divergence between the best-of-n policy and the reference, $\log n - (n-1)/n$, showing that this expression is in general only an upper bound. The authors characterize the conditions under which the bound is tight or loose and propose a new, more accurate estimator of the KL divergence. They also bound the win rate of the best-of-n policy against the reference from above and below, and compare best-of-n with another rejection sampling method, rewind-and-repeat, showing that best-of-n achieves a better trade-off between win rate and KL divergence.
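As a rough illustration (not code from the paper), the sketch below implements best-of-n sampling from a toy reference policy and evaluates the commonly cited expression $\log n - (n-1)/n$, which the paper shows is only an upper bound on the true KL divergence. The function names and the toy reward table are hypothetical.

```python
import math
import numpy as np

def best_of_n_sample(reference_sampler, reward_fn, n, rng):
    """Draw n candidates from the reference policy and keep the highest-reward one."""
    candidates = [reference_sampler(rng) for _ in range(n)]
    rewards = [reward_fn(c) for c in candidates]
    return candidates[int(np.argmax(rewards))]

def kl_formula(n):
    """The widely cited expression log(n) - (n-1)/n; per the paper, this is
    only an upper bound on KL(best-of-n || reference), not an exact identity."""
    return math.log(n) - (n - 1) / n

# Toy usage: a uniform reference policy over four responses with fixed rewards.
rng = np.random.default_rng(0)
responses = ["a", "b", "c", "d"]
reward = {"a": 0.1, "b": 0.7, "c": 0.4, "d": 0.9}

draw = best_of_n_sample(
    lambda r: responses[r.integers(len(responses))],
    reward.__getitem__,
    n=4,
    rng=rng,
)
print(draw)           # most often "d", the highest-reward response
print(kl_formula(4))  # analytical upper bound on the KL divergence for n = 4
```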