Shallow Preference Signals: Large Language Model Aligns Even Better with Truncated Data?
Best AI papers explained - A podcast by Enoch H. Kang

This academic paper investigates the phenomenon of shallow preference signals in large language models (LLMs): the information that determines human preference between responses is often concentrated in the early tokens. Experiments show that training reward models and Direct Preference Optimization (DPO) models on truncated preference datasets, keeping only the initial portion of each response, achieves performance comparable to, or even better than, training on the full datasets, suggesting meaningful efficiency gains. The research also explores decoding strategies that exploit this observation to improve the trade-off between alignment quality and computational cost, while acknowledging a key limitation: current alignment methods may yield only shallow alignment that optimizes the opening tokens of a response rather than aligning the full response with human values.
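To make the truncation idea concrete, here is a minimal sketch (not the paper's code) of cutting a pairwise preference dataset down to the first k tokens of each response before reward-model or DPO training. The field names ("prompt", "chosen", "rejected"), the choice of tokenizer, and the cutoff k are illustrative assumptions.

```python
# Minimal sketch: truncate preference pairs to their first k tokens before
# standard reward-model or DPO training. Field names and k are assumptions.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any tokenizer would do

def truncate_response(text: str, k: int = 128) -> str:
    """Keep only the first k tokens of a response, decoded back to text."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"][:k]
    return tokenizer.decode(ids)

def truncate_preference_example(example: dict, k: int = 128) -> dict:
    """Truncate both the preferred and dispreferred responses of one pair."""
    return {
        "prompt": example["prompt"],
        "chosen": truncate_response(example["chosen"], k),
        "rejected": truncate_response(example["rejected"], k),
    }

# The truncated pairs can then be fed to any ordinary DPO or reward-model
# training pipeline in place of the full-length responses.
pair = {
    "prompt": "Explain why the sky is blue.",
    "chosen": "The sky appears blue because shorter wavelengths scatter more...",
    "rejected": "The sky is blue because the ocean reflects its color upward...",
}
print(truncate_preference_example(pair, k=16))
```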