Steering off Course: Reliability Challenges in Steering Language Models

Best AI papers explained - A podcast by Enoch H. Kang

This paper investigates the reliability of language model (LM) steering methods, which aim to modify model behavior without retraining. The authors examine three techniques (DoLa, function vectors, and task vectors) across a wide range of LMs and find that their effectiveness varies significantly across models and tasks. Contrary to prior work suggesting consistent performance or a clean localization of function within models, the study shows that these steering methods are often brittle: common assumptions about internal transformer mechanisms prove flawed, and in many cases steering degrades performance rather than improving it. The authors call for more rigorous evaluation of steering methods across diverse models before they can be considered dependable.
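
To make the notion of steering concrete, below is a minimal sketch of activation-addition steering in the spirit of task and function vectors, assuming a Hugging Face causal LM. The model name, injection layer, scaling factor, and the random placeholder vector are illustrative assumptions, not details from the paper; in the actual methods, the steering vector would be derived from model activations (e.g., averaged over in-context demonstration prompts).

```python
# Minimal sketch of activation-addition steering (task/function-vector style).
# Assumptions: GPT-2 as a stand-in model; LAYER and ALPHA are arbitrary
# illustrative choices; the steering vector is random here, whereas real
# methods derive it from activations collected on demonstration prompts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model, not one evaluated in the paper
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

LAYER = 6    # hypothetical injection layer
ALPHA = 4.0  # hypothetical steering strength

steer = torch.randn(model.config.hidden_size)
steer = steer / steer.norm()  # unit-normalize for a controlled magnitude

def add_vector(module, inputs, output):
    # GPT-2 blocks return a tuple whose first entry is the hidden states;
    # add the (scaled) steering vector at every token position.
    hidden = output[0] + ALPHA * steer.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_vector)
ids = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=8,
                         pad_token_id=tok.eos_token_id)
handle.remove()  # detach the hook to restore default behavior
print(tok.decode(out[0], skip_special_tokens=True))
```

In these terms, the paper's central finding is that the layer and scale that help one model can hurt another, which is why per-model validation of steering methods matters.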