Learning How Hard to Think: Input-Adaptive Allocation of LM Computation

Best AI papers explained - A podcast by Enoch H. Kang

This paper introduces an approach for optimizing the computational resources that language models (LMs) spend on different queries. Instead of applying the same amount of computation to every request, the method learns to predict how much each query would benefit from more intensive processing and then allocates resources adaptively. Concretely, a model is trained to estimate the marginal reward, the expected improvement in output quality, of giving a particular input a larger computation budget. The paper demonstrates the idea with two decision procedures: dynamically adjusting the number of samples that are generated and reranked per query, and routing each query to either a cheaper or a more capable decoding procedure. Experiments on coding, mathematics, and chat tasks show that adaptive allocation can yield significant computational savings or improved output quality relative to a uniform allocation.
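
To make the allocation idea concrete, here is a minimal sketch (not the authors' code) of the first decision procedure: a learned predictor scores the marginal reward of drawing one more sample for each query, and a greedy loop spends a fixed total sample budget wherever that predicted gain is highest. Everything here, including `predict_marginal_reward` and its length-based difficulty proxy, is a hypothetical stand-in for the paper's trained reward model.

```python
# Sketch: input-adaptive allocation of a fixed sample budget across queries,
# driven by a (hypothetical) learned predictor of marginal reward.

import heapq

def predict_marginal_reward(query: str, k: int) -> float:
    """Hypothetical stand-in for a learned model that estimates the
    expected quality gain from raising this query's sample count
    from k to k + 1. Fakes diminishing returns, scaled by query
    length as a crude difficulty proxy."""
    difficulty = min(len(query) / 100.0, 1.0)
    return difficulty / (k + 1)  # marginal gain shrinks as k grows

def allocate_samples(queries: list[str], total_budget: int) -> dict[str, int]:
    """Greedily assign samples one at a time to whichever query
    currently has the highest predicted marginal reward."""
    assert total_budget >= len(queries), "need at least one sample per query"
    alloc = {q: 1 for q in queries}  # every query gets at least one sample
    # Max-heap on predicted marginal reward (negated, since heapq is a min-heap).
    heap = [(-predict_marginal_reward(q, 1), q) for q in queries]
    heapq.heapify(heap)
    for _ in range(total_budget - len(queries)):
        _, q = heapq.heappop(heap)
        alloc[q] += 1
        heapq.heappush(heap, (-predict_marginal_reward(q, alloc[q]), q))
    return alloc

if __name__ == "__main__":
    queries = [
        "What is 2 + 2?",
        "Write a Python function that parses an ISO-8601 timestamp "
        "and converts it to a Unix epoch, handling timezone offsets.",
    ]
    print(allocate_samples(queries, total_budget=8))
    # The harder (longer) query should receive most of the budget.
```

The routing variant can be sketched the same way: compare the predicted marginal reward of sending a query to the more capable decoding procedure against its extra cost, and route only the queries that clear that threshold.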