Adaptive Inference-Time Compute: LLMs Can Predict if They Can Do Better, Even Mid-Generation

Best AI papers explained - A podcast by Enoch H. Kang

This paper presents research on adaptive inference-time compute for large language models (LLMs), aimed at improving both performance and efficiency. The core idea is to train LLMs to perform capability-aware, mid-generation self-evaluations, so the model can predict whether restarting a response would yield a better result without needing an external reward model. The paper demonstrates two techniques that leverage this capability: adaptive sampling, which draws additional samples only when the model predicts a resample will help, and early pruning, which stops unpromising responses partway through generation. These methods achieve significant performance improvements, such as a higher win rate against GPT-4 on AlpacaEval and increased accuracy on GSM8K math problems, while substantially reducing the average number of samples and tokens generated.
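
To make the two techniques concrete, here is a minimal Python sketch of how they could be wired up. All helper names are hypothetical stand-ins, not the paper's implementation: `generate` returns one full sample, `generate_stream` yields a response in incremental chunks, and `restart_prob` is the model's own self-evaluated probability that restarting would produce a better response. The thresholds and sample budgets are illustrative defaults.

```python
from typing import Callable, Iterable, Optional


def adaptive_sampling(prompt: str,
                      generate: Callable[[str], str],
                      restart_prob: Callable[[str, str], float],
                      max_samples: int = 8,
                      threshold: float = 0.5) -> str:
    """Draw extra samples only while the model itself predicts that
    restarting would likely yield a better response."""
    best = generate(prompt)
    best_p = restart_prob(prompt, best)
    for _ in range(max_samples - 1):
        if best_p < threshold:      # model is satisfied: stop spending compute
            break
        cand = generate(prompt)
        cand_p = restart_prob(prompt, cand)
        if cand_p < best_p:         # lower restart probability = stronger sample
            best, best_p = cand, cand_p
    return best


def early_pruning(prompt: str,
                  generate_stream: Callable[[str], Iterable[str]],
                  restart_prob: Callable[[str, str], float],
                  check_every: int = 4,
                  threshold: float = 0.9) -> Optional[str]:
    """Stop an in-flight response mid-generation when the model's
    self-evaluation says a fresh attempt would probably do better."""
    partial, chunks = "", 0
    for chunk in generate_stream(prompt):
        partial += chunk
        chunks += 1
        if chunks % check_every == 0 and restart_prob(prompt, partial) > threshold:
            return None             # prune; the caller resamples instead
    return partial
```

In this framing, a single self-evaluation score does double duty: a low restart probability ranks finished samples in adaptive sampling, while a high one gates mid-generation pruning, which is how the savings in samples and tokens arise.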