Limits to scalable evaluation at the frontier: LLM as Judge won’t beat twice the data

Best AI papers explained - A podcast by Enoch H. Kang


This paper examines the limitations of using large language models (LLMs) as judges for evaluating other models, particularly at the "evaluation frontier," where new models may be more capable than the judge. LLM judges are a promising route to scalable evaluation because human annotation is costly and a bottleneck, but they introduce biases that can distort model rankings. The researchers show that existing debiasing methods, even when combined with a small set of high-quality ground-truth labels, offer limited gains in sample efficiency when the judge is not substantially more accurate than the model being evaluated. Specifically, the maximum possible saving in ground-truth data is a factor of two, suggesting that LLM judges cannot fully replace expert annotation for evaluating state-of-the-art models.
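To make the setup concrete, here is a minimal Python sketch of one common debiasing pattern (a control-variate style correction in the spirit of prediction-powered inference), not code from the paper: many examples are scored only by the LLM judge, and a small gold set with human labels is used to estimate and subtract the judge's bias. All names and numbers (`n_gold`, `judge_agreement`, the simulated data) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative synthetic setup: each example has a true 0/1 correctness label,
# and the LLM judge matches the human label with a fixed agreement rate.
n_unlabeled = 10_000    # examples scored only by the LLM judge
n_gold = 500            # examples with both judge scores and human labels
true_accuracy = 0.70    # accuracy of the model being evaluated
judge_agreement = 0.75  # how often the judge matches the human label

def sample(n):
    truth = rng.random(n) < true_accuracy
    flip = rng.random(n) > judge_agreement
    judge = np.where(flip, ~truth, truth)
    return truth.astype(float), judge.astype(float)

truth_gold, judge_gold = sample(n_gold)
_, judge_unlabeled = sample(n_unlabeled)

# Debiased estimate: judge mean on the large unlabeled pool, plus a bias
# correction measured on the small gold set where both labels are available.
naive_judge_estimate = judge_unlabeled.mean()
bias_correction = (truth_gold - judge_gold).mean()
debiased_estimate = naive_judge_estimate + bias_correction

print(f"gold-only estimate:   {truth_gold.mean():.3f}")
print(f"naive judge estimate: {naive_judge_estimate:.3f}")
print(f"debiased estimate:    {debiased_estimate:.3f}")
```

The variance of the corrected estimate is dominated by how often the judge disagrees with the human labels on the gold set, so when the judge is not much more reliable than the model it evaluates, the correction cannot shrink the number of gold labels needed by much. That is the intuition behind the factor-of-two ceiling discussed in the episode.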