IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis
Best AI papers explained - A podcast by Enoch H. Kang

This paper describes IDA-Bench, a new benchmark for evaluating Large Language Models (LLMs) as interactive data analysis agents. Unlike existing benchmarks that focus on single-turn interactions, IDA-Bench assesses LLMs in multi-round dialogues with a simulated user, mirroring the iterative and subjective nature of real-world data analysis. Tasks are derived from complex Kaggle notebooks and presented as sequential natural-language instructions. Initial results indicate that even advanced LLMs struggle with these multi-turn scenarios, highlighting the need to improve their instruction-following and reasoning capabilities for effective data analysis. The benchmark uses a sandboxed environment for code execution and evaluates performance by comparing agent output against a human-derived baseline. The findings also reveal distinct working styles and common failure modes among current LLM agents.
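To make the evaluation protocol concrete, here is a minimal sketch of a multi-turn loop of the kind described above: a simulated user replays sequential instructions, the agent answers with analysis code, the code runs in a sandbox, and the final result is compared to a human-derived baseline. All names here (SimulatedUser, run_in_sandbox, evaluate_agent, score_fn) are hypothetical illustrations, not the actual IDA-Bench API.

```python
# Hypothetical sketch of a multi-round agent evaluation loop (not the IDA-Bench code).

from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class SimulatedUser:
    """Replays a Kaggle-derived task as sequential natural-language instructions."""
    instructions: List[str]
    _turn: int = 0

    def next_instruction(self) -> Optional[str]:
        if self._turn >= len(self.instructions):
            return None                      # dialogue finished
        msg = self.instructions[self._turn]
        self._turn += 1
        return msg


def run_in_sandbox(code: str, namespace: dict) -> str:
    """Execute agent-written code in an isolated namespace; report errors as feedback."""
    try:
        exec(code, namespace)                # placeholder for a real sandboxed runtime
        return "ok"
    except Exception as exc:
        return f"error: {exc}"


def evaluate_agent(agent: Callable[[str, str], str],
                   user: SimulatedUser,
                   baseline_score: float,
                   score_fn: Callable[[dict], float]) -> bool:
    """Drive the multi-round dialogue, then compare against a human-derived baseline."""
    namespace: dict = {}
    feedback = ""
    while (instruction := user.next_instruction()) is not None:
        code = agent(instruction, feedback)  # agent responds with analysis code
        feedback = run_in_sandbox(code, namespace)
    return score_fn(namespace) >= baseline_score


# Toy usage with a stub agent on a two-step task.
if __name__ == "__main__":
    user = SimulatedUser(instructions=[
        "Load the data: just set df = [1, 2, 3].",
        "Compute the mean of df and store it as result.",
    ])
    stub_agent = lambda instr, fb: (
        "df = [1, 2, 3]" if "Load" in instr else "result = sum(df) / len(df)"
    )
    passed = evaluate_agent(stub_agent, user,
                            baseline_score=2.0,
                            score_fn=lambda ns: ns.get("result", 0.0))
    print("matches human baseline:", passed)
```

In the sketch, the error string returned by the sandbox is fed back to the agent on the next turn, which mirrors how an interactive analyst would see and react to execution failures across rounds.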