Interpreting Emergent Planning in Model-Free Reinforcement Learning
Best AI papers explained - A podcast by Enoch H. Kang

This paper investigates whether a model-free reinforcement learning agent, specifically a DRC (Deep Repeated ConvLSTM) agent playing the puzzle game Sokoban, learns to plan. The authors apply a concept-based interpretability methodology: they probe the agent's internal activations for planning-relevant concepts such as future agent and box movements, investigate how plans are formed internally, and verify the causal link between internal representations and behavior through targeted interventions. This yields mechanistic evidence of emergent planning: the agent forms internal plans resembling a parallelized bidirectional search, and it evaluates and adapts those plans as it acts. The study also links the emergence of this planning ability to the agent's improved performance when given additional computation time, and it extends the analysis to different agent architectures and to a second environment, Mini PacMan.
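For listeners who want a concrete feel for the probing step, here is a minimal sketch of a linear concept probe over agent activations. The array names, shapes, label scheme, and the use of scikit-learn's LogisticRegression are illustrative assumptions for the sketch, not the paper's exact pipeline; the random stand-in data merely makes the snippet runnable.

```python
# Minimal sketch: train a linear probe to decode a planning-relevant
# concept (e.g., the direction a box will next move) from an RL agent's
# internal activations. The agent and dataset are assumed; names and
# shapes here are illustrative, not the paper's exact implementation.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Assumed dataset: one row per (state, grid square), pairing the agent's
# hidden activations at that square with a concept label in
# {0: no-move, 1: up, 2: down, 3: left, 4: right}.
n_samples, d_hidden = 10_000, 64
activations = rng.normal(size=(n_samples, d_hidden))   # stand-in for real hidden states
concept_labels = rng.integers(0, 5, size=n_samples)    # stand-in for real labels

X_train, X_test, y_train, y_test = train_test_split(
    activations, concept_labels, test_size=0.2, random_state=0
)

# A linear probe: if a simple linear readout predicts the concept well
# above chance, the concept is linearly represented in the activations.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print(f"probe accuracy: {probe.score(X_test, y_test):.3f}")

# Sketch of the intervention step: nudge activations along the probe's
# direction for a target class, then check whether the agent's behavior
# shifts accordingly (the behavioral check itself is omitted here).
target_class = 4  # e.g., "box moves right"
direction = probe.coef_[target_class]
direction /= np.linalg.norm(direction)
steered = X_test + 2.0 * direction  # steering strength is an arbitrary choice
```

On the random stand-in data the probe will score near chance (about 0.2); with real agent activations, well-above-chance accuracy is the signal that the concept is linearly decodable, and the intervention step is what upgrades that correlation to causal evidence.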