Async-TB: Asynchronous Trajectory Balance for Scalable LLM RL

Best AI papers explained - A podcast by Enoch H. Kang - Fridays

This paper introduces Trajectory Balance with Asynchrony (TBA), a distributed reinforcement learning framework for efficient and scalable post-training of large language models. TBA decouples data generation, handled by multiple "searcher" nodes, from policy updates, performed by a single "trainer" node that optimizes an off-policy objective called Trajectory Balance. The searchers store the diverse experiences they generate in a central replay buffer, and the trainer learns from that buffer continuously instead of waiting for fresh on-policy data. The paper argues that TBA overcomes limitations of existing on-policy methods, yielding faster training and improved performance on mathematical reasoning, preference tuning, and automated red-teaming.
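
To make the searcher/trainer split concrete, here is a minimal toy sketch in Python. It is not the paper's implementation. Threads stand in for nodes, a three-way categorical distribution stands in for the language model, and the reward values are made up. Every name in it (`ReplayBuffer`, `Policy`, `searcher`, `tb_step`, `train`, `tb_residual`) is hypothetical. The loss is Trajectory Balance in its simplest form, with a learned scalar log Z and no backward-policy term, since a left-to-right sequence has only one way to be built. The paper's exact objective (for example, how log Z is estimated or how a reference policy enters the reward) may differ.

```python
import collections
import math
import random
import threading
import time

# Hypothetical toy task: three candidate responses with made-up rewards.
LOG_REWARD = [math.log(1.0), math.log(2.0), math.log(5.0)]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def tb_residual(log_z, log_prob, log_reward):
    """Trajectory Balance residual; the per-sample loss is its square."""
    return log_z + log_prob - log_reward


class ReplayBuffer:
    """Thread-safe FIFO buffer shared by the searchers and the trainer."""

    def __init__(self, capacity):
        self._items = collections.deque(maxlen=capacity)
        self._lock = threading.Lock()

    def add(self, item):
        with self._lock:
            self._items.append(item)

    def sample(self, k, rng):
        with self._lock:
            if not self._items:
                return []
            return [rng.choice(self._items) for _ in range(k)]


class Policy:
    """Toy policy: logits over responses plus a learned log-partition value."""

    def __init__(self, n):
        self.logits = [0.0] * n
        self.log_z = 0.0
        self.lock = threading.Lock()


def searcher(policy, buffer, stop, seed):
    """Generate experience from whatever weights are current, until stopped."""
    rng = random.Random(seed)
    while not stop.is_set():
        with policy.lock:
            probs = softmax(policy.logits)  # may be stale by the time it is used
        action = rng.choices(range(len(probs)), weights=probs)[0]
        buffer.add((action, LOG_REWARD[action]))


def tb_step(policy, batch, lr):
    """One SGD pass over a batch, using the analytic gradient of residual**2."""
    with policy.lock:
        for action, log_r in batch:
            probs = softmax(policy.logits)
            delta = tb_residual(policy.log_z, math.log(probs[action]), log_r)
            policy.log_z -= lr * 2.0 * delta
            for j, p in enumerate(probs):
                indicator = 1.0 if j == action else 0.0
                policy.logits[j] -= lr * 2.0 * delta * (indicator - p)


def train(num_searchers=2, steps=2000, batch_size=8, lr=0.05, seed=0):
    policy = Policy(len(LOG_REWARD))
    buffer = ReplayBuffer(capacity=512)
    stop = threading.Event()
    workers = [
        threading.Thread(target=searcher, args=(policy, buffer, stop, seed + 1 + i), daemon=True)
        for i in range(num_searchers)
    ]
    for w in workers:
        w.start()
    rng = random.Random(seed)
    done = 0
    while done < steps:
        batch = buffer.sample(batch_size, rng)  # off-policy: data may come from old weights
        if not batch:
            time.sleep(0.001)  # buffer still empty, give the searchers a moment
            continue
        tb_step(policy, batch, lr)
        done += 1
    stop.set()
    for w in workers:
        w.join()
    return softmax(policy.logits), policy.log_z


if __name__ == "__main__":
    probs, log_z = train()
    print(probs, log_z)
```

Calling `train()` returns the learned response probabilities and the learned log Z. In this toy setting the residual is zero for every response exactly when the policy is proportional to the rewards, so the probabilities should drift toward 1/8, 2/8 and 5/8 regardless of which (possibly stale) weights produced the buffered samples. That insensitivity to where the data came from is what lets the trainer run without waiting on the searchers.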