Reward Shaping from Confounded Offline Data
Best AI papers explained - A podcast by Enoch H. Kang

This academic paper explores a novel technique for automatic reward shaping in reinforcement learning, specifically addressing the challenge of learning from offline data that may contain unobserved confounding factors. The authors propose using causal state value upper bounds, derived from this confounded data, as potential functions for Potential-Based Reward Shaping (PBRS). They demonstrate theoretically and through simulations that their method, when applied to a model-free learner like Q-UCB, leads to improved learning efficiency and a better regret bound compared to approaches without this causal shaping. The work focuses on Confounded Markov Decision Processes (CMDPs), explicitly modeling hidden confounders, and introduces a new algorithm for learning these robust potential functions from offline data, aiming to accelerate the learning of optimal policies for online agents.
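To make the core idea concrete, here is a minimal sketch (not the authors' code) of how a causal state-value upper bound could serve as the potential in standard Potential-Based Reward Shaping, and how the shaped reward would feed an optimistic tabular Q-learning update as a simplified stand-in for Q-UCB. The names `v_upper`, `n_states`, `n_actions`, and `bonus` are illustrative assumptions; the PBRS transform itself (r' = r + γΦ(s') − Φ(s)) is the standard formulation.

```python
import numpy as np

# Hypothetical potential: a per-state upper bound on the causal value,
# assumed to have already been estimated from the confounded offline data.
# (All names here are illustrative, not taken from the paper.)
n_states, n_actions = 10, 4
gamma = 0.9
rng = np.random.default_rng(0)
v_upper = rng.uniform(0.0, 1.0 / (1.0 - gamma), size=n_states)

def shaped_reward(r, s, s_next):
    """Standard PBRS transform: r' = r + gamma * Phi(s') - Phi(s),
    with the potential Phi taken to be the causal state-value upper bound."""
    return r + gamma * v_upper[s_next] - v_upper[s]

# Simplified optimistic Q-learning update using the shaped reward;
# `bonus` stands in for the exploration bonus of a Q-UCB-style learner.
Q = np.zeros((n_states, n_actions))
def q_update(s, a, r, s_next, alpha=0.1, bonus=0.05):
    target = shaped_reward(r, s, s_next) + bonus + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])
```

Because PBRS only adds a difference of potentials, it leaves the optimal policy unchanged; the paper's contribution is in showing that a causal upper bound learned from confounded offline data is a potential that provably speeds up the online learner.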