TY - GEN
T1 - Optimistic Posterior Sampling for Reinforcement Learning with Few Samples and Tight Guarantees
AU - Tiapkin, Daniil
AU - Belomestny, Denis
AU - Calandriello, Daniele
AU - Moulines, Éric
AU - Munos, Remi
AU - Naumov, Alexey
AU - Rowland, Mark
AU - Valko, Michal
AU - Ménard, Pierre
N1 - Publisher Copyright: © 2022 Neural Information Processing Systems Foundation. All rights reserved.
PY - 2022/1/1
Y1 - 2022/1/1
N2 - We consider reinforcement learning in an environment modeled by an episodic, finite, stage-dependent Markov decision process of horizon H with S states and A actions. The performance of an agent is measured by the regret after interacting with the environment for T episodes. We propose an optimistic posterior sampling algorithm for reinforcement learning (OPSRL), a simple variant of posterior sampling that only needs a number of posterior samples logarithmic in H, S, A, and T per state-action pair. For OPSRL we guarantee a high-probability regret bound of order at most (Equation presented) ignoring polylog(HSAT) terms. The key novel technical ingredient is a new sharp anti-concentration inequality for linear forms, which may be of independent interest. Specifically, we extend the normal approximation-based lower bound for Beta distributions by Alfers and Dinges [1984] to Dirichlet distributions. Our bound matches the lower bound of order (Equation presented), thereby answering the open problems raised by Agrawal and Jia [2017b] for the episodic setting.
UR - https://www.scopus.com/pages/publications/85163195911
M3 - Conference contribution
AN - SCOPUS:85163195911
T3 - Advances in Neural Information Processing Systems
BT - Advances in Neural Information Processing Systems 35 - 36th Conference on Neural Information Processing Systems, NeurIPS 2022
A2 - Koyejo, S.
A2 - Mohamed, S.
A2 - Agarwal, A.
A2 - Belgrave, D.
A2 - Cho, K.
A2 - Oh, A.
PB - Neural Information Processing Systems Foundation
T2 - 36th Conference on Neural Information Processing Systems, NeurIPS 2022
Y2 - 28 November 2022 through 9 December 2022
ER -