Optimistic Posterior Sampling for Reinforcement Learning with Few Samples and Tight Guarantees

  • Daniil Tiapkin
  • Denis Belomestny
  • Daniele Calandriello
  • Éric Moulines
  • Remi Munos
  • Alexey Naumov
  • Mark Rowland
  • Michal Valko
  • Pierre Ménard
  • National Research University
  • University of Duisburg-Essen
  • DeepMind Technologies Limited
  • École Polytechnique
  • Ecole Normale Supérieure de Lyon

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

We consider reinforcement learning in an environment modeled by an episodic, finite, stage-dependent Markov decision process of horizon $H$ with $S$ states and $A$ actions. The performance of an agent is measured by the regret after interacting with the environment for $T$ episodes. We propose an optimistic posterior sampling algorithm for reinforcement learning (OPSRL), a simple variant of posterior sampling that only needs a number of posterior samples logarithmic in $H$, $S$, $A$, and $T$ per state-action pair. For OPSRL we guarantee a high-probability regret bound of order at most $\widetilde{\mathcal{O}}(\sqrt{H^3SAT})$, ignoring $\mathrm{poly}\log(HSAT)$ terms. The key novel technical ingredient is a new sharp anti-concentration inequality for linear forms, which may be of independent interest. Specifically, we extend the normal approximation-based lower bound for Beta distributions by Alfers and Dinges [1984] to Dirichlet distributions. Our bound matches the lower bound of order $\Omega(\sqrt{H^3SAT})$, thereby answering the open problems raised by Agrawal and Jia [2017b] for the episodic setting.
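
The idea of drawing a few posterior samples per state-action pair and acting on the most favorable one can be illustrated with a minimal sketch. This is not the authors' algorithm as specified in the paper: the function name `optimistic_q_value`, the symmetric Dirichlet prior, the known reward, and the example numbers are all assumptions made for illustration. It only shows the general pattern of sampling transition vectors from a Dirichlet posterior, evaluating the linear form of each sample against next-stage values, and keeping the maximum.

```python
import math

import numpy as np


def optimistic_q_value(counts, next_values, reward, num_samples, rng, prior=1.0):
    """Most optimistic Q-estimate over several Dirichlet posterior samples.

    counts      : visit counts n(s' | s, a) for one state-action pair, shape (S,)
    next_values : value estimates V(s') at the next stage, shape (S,)
    reward      : immediate reward r(s, a), assumed known here (assumption)
    num_samples : number of posterior samples J (assumption)
    prior       : symmetric Dirichlet pseudo-count (assumption, not the paper's prior)
    """
    alpha = np.asarray(counts, dtype=float) + prior
    # Each row is one sampled transition distribution p_j(. | s, a).
    samples = rng.dirichlet(alpha, size=num_samples)  # shape (J, S)
    # Linear forms <p_j, V>, one per posterior sample.
    sampled_values = samples @ np.asarray(next_values, dtype=float)
    # Optimism: keep the most favorable sample.
    return reward + sampled_values.max()


rng = np.random.default_rng(0)
counts = np.array([3, 0, 1])             # hypothetical visit counts
next_values = np.array([1.0, 0.0, 2.0])  # hypothetical next-stage values

# Illustrative sample budget, logarithmic in H * S * A * T (hypothetical sizes).
H, S, A, T = 5, 3, 2, 1000
num_samples = math.ceil(math.log(H * S * A * T))

q = optimistic_q_value(counts, next_values, reward=0.5,
                       num_samples=num_samples, rng=rng)
print(num_samples, q)
```

Taking the maximum over several samples raises the chance that the estimate is at least the true value, which is where an anti-concentration bound for linear forms of a Dirichlet vector becomes relevant.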

Original language: English
Title of host publication: Advances in Neural Information Processing Systems 35 - 36th Conference on Neural Information Processing Systems, NeurIPS 2022
Editors: S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, A. Oh
Publisher: Neural information processing systems foundation
ISBN (Electronic): 9781713871088
Publication status: Published - 1 Jan 2022
Externally published: Yes
Event: 36th Conference on Neural Information Processing Systems, NeurIPS 2022 - New Orleans, United States
Duration: 28 Nov 2022 to 9 Dec 2022

Publication series

Name: Advances in Neural Information Processing Systems
Volume: 35
ISSN (Print): 1049-5258

Conference

Conference: 36th Conference on Neural Information Processing Systems, NeurIPS 2022
Country/Territory: United States
City: New Orleans
Period: 28/11/22 to 9/12/22
