Tight regret bounds for model-based reinforcement learning with greedy policies

  • Yonathan Efroni
  • Nadav Merlis
  • Mohammad Ghavamzadeh
  • Shie Mannor

Research output: Contribution to journal › Conference article › peer-review

Abstract

State-of-the-art efficient model-based Reinforcement Learning (RL) algorithms typically act by iteratively solving empirical models, i.e., by performing full planning on Markov Decision Processes (MDPs) built from the gathered experience. In this paper, we focus on model-based RL in the finite-state, finite-horizon, undiscounted MDP setting and establish that exploring with greedy policies, i.e., acting by 1-step planning, can achieve tight minimax performance in terms of regret, Õ(√(HSAT)). Thus, full planning in model-based RL can be avoided altogether without any performance degradation, and, by doing so, the computational complexity decreases by a factor of S. The results are based on a novel analysis of real-time dynamic programming, which is then extended to model-based RL. Specifically, we generalize existing algorithms that perform full planning to act by 1-step planning. For these generalizations, we prove regret bounds with the same rate as their full-planning counterparts.
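
To make the computational contrast concrete, the following is a minimal Python sketch of what acting by 1-step planning means in this setting. The function name, array layout, and the folding of any exploration bonus into R_hat are illustrative assumptions, not the paper's exact algorithm: rather than re-solving the whole empirical MDP with full value iteration, the agent performs a single Bellman backup at the state it currently visits and acts greedily with respect to the result.

    import numpy as np

    def one_step_greedy_backup(Q, V, s, h, P_hat, R_hat):
        """One Bellman backup at state s, horizon step h (illustrative sketch).

        Q     : (H, S, A)  action-value estimates
        V     : (H+1, S)   state-value estimates, with V[H] = 0
        P_hat : (S, A, S)  empirical transition model
        R_hat : (S, A)     empirical rewards (any exploration bonus folded in)
        """
        # Back up only the visited state: O(S * A) work, versus the
        # O(S^2 * A) per-sweep cost of full planning on the empirical MDP --
        # the factor-of-S saving mentioned in the abstract.
        Q[h, s] = R_hat[s] + P_hat[s] @ V[h + 1]
        V[h, s] = Q[h, s].max()
        return int(Q[h, s].argmax())  # greedy (1-step planning) action

Looping this backup over the states encountered within an episode recovers the real-time dynamic programming flavor that the paper's analysis builds on.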

Original language: English
Journal: Advances in Neural Information Processing Systems
Volume: 32
Publication status: Published - 1 Jan 2019
Externally published: Yes
Event: 33rd Annual Conference on Neural Information Processing Systems, NeurIPS 2019 - Vancouver, Canada
Duration: 8 Dec 2019 - 14 Dec 2019
