RESTARTED BAYESIAN ONLINE CHANGE-POINT DETECTION FOR NON-STATIONARY MARKOV DECISION PROCESSES

Research output: Contribution to journalConference articlepeer-review

Abstract

We consider the problem of learning in a non-stationary reinforcement learning (RL) environment, where the setting can be fully described by a piecewise stationary discrete-time Markov decision process (MDP). We introduce a variant of the Restarted Bayesian Online Change-Point Detection algorithm (R-BOCPD) that operates on input streams originating from the more general multinomial distribution and provides near-optimal theoretical guarantees in terms of false-alarm rate and detection delay. Based on this, we propose an improved version of the UCRL2 algorithm for MDPs with state transition kernel sampled from a multinomial distribution, which we call R-BOCPD-UCRL2. We perform a finite-time performance analysis and show that R-BOCPD-UCRL2 enjoys a favorable regret bound of O ( DO q ATKT log ( Tδ ) + min KL(θ(+1) ∥ θ() ) , where D is the largest MDP diKT log KδT ℓ ameter from the set of MDPs defining the piecewise stationary MDP setting, O is the finite number of states (constant over all changes), A is the finite number of actions (constant over all changes), KT is the number of change points up to horizon T, and θ() is the transition kernel during the interval [c, c+1), which we assume to be multinomially distributed over the set of states O. Interestingly, the performance bound does not directly scale with the variation in MDP state transition distributions and rewards, ie. can also model abrupt changes. In practice, R-BOCPD-UCRL2 outperforms the state-of-the-art in a variety of scenarios in synthetic environments.

Original languageEnglish
Pages (from-to)715-744
Number of pages30
JournalProceedings of Machine Learning Research
Volume232
Publication statusPublished - 1 Jan 2023
Event2nd Conference on Lifelong Learning Agents, CoLLA 2023 - Montreal, Canada
Duration: 22 Aug 202325 Aug 2023

Fingerprint

Dive into the research topics of 'RESTARTED BAYESIAN ONLINE CHANGE-POINT DETECTION FOR NON-STATIONARY MARKOV DECISION PROCESSES'. Together they form a unique fingerprint.

Cite this