Abstract
Narendra-Shapiro (NS) algorithms are bandit-type algorithms developed in the 1960s. NS-algorithms have been deeply studied in infinite horizon but little non-asymptotic results exist for this type of bandit algorithms. In this paper, we focus on a non-asymptotic study of the regret and address the following question: are Narendra-Shapiro bandit algorithms competitive from this point of view? In our main result, we obtain some uniform explicit bounds for the regret of (over)-penalized-NS algorithms. We also extend to the multi-armed case some convergence properties of penalized-NS algorithms towards a stationary Piecewise Deterministic Markov Process (PDMP). Finally, we establish some new sharp mixing bounds for these processes.
| Original language | English |
|---|---|
| Pages (from-to) | 886-926 |
| Number of pages | 41 |
| Journal | Stochastics |
| Volume | 90 |
| Issue number | 6 |
| DOIs | |
| Publication status | Published - 18 Aug 2018 |
| Externally published | Yes |
Keywords
- Regret
- piecewise deterministic Markov processes
- stochastic bandit algorithms