Abstract
We study how to learn ε-optimal strategies in zero-sum imperfect-information games (IIGs) with trajectory feedback. In this setting, players update their policies sequentially, based on their observations over a fixed number of episodes T. As noted by Steinberger et al. (2020) and McAleer et al. (2022), most existing procedures suffer from high variance because they apply importance sampling over sequences of actions. To reduce this variance, we consider a fixed sampling approach, in which players still update their policies over time but collect their observations through a given fixed sampling policy. Our approach is based on an adaptive Online Mirror Descent (OMD) algorithm that applies OMD locally to each information set, using individually decreasing learning rates and a regularized loss. We show that this approach guarantees a convergence rate of Õ(T^{-1/2}) with high probability and has a near-optimal dependence on the game parameters when applied with the best theoretical choices of learning rates and sampling policies. To achieve these results, we generalize the notion of OMD stabilization, allowing for time-varying regularization with convex increments.
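To make the idea of a local OMD update under a fixed sampling policy concrete, here is a minimal Python sketch for a single information set. It is an illustration only, not the algorithm analyzed in the paper: it assumes a negative-entropy regularizer (so the OMD step becomes an exponential-weights update), a hypothetical learning-rate schedule `base_rate / sqrt(t)`, fixed toy losses, and a plain importance-weighted loss estimate, and it omits the regularized loss mentioned above. All function and parameter names are assumptions.

```python
import numpy as np

def entropic_omd_step(policy, loss_estimate, learning_rate):
    """One OMD step on the simplex with a negative-entropy regularizer."""
    logits = np.log(policy) - learning_rate * loss_estimate
    logits -= logits.max()  # shift for numerical stability
    new_policy = np.exp(logits)
    return new_policy / new_policy.sum()

def run_local_omd(true_losses, sampling_policy, num_episodes, base_rate=0.5, seed=0):
    """Learn a policy at one information set from actions drawn by a FIXED sampling policy.

    Assumptions (not from the paper): fixed toy losses in [0, 1] and a
    learning rate that decreases as base_rate / sqrt(t).
    """
    rng = np.random.default_rng(seed)
    num_actions = len(true_losses)
    policy = np.full(num_actions, 1.0 / num_actions)
    for t in range(1, num_episodes + 1):
        # The action is sampled from the fixed policy, not from the learned one.
        action = rng.choice(num_actions, p=sampling_policy)
        # Importance-weighted estimate: unbiased for the full loss vector.
        loss_estimate = np.zeros(num_actions)
        loss_estimate[action] = true_losses[action] / sampling_policy[action]
        policy = entropic_omd_step(policy, loss_estimate, base_rate / np.sqrt(t))
    return policy

toy_losses = np.array([0.9, 0.1, 0.5])
uniform_sampling = np.full(3, 1.0 / 3.0)
learned_policy = run_local_omd(toy_losses, uniform_sampling, num_episodes=2000)
```

In this toy setup the importance weights depend only on the fixed sampling policy, so their magnitude does not change as the learned policy concentrates on the low-loss action.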
| Original language | English |
|---|---|
| Journal | Advances in Neural Information Processing Systems |
| Volume | 37 |
| Publication status | Published - 1 Jan 2024 |
| Event | 38th Conference on Neural Information Processing Systems (NeurIPS 2024), Vancouver, Canada, 9 Dec 2024 → 15 Dec 2024 |