TY - CONF
T1 - Combinatorial Bandits for Sequential Learning in Colonel Blotto Games
AU - Vu, Dong Quan
AU - Loiseau, Patrick
AU - Silva, Alonso
N1 - Publisher Copyright:
© 2019 IEEE.
PY - 2019/12/1
Y1 - 2019/12/1
N2 - The Colonel Blotto game is a renowned resource allocation problem with a long-standing literature in game theory (spanning almost 100 years). In this work, we propose and study a regret-minimization model in which a learner repeatedly plays the Colonel Blotto game against several adversaries. At each stage, the learner distributes her budget of resources over a fixed number of battlefields to maximize the aggregate value of the battlefields she wins; she wins a battlefield if no adversary allocates more resources to it than she does. We focus on the bandit feedback setting. We first observe that the problem can be modeled as a path planning problem. It is then possible to use the classical ComBand algorithm to guarantee regret that is sub-linear in the time horizon, but this entails two fundamental challenges: (i) the computation is inefficient due to the huge size of the action set, and (ii) the standard exploration distribution leads to a loose guarantee in practice. To address the first, we construct a modified algorithm that can be implemented efficiently by applying a dynamic programming technique called weight pushing; for the second, we propose methods for optimizing the exploration distribution to improve the regret bound.
AB - The Colonel Blotto game is a renowned resource allocation problem with a long-standing literature in game theory (spanning almost 100 years). In this work, we propose and study a regret-minimization model in which a learner repeatedly plays the Colonel Blotto game against several adversaries. At each stage, the learner distributes her budget of resources over a fixed number of battlefields to maximize the aggregate value of the battlefields she wins; she wins a battlefield if no adversary allocates more resources to it than she does. We focus on the bandit feedback setting. We first observe that the problem can be modeled as a path planning problem. It is then possible to use the classical ComBand algorithm to guarantee regret that is sub-linear in the time horizon, but this entails two fundamental challenges: (i) the computation is inefficient due to the huge size of the action set, and (ii) the standard exploration distribution leads to a loose guarantee in practice. To address the first, we construct a modified algorithm that can be implemented efficiently by applying a dynamic programming technique called weight pushing; for the second, we propose methods for optimizing the exploration distribution to improve the regret bound.
UR - https://www.scopus.com/pages/publications/85082440465
U2 - 10.1109/CDC40024.2019.9029186
DO - 10.1109/CDC40024.2019.9029186
M3 - Conference contribution
AN - SCOPUS:85082440465
T3 - Proceedings of the IEEE Conference on Decision and Control
SP - 867
EP - 872
BT - 2019 IEEE 58th Conference on Decision and Control, CDC 2019
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 58th IEEE Conference on Decision and Control, CDC 2019
Y2 - 11 December 2019 through 13 December 2019
ER -