MOTS: Minimax Optimal Thompson Sampling
by
Tianyuan Jin, Pan Xu, Jieming Shi, Xiaokui Xiao, Quanquan Gu
2020
Abstract
Thompson sampling is one of the most widely used algorithms for many online
decision problems, due to its simplicity in implementation and superior
empirical performance over other state-of-the-art methods. Despite its
popularity and empirical success, it has remained an open problem whether
Thompson sampling can achieve the minimax optimal regret O(√(KT)) for
K-armed bandit problems, where T is the total time horizon. In this paper,
we solve this long-standing open problem by proposing a new Thompson sampling
algorithm called MOTS that adaptively truncates the sampling result of the chosen
arm at each time step. We prove that this simple variant of Thompson sampling
achieves the minimax optimal regret bound O(√(KT)) for finite time horizon T,
as well as the asymptotically optimal regret bound as T grows to infinity.
This is the first time that minimax optimality for the multi-armed bandit
problem has been attained by a Thompson sampling type of algorithm.
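The abstract only states that MOTS truncates the sampled value of each arm at every step; the exact truncation rule is in the paper, not here. As a rough illustration of the idea, the sketch below runs Gaussian-posterior Thompson sampling and clips each arm's sample at a confidence-style cap. The cap's form (`mean + sqrt(alpha * log+(T/(K*n)) / n)`) and the parameters `alpha`, `rho` are illustrative assumptions, not the paper's specification.

```python
import math
import random

def truncated_ts_sketch(rewards, T, alpha=4.0, rho=0.5):
    """Hedged sketch of Thompson sampling with per-step truncation.

    rewards[i]() draws one reward for arm i; T is the horizon.
    The clipping threshold is an assumed, MOTS-inspired form.
    Returns the total collected reward.
    """
    K = len(rewards)
    counts = [0] * K
    means = [0.0] * K
    total = 0.0
    # Pull each arm once to initialize its empirical mean.
    for i in range(K):
        r = rewards[i]()
        counts[i] = 1
        means[i] = r
        total += r
    for _ in range(K, T):
        samples = []
        for i in range(K):
            n = counts[i]
            # Gaussian sample centered at the empirical mean.
            theta = random.gauss(means[i], math.sqrt(1.0 / (rho * n)))
            # Truncate the sample at a confidence-style cap (assumed form).
            log_plus = max(0.0, math.log(T / (K * n)))
            cap = means[i] + math.sqrt(alpha * log_plus / n)
            samples.append(min(theta, cap))
        # Play the arm with the largest truncated sample.
        i = max(range(K), key=lambda j: samples[j])
        r = rewards[i]()
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]
        total += r
    return total
```

The truncation damps over-optimistic samples from under-explored arms, which is the mechanism the abstract credits for closing the gap to the minimax rate.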
Archived Files and Locations
application/pdf, 609.6 kB
arxiv.org (repository) · web.archive.org (webarchive)
arXiv: 2003.01803v1