MOTS: Minimax Optimal Thompson Sampling

by Tianyuan Jin, Pan Xu, Jieming Shi, Xiaokui Xiao, Quanquan Gu

Released as an article.

2020  

Abstract

Thompson sampling is one of the most widely used algorithms for many online decision problems, due to its simplicity of implementation and superior empirical performance over other state-of-the-art methods. Despite its popularity and empirical success, it has remained an open problem whether Thompson sampling can achieve the minimax optimal regret O(√(KT)) for K-armed bandit problems, where T is the total time horizon. In this paper, we solve this long-standing open problem by proposing a new Thompson sampling algorithm called MOTS that adaptively truncates the sampling result of the chosen arm at each time step. We prove that this simple variant of Thompson sampling achieves the minimax optimal regret bound O(√(KT)) for finite time horizon T, as well as the asymptotically optimal regret bound as T grows to infinity. This is the first time that minimax optimality for multi-armed bandit problems has been attained by a Thompson sampling type of algorithm.
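The abstract describes MOTS only at a high level: a Thompson sampling variant that truncates the sampled value of each arm at every time step. The paper's exact truncation rule and constants are not given here, so the following is a minimal illustrative sketch, assuming Gaussian posterior samples clipped at a confidence-style ceiling; the parameter alpha and the cap formula are assumptions for illustration, not the authors' algorithm.

```python
import math
import random

def clipped_thompson(means, T, alpha=4.0, seed=0):
    """Illustrative clipped Gaussian Thompson sampling on a Bernoulli bandit.

    `means` are the true arm means (for simulation only); returns the
    cumulative pseudo-regret over T rounds. The clip threshold
    sqrt(alpha/n * log(T/(K*n))) is a hypothetical choice, not the
    paper's exact rule.
    """
    rng = random.Random(seed)
    K = len(means)
    counts = [0] * K      # number of pulls per arm
    emp = [0.0] * K       # empirical mean reward per arm
    best = max(means)
    regret = 0.0
    for t in range(T):
        if t < K:
            arm = t       # pull each arm once to initialize
        else:
            samples = []
            for i in range(K):
                n = counts[i]
                theta = rng.gauss(emp[i], math.sqrt(alpha / n))
                # truncate the sample at an upper-confidence-style cap
                cap = emp[i] + math.sqrt(
                    alpha / n * max(0.0, math.log(T / (K * n))))
                samples.append(min(theta, cap))
            arm = max(range(K), key=lambda i: samples[i])
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        emp[arm] += (reward - emp[arm]) / counts[arm]
        regret += best - means[arm]
    return regret
```

Compared with vanilla Thompson sampling, the only change in this sketch is the `min(theta, cap)` step, which prevents occasional overly optimistic samples from driving excess exploration; this mirrors, at a schematic level, the truncation idea the abstract attributes to MOTS.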

Archived Files and Locations

application/pdf  609.6 kB
file_pfemsq7pvbde5nktgauwd4aani
arxiv.org (repository)
web.archive.org (webarchive)
Type  article
Stage   submitted
Date   2020-03-03
Version   v1
Language   en
arXiv  2003.01803v1
Work Entity
Access all versions, variants, and formats of this work (e.g., pre-prints)
Catalog Record
Revision: 41f8725c-8e65-43c0-a145-44704a0e6226