Thompson Sampling for the MNL-Bandit

by Shipra Agrawal, Vashist Avadhanula, Vineet Goyal, Assaf Zeevi

Released as an article.



We consider a sequential subset selection problem under parameter uncertainty, where at each time step, the decision maker selects a subset of cardinality K from N possible items (arms), and observes a (bandit) feedback in the form of the index of one of the items in said subset, or none. Each item in the index set is ascribed a certain value (reward), and the feedback is governed by a Multinomial Logit (MNL) choice model whose parameters are a priori unknown. The objective of the decision maker is to maximize the expected cumulative rewards over a finite horizon T, or alternatively, minimize the regret relative to an oracle that knows the MNL parameters. We refer to this as the MNL-Bandit problem. This problem is representative of a larger family of exploration-exploitation problems that involve a combinatorial objective, and arise in several important application domains. We present an approach to adapt Thompson Sampling to this problem and show that it achieves near-optimal regret as well as attractive numerical performance.
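To make the feedback model concrete, here is a minimal sketch of MNL choice probabilities, simulated bandit feedback, and a brute-force oracle. It is an illustration under stated assumptions, not the authors' algorithm: the function names, the normalization of the "no choice" option to weight 1, and the oracle searching over subsets of size at most K are all assumptions introduced here, and the Thompson Sampling procedure itself is not reproduced.

```python
import itertools
import random

def mnl_choice_probs(v, subset):
    """MNL probabilities for a subset; the 'none' option has weight 1 (assumed)."""
    denom = 1.0 + sum(v[i] for i in subset)
    p_none = 1.0 / denom
    return p_none, {i: v[i] / denom for i in subset}

def expected_reward(v, r, subset):
    """Expected reward of offering `subset`, given MNL weights v and rewards r."""
    _, probs = mnl_choice_probs(v, subset)
    return sum(r[i] * p for i, p in probs.items())

def sample_feedback(v, subset, rng):
    """Draw bandit feedback: the index of the chosen item, or None."""
    p_none, probs = mnl_choice_probs(v, subset)
    u = rng.random()
    cum = p_none
    if u < cum:
        return None
    chosen = None
    for i, p in probs.items():
        chosen = i          # falls back to the last item on rounding error
        cum += p
        if u < cum:
            break
    return chosen

def oracle_subset(v, r, K):
    """Brute-force best subset of size at most K (hypothetical helper, small N only)."""
    n = len(v)
    candidates = (
        s for k in range(1, K + 1) for s in itertools.combinations(range(n), k)
    )
    best = max(candidates, key=lambda s: expected_reward(v, r, s))
    return best, expected_reward(v, r, best)
```

With these helpers, the per-step regret of offering a subset S is the oracle's expected reward minus expected_reward(v, r, S); a learning algorithm would not have access to v and would have to estimate it from the sampled feedback.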

Type  article
Stage   submitted
Date   2017-06-03
Version   v1
Language   en
arXiv  1706.00977v1