We consider a sequential subset selection problem under parameter
uncertainty, where at each time step, the decision maker selects a subset of
cardinality K from N possible items (arms), and observes (bandit)
feedback in the form of the index of one of the items in that subset, or none.
Each item in the index set is ascribed a certain value (reward), and the
feedback is governed by a Multinomial Logit (MNL) choice model whose parameters
are a priori unknown. The objective of the decision maker is to maximize the
expected cumulative reward over a finite horizon T or, equivalently, to
minimize the regret relative to an oracle that knows the MNL parameters. We
refer to this as the MNL-Bandit problem. This problem is representative of a
larger family of exploration-exploitation problems that involve a combinatorial
objective, and arise in several important application domains. We present an
approach to adapt Thompson Sampling to this problem and show that it achieves
near-optimal regret as well as attractive numerical performance.
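The MNL feedback mechanism can be made concrete with a minimal sketch (the function and variable names below are illustrative, not from the paper): under an MNL model with preference weights v_i, an item i in the offered subset S is chosen with probability v_i / (1 + sum of v_j over S), and the residual mass is the no-choice outcome.

```python
import random

def mnl_choice_probs(weights, subset):
    """MNL choice probabilities for an offered subset.

    weights: dict mapping item index -> preference weight v_i (>= 0).
    Returns a dict mapping each item in `subset` (and None, the
    no-choice outcome) to its choice probability.
    """
    denom = 1.0 + sum(weights[i] for i in subset)
    probs = {i: weights[i] / denom for i in subset}
    probs[None] = 1.0 / denom  # probability that no item is chosen
    return probs

def sample_choice(weights, subset, rng=random):
    """Draw one bandit-feedback observation: the index of the chosen
    item, or None if no item is selected."""
    probs = mnl_choice_probs(weights, subset)
    r = rng.random()
    acc = 0.0
    for item, p in probs.items():
        acc += p
        if r < acc:
            return item
    return None  # guard against floating-point rounding
```

For example, with weights {0: 1.0, 1: 3.0} and offered subset [0, 1], the denominator is 5.0, so item 0 is chosen with probability 0.2, item 1 with probability 0.6, and no item with probability 0.2.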