Improved Optimistic Algorithm For The Multinomial Logit Contextual Bandit
by
Priyank Agrawal, Vashist Avadhanula, Theja Tulabandhula
2020
Abstract
We consider a dynamic assortment selection problem where the goal is to offer
a sequence of assortments of cardinality at most K, out of N items, to
minimize the expected cumulative regret (loss of revenue). The feedback is
given by a multinomial logit (MNL) choice model. This sequential
decision-making problem is studied under the MNL contextual bandit framework.
The existing algorithms for the MNL contextual bandit have frequentist regret
guarantees of Õ(κ√(T)), where κ is an instance-dependent
constant. κ can be arbitrarily large, e.g., exponentially
dependent on the model parameters, causing the existing regret guarantees to be
substantially loose. We propose an optimistic algorithm with a carefully
designed exploration bonus term and show that it enjoys
Õ(√(T)) regret. In our bounds, the κ factor only
affects the poly-log term and not the leading term of the regret bounds.
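
To make the setting concrete, the following is a minimal sketch of MNL choice probabilities and per-round regret for an offered assortment. It assumes the standard linear-utility parameterization (utility of item i is the inner product of a feature vector x_i with a parameter θ, plus a no-purchase option with utility 0), which the abstract does not spell out. All function names, the feature values, and the brute-force search over assortments are hypothetical illustrations, not the authors' algorithm.

```python
import math
from itertools import combinations


def utilities_from_features(features, theta):
    """Assumed linear utility: u_i = <x_i, theta> for each item i."""
    return [sum(x * t for x, t in zip(x_i, theta)) for x_i in features]


def mnl_choice_probs(offered_utilities):
    """Return (p_no_purchase, [p_item, ...]) under the MNL model.

    The no-purchase option is assumed to have utility 0, i.e. weight 1.
    """
    weights = [math.exp(u) for u in offered_utilities]
    denom = 1.0 + sum(weights)
    return 1.0 / denom, [w / denom for w in weights]


def expected_revenue(assortment, utilities, revenues):
    """Expected revenue of offering the item indices in `assortment`."""
    _, probs = mnl_choice_probs([utilities[i] for i in assortment])
    return sum(p * revenues[i] for p, i in zip(probs, assortment))


def best_assortment(utilities, revenues, K):
    """Brute-force search over all assortments of size 1..K (small N only)."""
    n = len(utilities)
    candidates = (c for k in range(1, K + 1) for c in combinations(range(n), k))
    return max(candidates, key=lambda s: expected_revenue(s, utilities, revenues))


def per_round_regret(offered, utilities, revenues, K):
    """Revenue lost in one round by offering `offered` instead of the optimum."""
    best = best_assortment(utilities, revenues, K)
    return (expected_revenue(best, utilities, revenues)
            - expected_revenue(offered, utilities, revenues))


if __name__ == "__main__":
    # Hypothetical instance: N = 4 items, 2-dimensional features, K = 2.
    features = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5), (1.0, 1.0)]
    theta = (0.3, -0.2)
    revenues = [1.0, 0.8, 0.6, 0.4]
    utils = utilities_from_features(features, theta)
    print(per_round_regret((2, 3), utils, revenues, K=2))
```

Cumulative regret is the sum of these per-round quantities over T rounds. A bandit algorithm does not know θ and must estimate it from observed choices, which is where the exploration bonus mentioned above enters.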
arXiv: 2011.14033v1