Improved Optimistic Algorithm For The Multinomial Logit Contextual Bandit

by Priyank Agrawal, Vashist Avadhanula, Theja Tulabandhula

Released as an article.



We consider a dynamic assortment selection problem in which the goal is to offer a sequence of assortments of cardinality at most K, drawn from N items, so as to minimize the expected cumulative regret (loss of revenue). Feedback is given by a multinomial logit (MNL) choice model. This sequential decision-making problem is studied under the MNL contextual bandit framework. Existing algorithms for the MNL contextual bandit have frequentist regret guarantees of Õ(κ√T), where κ is an instance-dependent constant. Because κ can be arbitrarily large, e.g., exponentially dependent on the model parameters, the existing regret guarantees can be substantially loose. We propose an optimistic algorithm with a carefully designed exploration bonus term and show that it enjoys Õ(√T) regret. In our bounds, the κ factor affects only the poly-log term and not the leading term of the regret bound.
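To make the setting concrete, here is a minimal sketch of the feedback model and the per-round objective. It assumes the standard linear-utility MNL form, in which item i in an offered assortment S is chosen with probability exp(x_i·θ) / (1 + Σ_{j∈S} exp(x_j·θ)) and the "1" accounts for the no-purchase option. The abstract does not spell out this formula, so treat it as an assumption. The function names, the per-item revenues, and the brute-force search are illustrative only and are not the algorithm proposed in the paper.

```python
import itertools
import math


def mnl_choice_probs(assortment, features, theta):
    """Return ({item: P(item chosen)}, P(no purchase)) for an offered assortment.

    Assumes the standard MNL form with linear utilities x_i . theta.
    """
    weights = {
        i: math.exp(sum(x * t for x, t in zip(features[i], theta)))
        for i in assortment
    }
    denom = 1.0 + sum(weights.values())  # the 1.0 is the no-purchase option
    return {i: w / denom for i, w in weights.items()}, 1.0 / denom


def expected_revenue(assortment, features, theta, revenues):
    """Expected revenue of one offer: sum of revenue times choice probability."""
    probs, _ = mnl_choice_probs(assortment, features, theta)
    return sum(revenues[i] * p for i, p in probs.items())


def best_assortment(features, theta, revenues, K):
    """Brute-force search over all assortments of cardinality at most K.

    Only practical for small N; used here to define the regret benchmark.
    """
    best, best_rev = (), 0.0
    for size in range(1, K + 1):
        for subset in itertools.combinations(range(len(features)), size):
            rev = expected_revenue(subset, features, theta, revenues)
            if rev > best_rev:
                best, best_rev = subset, rev
    return best, best_rev
```

Under these assumptions, the regret in a single round is the gap between the expected revenue returned by `best_assortment` for the true θ and the expected revenue of the assortment actually offered. The cumulative regret sums this gap over T rounds.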

Archived Files and Locations

application/pdf  784.8 kB
file_sdobkh5rf5a63mt5xxfq3t2aqq (repository) (webarchive)
Type  article
Stage   submitted
Date   2020-11-28
Version   v1
Language   en
arXiv  2011.14033v1
Work Entity
access all versions, variants, and formats of this work (e.g., pre-prints)
Catalog Record
Revision: f75a7534-7e39-4186-9b5e-1faefc91cd1b