{"abstract":"In this paper, we consider the contextual variant of the MNL-Bandit problem.\nMore specifically, we consider a dynamic set optimization problem, where in\nevery round a decision maker offers a subset (assortment) of products to a\nconsumer and observes their response. Consumers purchase products so as to\nmaximize their utility. We assume that the products are described by a set of\nattributes and that the mean utility of a product is linear in the values of\nthese attributes. We model consumer choice behavior by means of the widely used\nMultinomial Logit (MNL) model, and consider the decision maker's problem of\ndynamically learning the model parameters while optimizing cumulative revenue\nover the selling horizon $T$. Though this problem has recently attracted\nconsiderable attention, existing methods often involve solving an intractable\nnon-convex optimization problem, and their theoretical performance guarantees\ndepend on a problem-dependent parameter that could be prohibitively large. In\nparticular, existing algorithms for this problem have regret bounded by\n$O(\\sqrt{\\kappa d T})$, where $\\kappa$ is a problem-dependent constant that\ncan depend exponentially on the number of attributes. We propose an optimistic\nalgorithm and show that its regret is bounded by $O(\\sqrt{dT} + \\kappa)$,\nsignificantly improving on existing methods. Further, we propose a convex\nrelaxation of the optimization step, which allows for tractable decision-making\nwhile retaining the favourable regret guarantee.","author":[{"family":"Agrawal"},{"family":"Avadhanula"},{"family":"Tulabandhula"}],"id":"unknown","issued":{"date-parts":[[2021,3,7]]},"language":"en","title":"A Tractable Online Learning Algorithm for the Multinomial Logit Contextual Bandit","type":"article"}