Nearly Minimax Optimal Regret for Multinomial Logistic Bandit (Top 0.2%, 32/15671)
Joongkyu Lee, Min-hwan Oh
September 2024

Abstract
In this paper, we study the contextual multinomial logit (MNL) bandit problem, in which a learning agent sequentially selects an assortment based on contextual information, and user feedback follows an MNL choice model. There has been a significant discrepancy between lower and upper regret bounds, particularly regarding the maximum assortment size K. Additionally, the variation in reward structures between these bounds complicates the quest for optimality. Under uniform rewards, where all items have the same expected reward, we establish a regret lower bound of Ω(d√(T/K)) and propose a constant-time algorithm, OFU-MNL+, that achieves a matching upper bound of Õ(d√(T/K)). Under non-uniform rewards, we prove a lower bound of Ω(d√T) and an upper bound of Õ(d√T), also achievable by OFU-MNL+. Our empirical studies support these theoretical findings. To the best of our knowledge, this is the first work in the contextual MNL bandit literature to prove minimax optimality — for either the uniform or non-uniform reward setting — and to propose a computationally efficient algorithm that achieves this optimality up to logarithmic factors.
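To make the feedback model concrete, the sketch below computes choice probabilities under a standard MNL model with an outside (no-purchase) option; the function and variable names are illustrative assumptions, not code from the paper:

```python
import numpy as np

def mnl_choice_probs(theta, features, assortment):
    """Choice probabilities under an MNL model with an outside option.

    theta:      parameter vector of shape (d,)  -- illustrative; the paper's
                true parameter is unknown to the agent and must be learned
    features:   item feature matrix of shape (N, d)
    assortment: list of indices of offered items (size at most K)
    """
    utilities = features[assortment] @ theta   # x_i^T theta for each offered item
    weights = np.exp(utilities)
    denom = 1.0 + weights.sum()                # the "1" is the outside option's weight
    probs = weights / denom                    # P(user picks item i from assortment)
    outside = 1.0 / denom                      # P(user picks nothing)
    return probs, outside

# Example: with zero utilities, all choices (including the outside option)
# from a two-item assortment are equally likely.
probs, outside = mnl_choice_probs(np.zeros(3), np.eye(3), [0, 1])
```

The probabilities over the offered items plus the outside option always sum to one, which is what makes the observed feedback a single multinomial draw per round.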
Publication
Neural Information Processing Systems (NeurIPS), 2024

Ph.D. candidate in Data Science