In this paper, we study the contextual multinomial logit (MNL) bandit problem, in which a learning agent sequentially selects an assortment based on contextual information and user feedback follows an MNL choice model. There has been a significant gap between the known lower and upper regret bounds, particularly with respect to the maximum assortment size $K$. Moreover, these bounds are derived under differing reward structures, which further complicates the search for optimality. Under uniform rewards, where all items have the same expected reward, we establish a regret lower bound of $\Omega(d\sqrt{\smash[b]{T/K}})$ and propose a constant-time algorithm, $\texttt{OFU-MNL+}$, that achieves a matching upper bound of $\tilde{\mathcal{O}}(d\sqrt{\smash[b]{T/K}})$. Under non-uniform rewards, we prove a lower bound of $\Omega(d\sqrt{T})$ and an upper bound of $\tilde{\mathcal{O}}(d\sqrt{T})$, which $\texttt{OFU-MNL+}$ also achieves. Our empirical studies support these theoretical findings. To the best of our knowledge, this is the first work in the contextual MNL bandit literature to prove minimax optimality, under either uniform or non-uniform rewards, and to propose a computationally efficient algorithm that attains this optimality up to logarithmic factors.