Efficient Online Learning under Bandit Feedback

Abstract: In this thesis we address the multi-armed bandit (MAB) problem with stochastic rewards and correlated arms. In particular, we investigate the case where the expected rewards are a Lipschitz function of the arm, and extend these results to bandits with arbitrary structure known to the decision maker. In these settings, we derive problem-specific regret lower bounds and propose both asymptotically optimal algorithms (OSLB and OSSB, respectively) and Pareto-optimal algorithms (POSLB, and POSB in the generic setting). We further examine the learning-to-rank problem from a MAB perspective. We construct the regret lower bound and determine its closed form in some particular settings, and propose two asymptotically optimal algorithms, PIE and PIE-C. We also present a mathematical model of the learning-to-rank problem in which the need for diversity arises naturally, and devise an order-optimal, numerically competitive algorithm, LDR. For all of the algorithms above, we provide performance analyses in the form of theoretical regret guarantees as well as numerical evaluations on both synthetic and real-world datasets.
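To illustrate the stochastic MAB setting the thesis builds on, here is a minimal sketch of a classic unstructured baseline, UCB1, on a Bernoulli bandit. This is not one of the thesis's algorithms (OSLB, OSSB, POSLB, POSB, PIE, PIE-C, or LDR), and it ignores the arm correlations and structure that those algorithms exploit; arm means and the horizon are arbitrary example values.

```python
import math
import random

def ucb1_regret(means, horizon, seed=0):
    """Run UCB1 on a Bernoulli bandit and return the cumulative regret.

    means   -- true success probability of each arm (unknown to the learner)
    horizon -- number of rounds to play
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k          # number of pulls per arm
    sums = [0.0] * k          # cumulative reward per arm
    best = max(means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1       # pull each arm once to initialise
        else:
            # UCB1 index: empirical mean plus exploration bonus
            arm = max(
                range(k),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]   # expected (pseudo-)regret increment
    return regret

# Example: three arms; regret grows only logarithmically in the horizon.
print(ucb1_regret([0.3, 0.5, 0.7], horizon=5000))
```

Because UCB1 treats the arms as independent, its regret scales with the number of arms; the structured algorithms studied in the thesis obtain lower regret by exploiting Lipschitz continuity or other known structure across arms.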
