Online model selection with stochastic rising bandits

This work lies in the field of stochastic Multi-Armed Bandits (MABs), i.e., the set of those online learning techniques which sequentially select an action (a.k.a. arm) observing only the feedback given by their choice (a.k.a. reward). A particular case arises when the arms' expected rewards (a.k.a. payoffs) are non-stationary and evolve as monotonic non-decreasing functions. We study that scenario in both the restless and the rested MAB formulation, meaning that the reward evolution is triggered by the natural flow of time or by the agent choices, respectively. This is the case of online model selection tasks, in which one would like to optimize the learning strategy during the learning phase itself. The assumptions under the rising bandit problem allow designing specific algorithms which exploit the regularity of the payoffs to provide tight regret guarantees. We design the R-less-UCB and R-ed-UCB algorithms, respectively for the rising restless and rising rested cases, providing a regret bound made of a problem-dependent component and a problem-independent one, which, under certain mild assumptions on the evolution of the reward function, is of order T^(2/3), being T the learning horizon. Finally, we investigate the effectiveness of the proposed solutions by empirically comparing our novel approaches to state-of-the-art non-stationary bandit algorithms, using both synthetic and real-world data.

Questa tesi si pone nell'ambito dei Multi-Armed-Bandits (MABs) stocastici, ovvero l'insieme di quelle tecniche di selezione online che riescono a scegliere un'azione (in gergo tecnico chiamata arm) osservando solamente il risultato (ricompensa) delle loro scelte precedenti. Uno scenario particolare è quello in cui il valore atteso della ricompensa (payoff) è non-stazionario ed evolve come una funzione monotona non-decrescente. Studiamo questo scenario sia nella formulazione di restless MAB sia in quella di rested MAB, ovvero quando l'evoluzione della ricompensa è causata rispettivamente dallo scorrere del tempo o dalle scelte dell'agente. È proprio quest'ultimo il caso dei problemi di selezione dei modelli online, in cui si vuole ottimizzare il modello di learning durante la fase stessa di learning. Le assunzioni alla base di questa formulazione di rising bandits permettono di creare algoritmi specifici che, sfruttando le regolarità dei payoff, forniscono garanzie sul regret. In particolare sono stati proposti gli algoritmi R-less-UCB e R-ed-UCB, rispettivamente per lo scenario rising restless e rising rested, fornendo un limite superiore sul regret composto da un termine dipendente dall'istanza del problema e da uno indipendente, che, sotto assunzioni relative al tasso di crescita della funzione di guadagno, è dell'ordine di T^(2/3), dove T è l'orizzonte di apprendimento. Infine, è stata investigata l'efficacia delle soluzioni proposte attraverso un confronto empirico con gli algoritmi dello stato dell'arte per i bandit non-stazionari, facendo uso sia di dati generati in modo sintetico, sia di dati dal mondo reale.