Quasi-safe bandit algorithms for the bid optimization problem in online advertising

Over the past decade, the annual volume of investments in online advertising campaigns has quintupled. In the US market alone, during 2020, it is estimated that this volume exceeds 135 billion dollars, making it the primary advertising medium. The fundamental advantage of this medium is the control capacity that advertisers have in terms of both personalization of campaigns and evaluation of their impact. The amount of performance data and the wide range of ad types make the optimization of these campaigns not addressable by hand. The development of algorithms for optimization in sequential decision-making problems is well known in the field of Reinforcement Learning. In this work, we focus on the Stochastic Bandit framework to propose solutions that take into account real business constraints, such as budget constraints and return on investment (ROI) constraints. In the dissertation, we present two algorithms respectively for the Stochastic Multi-Armed Bandit framework with budget constraints and Stochastic Multi-Armed Bandit with ROI constraints. We analyze the theoretical performance of the algorithms by showing sublinear bounds for the regret. A necessary condition to find sublinear bounds is to allow a violation of the constraints up to a selected threshold. Thus, we also investigate the relationship between the expected number of intolerable violations of the constraints and the threshold of intolerability, which is an aspect that results unexplored at a theoretical level. Finally, we will extend the algorithms, and their analysis, to the combinatorial context with Semi-Bandit feedback. This allows the use of these algorithms in the more realistic scenario in which advertisers' campaigns are composed by several sub-campaigns diversified, for example, by target, channel or context.

Nel corso dell'ultimo decennio il volume annuale degli investimenti in campagne pubblicitarie online è quintuplicato. Nel solo mercato USA durante il 2020 si stima che tale volume superi i 135 miliardi di dollari, attestandosi come principale mezzo pubblicitario. Il fondamentale vantaggio di questo mezzo è la capacità di controllo che gli inserzionisti hanno in termini sia di personalizzazione delle campagne sia di valutazione sull'impatto che esse sortiscono. La mole di dati sulle performance e l'ampio ventaglio di tipologie di annunci rendono l'ottimizzazione di queste campagne non percorribile manualmente. Lo sviluppo di algoritmi per problemi di ottimizzazione in processi decisionali iterati è ben noto nel campo del Reinforcement Learning. In questo lavoro ci focalizziamo sul framework Stochastic Bandit con l'obiettivo di proporre soluzioni che tengano in conto vincoli di business reali, quali vincoli di budget e vincoli di ritorno sull'investimento (ROI). Nell'elaborato presentiamo due algoritmi rispettivamente per il framework Stochastic Multi-Armed Bandit con vincoli di budget e Stochastic Multi-Armed Bandit con vincoli di ROI. Analizziamo le prestazioni teoriche degli algoritmi mostrando che ammettono bound sublineari per il regret. Condizione necessaria per ottenere i suddetti bound è ammettere una violazione dei vincoli entro una certa soglia di tollerabilità. Dunque, indaghiamo il rapporto tra numero atteso di violazioni intollerabili dei vincoli e tale soglia di intollerabilità, aspetto ancora inesplorato a livello teorico. Estendiamo infine gli algoritmi, e l'analisi degli stessi, al contesto combinatorio con feedback Semi-Bandit. Questo permette l'uso di tali algoritmi nello scenario, più realistico, in cui le campagne degli inserzionisti si compongano di più sotto-campagne, diversificate ad esempio per canale, target o contesto.