AutoOPE: Automated Estimator Selection for Off-Policy Evaluation

This thesis presents a new approach to the Estimator Selection problem in Off-Policy Evaluation (OPE) for Contextual Bandits. Contextual Bandit problems, prevalent in recommender systems, information retrieval, and ad-placement systems, pose significant challenges for decision-making under uncertainty. A crucial issue is evaluating new policies without deploying them in the real world, especially where suboptimal policies can have negative consequences such as revenue loss or user dissatisfaction. The thesis introduces a novel method, AutoOPE, that departs significantly from the existing literature, in which Estimator Selection has been largely overlooked. AutoOPE uses a machine learning model to guide Estimator Selection in OPE. The model is trained with a meta-learning strategy on synthetic data, so that it learns to generalize to real-world datasets. The model is a Random Forest regressor that predicts an estimator's Mean Squared Error (MSE) on a given OPE task from the characteristics of the task and the properties of the estimator. The meta-learning takes place in the training of the regressor: the model is fitted on a dataset of tuples, each representing one OPE task and one estimator applied to it. Each tuple consists of input features extracted from a synthetically generated OPE dataset, drawn from a diverse range of datasets reflecting various OPE scenarios; input features describing the estimator, drawn from OPE estimators of different kinds; and a target variable, namely the performance of that estimator on the generated OPE dataset. The trained model then guides the selection of the most appropriate estimator for a given real-world OPE task by automatically computing the features needed for that task and predicting, zero-shot, the MSE of each estimator. The best estimator is the one with the lowest predicted MSE.
Experimental evaluations of AutoOPE on real-world datasets such as CIFAR-10 and the Open Bandit Dataset demonstrate its effectiveness. The results show that AutoOPE is more accurate and more robust than the existing baseline method, PAS-IF. AutoOPE is also more computationally efficient, which makes it a viable solution for practical applications. This thesis contributes significantly to the field of Contextual Bandits and OPE by providing a reliable, efficient, and adaptable solution for Estimator Selection. The findings address a crucial gap in the literature and open new pathways for future research and applications in Off-Policy Evaluation.
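
The zero-shot selection step described in this abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the function select_estimator, the feature layout, the estimator names, and the stand-in predictor toy_predict_mse are all hypothetical and chosen only for illustration. In practice the predictor would be the predict method of a regressor trained on the synthetic meta-dataset, such as a Random Forest.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical feature layout: task features followed by estimator features.
Features = List[float]


def select_estimator(
    predict_mse: Callable[[Sequence[Features]], Sequence[float]],
    task_features: Features,
    estimator_features: Dict[str, Features],
) -> str:
    """Return the name of the estimator with the lowest predicted MSE."""
    names = list(estimator_features)
    # Build one input row per candidate: (task features, estimator features).
    rows = [task_features + estimator_features[name] for name in names]
    predicted = predict_mse(rows)
    # The best estimator is the one whose predicted MSE is smallest.
    best_index = min(range(len(names)), key=lambda i: predicted[i])
    return names[best_index]


def toy_predict_mse(rows: Sequence[Features]) -> List[float]:
    """Stand-in for a trained regressor (assumption, not a real model).

    A fixed linear score is used here only so the example runs without
    any training data or third-party library.
    """
    weights = [0.1, 0.01, 0.30, 0.20, 0.25]
    return [sum(w * x for w, x in zip(weights, row)) for row in rows]


# Hypothetical task descriptors and one-hot estimator descriptors.
task = [0.5, 10.0]
candidates = {
    "estimator_a": [1.0, 0.0, 0.0],
    "estimator_b": [0.0, 1.0, 0.0],
    "estimator_c": [0.0, 0.0, 1.0],
}

best = select_estimator(toy_predict_mse, task, candidates)
print(best)
```

Passing the predictor as a callable keeps the selection logic independent of the model: any regressor that maps a batch of feature rows to one predicted MSE per row could be substituted for the toy function.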

This thesis presents an innovative approach to the problem of estimator selection for Off-Policy Evaluation (OPE) in contextual bandit settings. Contextual bandit problems, prevalent in recommender systems and advertising systems, pose significant challenges for decision-making under uncertainty. An important problem in these settings is accurately evaluating new policies without having to deploy them in the real world. This task becomes essential when deploying suboptimal policies can have very negative consequences, such as loss of revenue or user dissatisfaction. This thesis introduces a new method, called Automated Off-Policy Evaluation (AutoOPE), that departs significantly from the existing literature, in which estimator selection has been largely overlooked. AutoOPE uses a machine learning model to guide estimator selection in OPE problems. To train the model, a meta-learning strategy based on synthetic data was used. From these data the model learns how to generalize to real datasets. The model is a Random Forest that aims to estimate the mean squared error (MSE) of an estimator, based on the characteristics of the task to which it is applied and on the properties of the estimator in question. The meta-learning consists of the supervised training of the model on a dataset whose inputs are features extracted from various synthetically generated OPE datasets, reflecting various possible scenarios, and from various estimators of different types. The target variables are the performance of each type of estimator considered, in terms of MSE.
Given a specific OPE task, the model then guides the selection of the most appropriate estimator by automatically computing the features for that task and predicting the MSE of each estimator. The best estimator is identified as the one whose predicted performance dominates that of all the others. Experimental evaluations of AutoOPE on real datasets, such as CIFAR-10 and the Open Bandit Dataset, demonstrate its effectiveness. The results show that AutoOPE outperforms the only existing baseline method, PAS-IF, in terms of accuracy and robustness. In addition, AutoOPE is more computationally efficient, which makes it a very attractive solution for practical applications. This thesis contributes significantly to the field of OPE in contextual bandit problems by providing a promising, efficient, and adaptable solution for estimator selection. The promising results open new avenues for future research and applications in the field of OPE.