Online Learning with Risk Averse Tree in Non-stationary Environments

The e-commerce market has been growing in the past decades and offers many opportunities for automatic systems, e.g., online learning techniques, that are able to optimize the pricing policy they follow. Before the introduction of such systems, the pricing strategies were controlled by experts in the field and their decisions and the feedbacks received by the users were stored in huge databases. These logs, in the form of logged bandit feedbacks, provide a source of information that classical online learning techniques are not able to handle. We designed a novel algorithm, namely Online Risk Averse Tree (ORAT), for incorporating the information provided by the logged bandit feedbacks and taking decisions for the online part of the process. The novelties of this algorithm reside in the ability of handling cases in which the decision space is continuous, by means of a Gaussian process modeling, and that it is able to work in non-stationary user behaviors. More specifically, ORAT builds a binary tree from the logged data using an algorithm called RADT and, if needed, it adapts the structure of the tree to the changes in the users distribution. A wide experimental campaign over synthetic instances of the problem has been performed, showing the improved performance of what has been proposed w.r.t. the methods existing in the scientific literature. After having introduced and described the problem setting, we present the algorithm in details. Then we first make a comparison between RADT and the extension that uses the Gaussian processes, and finally, we compare ORAT with a set of stationary policies in a non-stationary environment.

Il mercato dell'e-commerce è cresciuto molto nelle ultime decadi, con l'avvento di aziende come Amazon e Alibaba. Questo ambiente offre moltissime opportunità a sistemi automatici che apprendono, cioè che sono in grado di ottimizzare le politiche sui prezzi, offrendo il prezzo più consono a ogni richiesta. Prima dell'avvento di queste tecniche, la scelta dei prezzi era lasciata a esperti del settore, che tuttavia non erano in grado di proporre prezzi potenzialmente diversi ad ogni richiesta di ogni utente. L'utilizzo di sistemi automatici ha quindi portato una grande innovazione. Lo studio di questi sistemi di apprendimento è una componente cruciale nell'aumento del profitto di questo genere di aziende. Nonostante tutte le interazioni tra le piattafrome di e-commerce e gli utenti possano essere salvate all'interno di grandi database, tali informazioni non sono al momento sfruttate dagli algoritmi classici di apprendimento online. Abbiamo costruito un nuovo algoritmo, chiamato Online Risk Averse Tree (ORAT), capace di incorporare queste informazioni e successivamente prendere decisioni in base alle richieste degli utenti. Questo algoritmo, differentemente da quelli disponibili nella letteratura scientifica, è in grado di gestire casi in cui lo spazio delle decisioni possibili è continuo, attraverso l'uso dei processi Gaussiani, ed è in grado di agire nel caso in cui i comportamenti degli utenti siano non stazionari, ovvero possano cambiare nel tempo. In particolare, ORAT costruisce un albero binario a partire dai dati salvati inizialmente e, nel caso in cui risulti necessario, adatta la struttura di questo albero in base ai cambiamenti nei comportamenti degli utenti. Questo algoritmo è stato analizzato in molti scenari realistici generati sinteticamente, riuscendo ad ottenere risultati migliori rispetto a metodi presenti nella letteratura.