Multi-dimensional Reward Learning from Preference Feedback

Reinforcement Learning (RL) is a powerful framework for solving sequential decision-making problems by maximizing the cumulative sum of a suitable scalar reward signal. The choice of reward is fundamental to obtaining good performance on the task; however, defining such a reward is often challenging, since it requires substantial task-specific knowledge and often has to account for several underlying objectives. Several approaches try to address this difficulty. Preference-based Reinforcement Learning (PbRL) algorithms learn to solve the task by asking a human expert for preferences over examples of solutions to the task (called trajectories) instead of requiring a reward; usually, this is done by learning a surrogate reward (called a reward model) from comparison data and then applying standard RL techniques. An additional challenge arises when the agent has to handle multiple, often conflicting, objectives. In the literature, Multi-Objective Reinforcement Learning (MORL) enables the specification of multi-dimensional rewards, making it possible to effectively learn several policies for different combinations of the objectives. When the multi-objective problem is translated to PbRL, the expert may be unable to provide an explicit preference, since the trajectories may be incomparable, meaning that neither is better than the other w.r.t. all the objectives. This possibility enables learning a reward model that encodes the various dimensions of the task, ultimately obtaining a MORL problem instance. In this work, we present the novel setting in which an algorithm learns a multi-dimensional reward model from preference data. We formalize this setting, define the desiderata that a preference model needs to satisfy, and present an important negative result concerning the optimization procedure of such models.
We then define three models that extend the Bradley-Terry (BT) model (a probabilistic model adopted ubiquitously in standard reward modeling) and that comply, in part or completely (accounting for optimization guarantees), with the aforementioned desiderata. Finally, we validate these preference models through various experiments, focusing on their ability to learn from small amounts of data and from non-optimal experts.
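
As a rough illustration of the two ingredients named above, the following minimal sketch shows the standard scalar Bradley-Terry preference probability and a simple check for incomparable trajectories under multi-dimensional returns. It is not one of the three models defined in this work: the function names, the use of total trajectory returns as inputs, and the componentwise (Pareto-style) comparison are assumptions made only for illustration.

```python
import math
from typing import Sequence


def bt_preference_prob(return_a: float, return_b: float) -> float:
    """Standard Bradley-Terry probability that trajectory A is preferred to B.

    Assumes each trajectory is summarized by a single scalar return.
    """
    # Logistic function of the return difference.
    return 1.0 / (1.0 + math.exp(-(return_a - return_b)))


def compare_returns(returns_a: Sequence[float], returns_b: Sequence[float]) -> str:
    """Componentwise comparison of two multi-dimensional return vectors.

    Hypothetical helper: labels the pair as incomparable when neither
    trajectory is at least as good as the other on every objective.
    """
    a_at_least_b = all(x >= y for x, y in zip(returns_a, returns_b))
    b_at_least_a = all(y >= x for x, y in zip(returns_a, returns_b))
    if a_at_least_b and b_at_least_a:
        return "equal"
    if a_at_least_b:
        return "a_preferred"
    if b_at_least_a:
        return "b_preferred"
    return "incomparable"


if __name__ == "__main__":
    print(bt_preference_prob(2.0, 1.0))            # scalar case
    print(compare_returns((1.0, 0.0), (0.0, 1.0)))  # conflicting objectives
```

In the scalar case every pair of trajectories receives a preference probability, whereas with two or more objectives a pair such as (1, 0) and (0, 1) admits no preference in either direction, which is the situation the multi-dimensional setting has to model.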
