L'Apprendimento per Rinforzo (RL) è un framework efficace per la risoluzione di problemi decisionali modellati come Processi Decisionali di Markov (MDPs). Tuttavia, per poter applicare questo paradigma a problemi di interesse pratico, risulta necessario sviluppare algoritmi che godano di una proprietà detta “robustezza”. Gli algoritmi di RL faticano infatti a garantire performance soddisfacenti in presenza di incertezze o disturbi nel sistema da controllare. Soluzioni a questo problema si sono ispirate ai principi della moderna teoria del controllo robusto. I metodi così originati, tuttavia, risultano essere eccessivamente conservativi e non eccellono in scenari che non siano di caso peggiore. Un framework alternativo a questo, è rappresentato dai Processi Decisionali di Markov Avversariali (AMDPs). Qui, il controllore, o l’agente, deve adattarsi per un dato numero di round a una sequenza di modelli di transizione e funzioni di costo scelti da un avversario. Questo paradigma, permette di ottenere algoritmi in grado di adattarsi a incertezze e disturbi arbitrari, e non solo di caso peggiore. Le performance ottenute dipenderanno dal grado di ostilità dell’avversario. In questa tesi, consideriamo una variante del modello AMDP in cui il modello di transizione scelto dall’avversario viene rivelato alla fine di ogni episodio. Proponiamo dunque il concetto di Smoothed MDP, caratterizzato da un modello di transizione che è una funzione di quelli osservati fino a quel momento. Coerentemente, definiamo il concetto di smoothed regret, e proponiamo l’algoritmo Smoothed Online Mirror Descent (SOMD), il quale, tramite un nuovo termine di regolarizzazione, apprende efficacemente in questo contesto. Per scelte specifiche della funzione che definisce il modello di transizione di uno Smoothed MDP, dimostriamo che, in un setting full feedback, siamo in grado di ottenere un regret che scala sub-linearmente con il numero di round e dipende linearmente da un parametro che descrive il grado di ostilità dell’avversario. Inoltre, assumendo un modello bandit feedback sulle funzioni di costo, il nostro algoritmo garantisce lo stesso livello di prestazioni grazie ad un semplice stimatore Importance Weighted.
Smoothed OMD: an Algorithm for No-regret Learning in Adversarial MDPs with Revealed Transitions
Corso, Federico
2023/2024
Abstract
Reinforcement Learning (RL) has proven to be an effective framework for learning in Markov Decision Processes (MDPs). As its application to complex and realistic environments grows, the need for robust solutions becomes increasingly important. Despite their success, many current RL algorithms struggle to handle uncertainties, disturbances, and structural changes in the environment effectively. Principles from modern robust control theory have been widely applied to obtain such “robustness” guarantees. However, these approaches often yield overly pessimistic controllers that fail to achieve satisfactory performance in scenarios that are not worst-case. Adversarial MDPs (AMDPs) offer a framework that provides favorable robustness guarantees while avoiding over-conservatism. In this framework, the controller, or agent, must adapt to a sequence of arbitrarily chosen transition models and loss functions over a given number of rounds. This model was introduced to overcome the limitations associated with the “static” nature of the standard MDP model and to adapt to arbitrary (not necessarily worst-case) adversaries. In this thesis, we consider a variant of the AMDP model in which the transition model chosen by the adversary is revealed at the end of each episode. We propose the notion of “Smoothed MDP”, whose transition model aggregates, through a generic function, the transition models experienced so far. Coherently, we define the concept of “smoothed regret”, and we devise the algorithm “Smoothed Online Mirror Descent” (SOMD), an enhanced version of Online Mirror Descent (OMD) that leverages a novel regularization term to learn effectively in this setting. For specific choices of the function defining the smoothed MDPs, we show that, under “full feedback”, we achieve a regret that scales sub-linearly with the number of rounds and depends linearly on a parameter describing the degree of maliciousness of the adversary. Furthermore, we extend our approach to the more challenging “bandit feedback” setting on the losses. Our algorithm guarantees the same level of performance as in the “full feedback” setting by using a simple importance-weighted estimator of the loss function.
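The abstract does not spell out the formal definitions, so the following is only a rough, hypothetical sketch of the objects it mentions, under assumed notation: P_1, …, P_t are the transition models revealed so far, ℓ_t the loss at round t, π_t the learner’s policy, and q_t(s, a) the probability that π_t visits the pair (s, a). The running-average aggregation and the exact form of the regret comparator are illustrative assumptions, not necessarily the choices made in the thesis.

```latex
% Hypothetical sketch only; the precise definitions are given in the thesis, not in this abstract.

% Smoothed transition model: a generic aggregation f of the transition models revealed so far,
% illustrated here with a running average (an assumption, one possible choice of f).
\[
  \bar{P}_t = f(P_1, \dots, P_t),
  \qquad \text{e.g.} \quad
  \bar{P}_t(s' \mid s, a) = \frac{1}{t} \sum_{i=1}^{t} P_i(s' \mid s, a).
\]

% Smoothed regret: cumulative loss of the learner's policies against the best fixed policy,
% both evaluated in the smoothed MDPs.
\[
  R_T^{\mathrm{smooth}}
  = \sum_{t=1}^{T} \ell_t(\pi_t; \bar{P}_t)
  - \min_{\pi} \sum_{t=1}^{T} \ell_t(\pi; \bar{P}_t).
\]

% Standard importance-weighted loss estimator under bandit feedback:
% observed losses are reweighted by the visitation probability q_t(s, a).
\[
  \hat{\ell}_t(s, a)
  = \frac{\ell_t(s, a)}{q_t(s, a)} \, \mathbf{1}\{ (s, a) \text{ visited in episode } t \}.
\]
```

Under full feedback such an estimator is not needed, since the whole loss function ℓ_t is observed at the end of the round; under bandit feedback the reweighting keeps the estimate unbiased for every state-action pair the policy can reach.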
File | Size | Format | Access
---|---|---|---
2024_07_Corso.pdf | 923.5 kB | Adobe PDF | accessible on the internet to everyone
2024_07_Corso_Executive Summary.pdf | 557.55 kB | Adobe PDF | accessible on the internet to everyone
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/223226