Sample complexity of inverse reinforcement learning in linear quadratic regulator
PESCE, LEONARDO
2023/2024
Abstract
In the context of control theory, Reinforcement Learning (RL) is a key framework for solving complex problems. In particular, the Linear Quadratic Regulator (LQR) environment is one of the most widely used benchmarks, thanks to its tractability and its applicability to real-world problems. Traditionally, solutions to this problem assume knowledge of both the environment dynamics and the cost function. RL, through sampling, lifts the first assumption and learns (implicitly or explicitly) an estimate of the model, while still requiring a reward function. Inverse Reinforcement Learning (IRL) circumvents this limitation and learns a reward function by observing an expert agent, opening the door to transfer learning and explainability. In recent works, the notion of a feasible reward set has been introduced to address the ill-posed nature of IRL: instead of committing to a single estimate, it recovers all the reward functions compatible with the expert's demonstrations, giving the IRL problem a clear and unique goal. This thesis focuses on applying IRL to the LQR setting, specifically on recovering the feasible reward set. We provide sample complexity bounds for the proposed algorithm under the Probably Approximately Correct (PAC) learning framework. Unlike previous approaches, which are limited to finite state-action spaces such as tabular MDPs, ours extends feasible-reward-set IRL to continuous state-action spaces, providing an algorithm that learns all the compatible rewards in an LQR setting.
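For reference, the forward problem underlying this setting is the standard discrete-time LQR. The sketch below uses one common textbook formulation and the usual symbols (A, B for the dynamics, Q, R for the quadratic cost); it is only meant to fix ideas and is not necessarily the exact notation adopted in the thesis.

```latex
% Linear dynamics: state x_t, control u_t, (possibly) additive noise w_t
x_{t+1} = A x_t + B u_t + w_t
% Quadratic cost minimized by the forward RL/control problem,
% with Q \succeq 0 and R \succ 0 weighting state and control
J_{Q,R}(\pi) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t}
  \left( x_t^{\top} Q\, x_t + u_t^{\top} R\, u_t \right) \right]
% Feasible-set IRL: given an expert policy \pi^{E}
% (e.g. a linear gain u_t = -K^{E} x_t),
% recover every cost pair (Q, R) under which the expert is optimal
\mathcal{R}^{E} = \left\{ (Q, R) \; : \;
  \pi^{E} \in \arg\min_{\pi} J_{Q,R}(\pi) \right\}
```

In the PAC sense, a sample-based estimate of this set is required to be accurate up to a tolerance ε with probability at least 1−δ, with the number of samples controlled as a function of ε and δ.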
| File | Description | Size | Format |
|---|---|---|---|
| Executive_Summary___Sample_complexity_of_IRL_in_LQR.pdf | executive summary (openly accessible online) | 682.21 kB | Adobe PDF |
| Article_Format_Thesis___Sample_complexity_of_IRL_in_LQR.pdf | thesis (openly accessible online) | 1.37 MB | Adobe PDF |
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/235592