Exploiting sub-optimal expert behaviors in inverse reinforcement learning

Inverse Reinforcement Learning (IRL) addresses the problem of inferring a reward function from an expert's behavior to explain its underlying preferences. The primary challenge of IRL is its ill-posed nature, as a single expert behavior can be explained by an infinite number of reward functions. However, real-world problems often involve demonstrations from multiple experts with varying skill levels and thus, could provide a greater insight into this problem. For this reason, this thesis extends the framework of IRL to consider the behaviors of multiple and sub-optimal agents. Our approach consists of studying the feasible reward set, i.e. the set of reward functions that are compatible with the demonstrations, and how sub-optimal experts impact its structure. Our findings suggest that considering sub-optimal agents can significantly reduce the amount of ambiguity that is inherent to IRL. Furthermore, we provide a complexity analysis within the sub-optimal expert IRL context. We establish a lower bound on the samples required to accurately approximate the feasible reward set and analyze the statistical guarantee of a uniform sampling algorithm, which is found to be minimax optimal when there is a greater reduction in ambiguity. Finally, we validate our theoretical propositions through an experiment to demonstrate that our approach provides a fundamental advantage in dealing with IRL ambiguity.

L'Inverse Reinforcement Learning (IRL) si occupa di dedurre una funzione di ricompensa osservando il comportamento di un esperto, al fine di comprendere le sue preferenze. La principale sfida dell'IRL deriva dal suo essere un problema intrinsecamente mal definito: il comportamento di un unico esperto può corrispondere a un'infinità di possibili funzioni di ricompensa. Nella pratica, tuttavia, spesso si interagisce con più esperti di diverso livello, offrendo così una visione più ampia del problema. In questo contesto, la presente tesi amplia il framework dell'IRL includendo l'analisi dei comportamenti di diversi agenti sub-ottimali. Ci concentriamo sullo studio del feasible reward set, ossia l'insieme delle funzioni di ricompensa che sono compatibili con le osservazioni. Analizzando l'impatto degli esperti sub-ottimali su questa struttura, i risultati indicano che l'incorporazione di tali agenti può diminuire significativamente l'ambiguità intrinseca all'IRL. Inoltre, viene condotta un'analisi dettagliata della complessità nel contesto dell'IRL con esperti sub-ottimali. Stabiliamo un limite inferiore sul numero di campioni necessari per approssimare accuratamente il feasible reward set e valutiamo le prestazioni di un algoritmo di campionamento uniforme. Tale algoritmo si è dimostrato ottimale in scenari dove l'ambiguità è maggiormente ridotta. Infine, le nostre ipotesi teoriche vengono validate attraverso un esperimento, dimostrando che il nostro approccio offre un significativo vantaggio nel gestire l'ambiguità dell'IRL.