Gradient-based approach to inverse reinforcement learning by observing a not-expert demonstrator
Drappo, Gianluca
2019/2020
Abstract
There are situations in which the reward function, that is, the function that motivates a certain behaviour, is difficult to define: either there are too many factors to take into consideration, or the task is complicated to describe formally. Inverse Reinforcement Learning addresses the problem of recovering this unknown reward function, which characterises a task, by observing another agent performing that task. However, the agent considered by the methods in this class is usually an expert, i.e., one that has completed her learning process and can perform the assignment optimally. As a consequence, the observer, the agent perceiving the demonstrations, never sees the consequences of the choices that are not considered optimal; when she later tries to learn the task herself, this missing knowledge may lead her into critical situations that are harmful to the learning process. In this thesis we propose an Inverse Reinforcement Learning method named Learning Observing a Gradient not-Expert Learner (LOGEL), which assumes that the other agent is observed while she is still learning, hence she is clearly not an expert. The observer exploits a set of demonstrations produced by the learner, first to infer the policy the learner is following and then to estimate the reward function parameters she needs in order to learn the task herself. The algorithm exhibits significant results in a discrete gridworld environment, retrieving a good reward function from learners that follow various Reinforcement Learning algorithms. It also shows good results when tested in two continuous environments of the MuJoCo control suite, Hopper and Reacher, consistently recovering a good reward function even though the problem is more complex.
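To make the two-step idea above concrete, the snippet below is a minimal sketch of the second step, under the assumption that the learner improves her policy by gradient-ascent updates on a return that is linear in a set of known reward features, and that the first step (inferring the policy from the demonstrations, e.g. by behavioural cloning) has already produced a sequence of policy parameter vectors. The function name, shapes, and the least-squares formulation are illustrative assumptions, not the implementation from the thesis.

```python
import numpy as np

def estimate_reward_weights(policy_params, feature_jacobians, learning_rates):
    """Illustrative sketch (not the thesis code): fit the reward weights
    that best explain the learner's successive policy improvements as
    gradient-ascent steps on a reward linear in known features.

    policy_params     : list of policy parameter vectors theta_0 ... theta_T
                        recovered from the learner's demonstrations
    feature_jacobians : list of Jacobians of the feature expectations
                        w.r.t. theta, one per improvement step
    learning_rates    : assumed step sizes alpha_t of the learner
    """
    A_blocks, b_blocks = [], []
    for t in range(len(policy_params) - 1):
        # Model the learner's update as
        #   theta_{t+1} ~= theta_t + alpha_t * J_t @ omega,
        # where J_t is the Jacobian of the feature expectations.
        A_blocks.append(learning_rates[t] * feature_jacobians[t])
        b_blocks.append(policy_params[t + 1] - policy_params[t])
    A = np.vstack(A_blocks)        # stacked (d_theta x d_features) blocks
    b = np.concatenate(b_blocks)   # stacked parameter differences
    # Least-squares estimate of the reward weights omega.
    omega, *_ = np.linalg.lstsq(A, b, rcond=None)
    return omega

# Example with made-up dimensions: 3 recovered policies, 5 policy
# parameters, 2 reward features.
rng = np.random.default_rng(0)
thetas = [rng.normal(size=5) for _ in range(3)]
jacs = [rng.normal(size=(5, 2)) for _ in range(2)]
omega_hat = estimate_reward_weights(thetas, jacs, learning_rates=[0.1, 0.1])
```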
https://hdl.handle.net/10589/170829