Addressing sparsity in reinforcement learning with logic reward specification
Lasca, Fausto
2023/2024
Abstract
In recent years, task specification using Temporal Logic has become a prominent area of research in Reinforcement Learning (RL). These methods enable agents to learn objectives defined through logical specifications, offering strong theoretical guarantees and performing well in simple environments. However, they often struggle with more complex tasks due to the sparsity of the rewards they generate, making learning inefficient. This thesis directly addresses the challenge of reward sparsity to improve learning efficiency in this context. We propose a novel approach that combines two existing techniques: reward machines (RMs) and hindsight experience replay (HER). We explain their individual mechanisms, how they can be integrated, and the challenges that arise in doing so. Building on this foundation, we introduce two methods designed for off-policy RL algorithms to handle complex tasks more effectively. Our experiments in a continuous environment across various tasks demonstrate that the proposed methods can significantly improve performance in scenarios where conventional logic reward specification struggles.
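To make the combination described above concrete, the sketch below illustrates one way hindsight relabeling over reward-machine states could populate an off-policy replay buffer. This is an illustrative sketch only, not the thesis's published code: the class `RewardMachine`, the function `her_relabel_episode`, and the tuple layout are hypothetical assumptions made for this example.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Tuple


@dataclass
class RewardMachine:
    """Minimal reward machine: transitions are driven by sets of true propositions."""
    initial: int
    accepting: int
    # (rm_state, frozenset of true propositions) -> next rm_state
    delta: Dict[Tuple[int, FrozenSet[str]], int] = field(default_factory=dict)

    def step(self, u: int, labels: FrozenSet[str]) -> int:
        # Stay in the same RM state if no transition is defined for these labels.
        return self.delta.get((u, labels), u)


def her_relabel_episode(
    episode: List[Tuple[object, object, FrozenSet[str], object]],
    rm: RewardMachine,
    k: int = 4,
) -> List[Tuple[object, int, object, float, object, int, int]]:
    """HER-style relabeling where the goal is a target RM state (illustrative only).

    `episode` holds (state, action, labels, next_state) tuples from one rollout.
    The RM is replayed to recover the visited RM states; each transition is then
    stored once with the original accepting-state goal (sparse reward) and k more
    times with RM states actually reached later in the episode as hindsight goals.
    """
    # Recover the RM-state trajectory u_0, u_1, ..., u_T from the label sequence.
    us = [rm.initial]
    for (_, _, labels, _) in episode:
        us.append(rm.step(us[-1], labels))

    buffer = []
    for t, (s, a, _labels, s_next) in enumerate(episode):
        u, u_next = us[t], us[t + 1]
        # Original goal: the RM's accepting state.
        buffer.append((s, u, a, float(u_next == rm.accepting), s_next, u_next, rm.accepting))
        # Hindsight goals: RM states the agent actually reached later in this episode.
        future = [u2 for u2 in us[t + 1:] if u2 != u]
        for g in random.sample(future, min(k, len(future))):
            buffer.append((s, u, a, float(u_next == g), s_next, u_next, g))
    return buffer
```

In this sketch the stored tuples (s, u, a, r, s', u', g) could feed any goal-conditioned off-policy learner; the relabeled copies provide non-zero rewards even when a rollout never reaches the accepting RM state, which is the failure mode of sparse logic-specified rewards that the thesis targets.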
| File | Description | Size | Format | Access |
|---|---|---|---|---|
| 2025_04_Lasca_Thesis.pdf | Thesis | 3.2 MB | Adobe PDF | Open access |
| 2025_04_Lasca_Executive_Summary.pdf | Executive Summary | 1.17 MB | Adobe PDF | Open access |
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/235557