Theoretical analysis of hierarchical reinforcement learning and its application for autonomous mission planning
Drappo, Gianluca
2024/2025
Abstract
Hierarchical Reinforcement Learning (HRL) approaches have demonstrated success in solving a wide range of complex, structured, and long-horizon problems. However, a complete theoretical understanding of this empirical success is still lacking. In the context of the option framework, prior research has developed efficient algorithms for scenarios where the options are fixed and only the high-level policy that selects among these options needs to be learned. Surprisingly, the more realistic scenario where both the high-level and low-level policies are learned has been largely overlooked from a theoretical perspective. This dissertation takes a step toward addressing this gap. Focusing on the finite-horizon setting, we present three provably efficient algorithms to advance the theoretical understanding of this scenario. We examine a specific family of two-level hierarchies, formalising the high-level problem as a Semi-Markov Decision Process (SMDP), while the low-level problem involves learning in a set of finite-horizon MDPs, structured according to the chosen hierarchical definition. In the first two approaches, the hierarchical structure is defined by a set of options, characterised by their initiation sets and termination conditions. The third approach considers goal-based MDPs with sparse rewards, where the structure is predefined and subdivides the original MDP into a high-level MDP and a set of low-level MDPs, each associated with specific sub-goals. We first propose a method inspired by the Explore-Then-Commit approach from the bandit literature: it learns the options' policies first and then exploits them to learn the high-level policy that selects among them. The second algorithm addresses the simultaneous learning of both levels, mitigating the inherent non-stationarity by continuously alternating between two phases: in each phase, one level learns while the other is kept fixed. Finally, the third method leverages a low-level regret minimiser in each episode to learn the low-level policies for the sub-goals assigned by the high-level policy; the high-level policy plans over the high-level MDP, using the uncertainty propagated by the value functions of the low-level policies. The derived regret bounds are compared with the lower bounds for non-hierarchical finite-horizon problems, allowing us to identify problem structures where hierarchical approaches are provably preferable to standard methods that ignore the hierarchical structure, even when no pre-trained low-level policies are available. Lastly, we compare a customised Deep Reinforcement Learning approach with a hierarchical one on a practical, realistic problem: mission planning for a team of autonomous aircraft. We demonstrate how, consistent with the theoretical findings, the hierarchical approach, which exploits structural information, scales effectively with the complexity of the scenario.
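To make the two training schedules described above concrete, the sketch below contrasts them at a purely schematic level. It is not the dissertation's algorithms: `LevelLearner`, `explore_then_commit`, and `alternating_phases` are hypothetical names introduced here only for illustration, with all learning details abstracted away.

```python
# Minimal schematic sketch (hypothetical names, not the dissertation's algorithms):
# it only contrasts the two training schedules described in the abstract.

class LevelLearner:
    """Hypothetical stand-in for a low-level or high-level policy learner."""
    def __init__(self, name: str):
        self.name = name

    def update(self, episode: int) -> None:
        # A real learner would update its policies from the data collected
        # in this episode; here the call is a placeholder.
        pass


def explore_then_commit(low: LevelLearner, high: LevelLearner,
                        n_low_episodes: int, n_high_episodes: int) -> None:
    # Phase 1: learn the options' (low-level) policies first ...
    for ep in range(n_low_episodes):
        low.update(ep)
    # Phase 2: ... then commit to them and learn the high-level policy
    # that selects among the learned options.
    for ep in range(n_high_episodes):
        high.update(ep)


def alternating_phases(low: LevelLearner, high: LevelLearner,
                       n_rounds: int, phase_length: int) -> None:
    # Both levels are learned, but only one per phase: keeping the other
    # level fixed mitigates the non-stationarity each level induces on the other.
    for r in range(n_rounds):
        learner = low if r % 2 == 0 else high
        for ep in range(phase_length):
            learner.update(ep)


if __name__ == "__main__":
    low, high = LevelLearner("low-level"), LevelLearner("high-level")
    explore_then_commit(low, high, n_low_episodes=100, n_high_episodes=100)
    alternating_phases(low, high, n_rounds=10, phase_length=20)
```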
| File | Description | Size | Format |
|---|---|---|---|
| thesis.pdf | accessible online only to authorised users | 16.96 MB | Adobe PDF |
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/241258