Online learning in stochastic and adversarial constrained Markov decision processes
CHIEFARI, ELEONORA FIDELIA
2024/2025
Abstract
Modern decision-making increasingly requires adapting to uncertain, dynamic environments, where the underlying conditions are not known in advance and may evolve unpredictably over time. In addition to the inherent difficulty of learning in such settings, decision-makers often face the further challenge of respecting long-term constraints, such as safety requirements and budget or resource limitations. Even more challenging is the scenario in which such constraints are unpredictable, are not drawn from a fixed distribution, and are potentially chosen by an adversary trying to penalize the learner. It is therefore crucial to develop techniques that can simultaneously handle both the stochastic case, i.e., when the constraints are sampled from fixed but unknown distributions, and the adversarial case, i.e., when the constraints may change arbitrarily over time. In this thesis, we study the framework of online episodic Constrained Markov Decision Processes (CMDPs) under both stochastic and adversarial constraints. Our goal is to build an algorithm that keeps the constraint violation low during the learning process while maximizing reward. In the stochastic regime, our method achieves sublinear regret and sublinear constraint violation without requiring the existence of a strictly feasible solution (Slater's condition). In addition, we provide guarantees in terms of positive constraint violation, a stronger metric that ignores negative contributions and thus avoids cancellations among terms. In the adversarial regime, our algorithm ensures sublinear constraint violation without Slater's condition and achieves sublinear alpha-regret, where alpha is a suitably defined approximation factor, with respect to the unconstrained optimum. We further validate our results through synthetic experiments, showing the practical effectiveness of our approach and comparing its performance with baselines from the literature.
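For concreteness, the performance measures named in the abstract can be written out as follows. This is a sketch of the standard definitions from the online CMDP literature, not notation taken from the thesis itself: we assume T episodes, a policy \pi_t played in episode t, reward functions r_t, constraint-cost functions g_t with feasibility meaning g_t(\pi) \le 0, a best feasible policy in hindsight \pi^*, and an unconstrained optimum \pi^\circ.

```latex
% Assumed notation (not fixed by the abstract): T episodes, policy \pi_t in
% episode t, rewards r_t, constraint costs g_t with feasibility g_t(\pi) \le 0,
% and [x]_+ := \max(x, 0).

% Regret against the best feasible policy in hindsight \pi^*:
R_T = \sum_{t=1}^{T} r_t(\pi^*) - \sum_{t=1}^{T} r_t(\pi_t)

% Cumulative constraint violation (negative terms may cancel positive ones):
V_T = \sum_{t=1}^{T} g_t(\pi_t)

% Positive constraint violation (no cancellations across episodes):
V_T^{+} = \sum_{t=1}^{T} \bigl[ g_t(\pi_t) \bigr]_+

% alpha-regret against the unconstrained optimum \pi^\circ (adversarial regime):
R_T^{\alpha} = \alpha \sum_{t=1}^{T} r_t(\pi^\circ) - \sum_{t=1}^{T} r_t(\pi_t)

% Slater's condition (strict feasibility), which the guarantees do NOT require:
\exists\, \bar{\pi},\ \rho > 0 \ : \ g_t(\bar{\pi}) \le -\rho \quad \forall t
```

Since [x]_+ \ge x, one always has V_T^{+} \ge V_T: under the positive-violation metric, overshooting the constraint in some episodes cannot be repaid by being overly conservative in others, which is exactly the cancellation effect the abstract refers to.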
| File | Description | Size | Format | Access |
|---|---|---|---|---|
| 2025_10_Chiefari_Thesis.pdf | Thesis | 3.69 MB | Adobe PDF | Openly accessible on the internet |
| 2025_10_Chiefari_Executive_Summary.pdf | Executive Summary | 966.24 kB | Adobe PDF | Openly accessible on the internet |
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/243127