Active preference learning for RLHF: a Bayesian optimization perspective
CAPRETTI, VALERIA
2024/2025
Abstract
Learning from human preferences is a cornerstone of aligning machine learning models with subjective human judgments. However, collecting such preference data remains costly and time-consuming, motivating the need for more efficient learning paradigms. Two established approaches offer complementary advantages: Reinforcement Learning from Human Feedback (RLHF) scales effectively to high-dimensional tasks such as large language model (LLM) fine-tuning, while Preferential Bayesian Optimization (PBO) achieves superior sample efficiency through active and uncertainty-driven querying. This thesis proposes a hybrid framework that unifies the scalability of RLHF with the query efficiency of PBO by integrating an acquisition-driven selection module into the RLHF pipeline. At its core, the method employs Laplace-based uncertainty estimation applied to the reward model, providing a principled measure of model confidence that guides the active selection of preference queries. This integration enables a more data-efficient and adaptive human feedback loop, focusing supervision on the most informative comparisons while preserving the scalability of neural architectures. The proposed Bayesian RLHF framework is validated across two representative domains: (i) high-dimensional preference optimization, where traditional PBO methods fail due to high computational complexity and poor scalability, and (ii) LLM fine-tuning, where annotation budgets are inherently limited. Experimental results demonstrate consistent improvements in both sample efficiency and overall performance across these tasks. These findings confirm that Bayesian RLHF provides a scalable and uncertainty-aware foundation for efficient preference learning, bridging the gap between active Bayesian optimization and large-scale human feedback alignment.
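The thesis PDF itself is not accessible from this record, so the following is only a minimal, hypothetical sketch of the kind of pipeline the abstract describes: a Bradley-Terry reward model with a last-layer Laplace approximation, whose posterior variance over reward differences is used as an acquisition score to pick the next preference query. All names (`RewardModel`, `last_layer_laplace`, `acquisition_scores`), the toy data, and the specific variance-based acquisition rule are illustrative assumptions, not the thesis implementation.

```python
# Illustrative sketch (not the thesis code): Laplace-based uncertainty on a
# Bradley-Terry reward model, used to rank candidate preference pairs.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Small MLP reward model r(x); only the last linear layer gets the Laplace treatment."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh())
        self.head = nn.Linear(hidden, 1, bias=False)

    def forward(self, x):
        return self.head(self.features(x)).squeeze(-1)

def train_bradley_terry(model, x_w, x_l, epochs=200, lr=1e-2):
    """Fit the reward model on preferred (x_w) vs. rejected (x_l) pairs with the
    Bradley-Terry log-likelihood: log sigmoid(r(x_w) - r(x_l))."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = -torch.nn.functional.logsigmoid(model(x_w) - model(x_l)).mean()
        loss.backward()
        opt.step()

def last_layer_laplace(model, x_w, x_l, prior_prec=1.0):
    """Gaussian (Laplace) posterior over the last-layer weights: covariance is the
    inverse of the Bradley-Terry GGN/Fisher plus an isotropic prior precision."""
    with torch.no_grad():
        phi = model.features(x_w) - model.features(x_l)   # pairwise feature differences
        p = torch.sigmoid(model(x_w) - model(x_l))         # predicted preference prob.
        lam = p * (1 - p)                                   # per-pair curvature terms
        H = (phi * lam.unsqueeze(-1)).T @ phi + prior_prec * torch.eye(phi.shape[1])
    return torch.linalg.inv(H)                              # posterior covariance

def acquisition_scores(model, cov, cand_a, cand_b):
    """Score candidate pairs by the posterior variance of the reward difference
    r(a) - r(b); higher variance = more informative query (one simple choice)."""
    with torch.no_grad():
        phi = model.features(cand_a) - model.features(cand_b)
        return ((phi @ cov) * phi).sum(-1)

# Toy usage: pick the most uncertain pair among random candidates.
torch.manual_seed(0)
dim = 8
model = RewardModel(dim)
x_w, x_l = torch.randn(32, dim), torch.randn(32, dim)        # labelled preference pairs
train_bradley_terry(model, x_w, x_l)
cov = last_layer_laplace(model, x_w, x_l)
cand_a, cand_b = torch.randn(100, dim), torch.randn(100, dim)  # unlabelled candidates
best = acquisition_scores(model, cov, cand_a, cand_b).argmax()
print("next query: candidate pair", int(best))
```

The selected pair would then be sent to a human annotator and its label folded back into the reward-model training set, closing the active feedback loop the abstract refers to.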
| File | Description | Size | Format |
|---|---|---|---|
| Thesis_Active_Preference_Learning_for_RLHF__A_Bayesian_Optimization_Perspective.pdf (not accessible) | Thesis | 3.77 MB | Adobe PDF |
| Executive_Summary___Active_Preference_Learning_for_RLHF__A_Bayesian_Optimization_Perspective.pdf (not accessible) | Executive summary | 1.19 MB | Adobe PDF |
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/246833