Bias sensitivity in Large Language Models: detecting confirmation bias through framing and semantic metrics

This thesis examines the presence of confirmation bias in Large Language Models (LLMs) by studying how their responses vary under different prompt framings. Although models such as GPT-4, Gemini, Phi-4, and Mistral excel at generating coherent and contextually appropriate text, they remain sensitive to the wording of prompts, which may lead them to reinforce biased assumptions. The research addresses the question of whether LLMs tend to confirm the implicit premises of biased (leading) questions, compared to neutral or contradictory formulations. To investigate this, a workflow was designed consisting of four stages: (i) constructing a dataset of stereotype-driven questions, each rephrased into leading, neutral, and contradictory forms; (ii) querying multiple LLMs to generate responses under consistent decoding settings; (iii) measuring semantic similarity between prompts and responses using GPT-based scoring and a cross-encoder SAS model; and (iv) computing framing-sensitivity metrics, including ∆LC (Leading–Contradictory), ∆LN (Leading–Neutral), and the Framing Sensitivity Index (FSI). The main contribution of this work is a replicable evaluation framework that combines semantic similarity with bias-sensitivity metrics to quantify confirmatory behavior in LLMs. The thesis also highlights ethical concerns, as biased framing can shape model outputs in subtle but impactful ways.

This thesis examines the presence of confirmation bias in Large Language Models (LLMs) by studying how their responses vary with different ways of formulating prompts. Although models such as GPT-4, Gemini, Phi-4, and Mistral excel at generating coherent and contextually appropriate text, they remain sensitive to how questions are worded, and thus risk reinforcing distorted or prejudiced assumptions. The research addresses the question of whether LLMs tend to confirm the implicit premises contained in one-sided (leading) questions, compared to neutral or contradictory versions. To investigate this, a four-stage workflow was designed: (i) construction of a dataset of stereotype-based questions, each reformulated in leading, neutral, and contradictory versions; (ii) querying multiple LLMs to generate responses under consistent decoding conditions; (iii) measuring the semantic similarity between questions and responses using GPT-based evaluation and a cross-encoder SAS model; and (iv) computing framing-sensitivity metrics, including ∆LC (Leading–Contradictory), ∆LN (Leading–Neutral), and the Framing Sensitivity Index (FSI). The main contribution of this work is the definition of a replicable evaluation framework that combines semantic similarity with bias-sensitivity metrics to quantify the confirmatory behavior of LLMs. The thesis also highlights ethical implications, since distorted framing can steer model responses in subtle but significant ways.
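
The difference metrics named in stage (iv) can be illustrated with a minimal Python sketch. It rests on assumptions not stated in the abstract: that ∆LC and ∆LN are plain differences of prompt-response similarity scores (leading minus contradictory, leading minus neutral), that scores lie on a common scale, and that per-question values are averaged. The names `FramingScores`, `delta_lc`, `delta_ln`, and `summarize` are hypothetical, and the numbers in the usage section are illustrative rather than results. The FSI is not computed because its formula is not given here.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Iterable


@dataclass(frozen=True)
class FramingScores:
    """Prompt-response similarity for one question under each framing.

    Hypothetical container: each field holds a similarity score
    (for example from a cross-encoder), assumed to share one scale.
    """
    leading: float
    neutral: float
    contradictory: float


def delta_lc(s: FramingScores) -> float:
    # Assumed definition: leading minus contradictory similarity.
    return s.leading - s.contradictory


def delta_ln(s: FramingScores) -> float:
    # Assumed definition: leading minus neutral similarity.
    return s.leading - s.neutral


def summarize(scores: Iterable[FramingScores]) -> Dict[str, float]:
    """Average the per-question differences over a set of questions."""
    items = list(scores)
    return {
        "delta_lc": mean(delta_lc(s) for s in items),
        "delta_ln": mean(delta_ln(s) for s in items),
    }


if __name__ == "__main__":
    # Illustrative scores only, not measurements from any model.
    example = [
        FramingScores(leading=0.75, neutral=0.5, contradictory=0.25),
        FramingScores(leading=0.5, neutral=0.5, contradictory=0.5),
    ]
    print(summarize(example))
```

Under these assumed definitions, a positive ∆LC would mean that responses track leading prompts more closely than contradictory ones, which is the pattern the thesis associates with confirmatory behavior.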