Benchmarking commercial and ad-hoc RAG implementations in the energy domain
ANGELINI, MATTEO
2024/2025
Abstract
In recent years, Retrieval-Augmented Generation (RAG) has enabled large language models (LLMs) to integrate external knowledge sources, thereby improving factual accuracy and domain-specific adaptability. While managed platforms such as Azure OpenAI Studio, Google Vertex AI, and IBM Watsonx.ai have made commercial RAG systems increasingly accessible, evaluating their effectiveness requires multiple complementary methods, since no single metric is sufficient to capture all aspects of performance. This thesis presents a comparative evaluation of three commercial RAG systems, two generic LLMs, and a prototype RAG developed at Politecnico di Milano, using a dataset of 101 domain-specific questions in the field of energy efficiency. The quality of the generated responses is assessed through three distinct approaches: human judgment, cosine similarity between embeddings, and the LLM-as-a-Judge method. Overall, the study confirms RAG systems as a viable solution for mitigating hallucinations and integrating new knowledge without retraining, although the limited dataset constrains the generalizability of the results and calls for broader future investigations.
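As context for the second evaluation approach, the sketch below shows how cosine similarity between answer embeddings, cos(a, b) = (a · b) / (‖a‖ ‖b‖), can be computed. It is a minimal illustration with toy vectors, not the thesis's actual pipeline; the abstract does not specify which embedding model was used.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors:
    cos(a, b) = (a . b) / (||a|| * ||b||)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings standing in for real sentence embeddings of a
# system-generated answer and its gold reference answer.
generated = np.array([0.12, 0.85, 0.33])
reference = np.array([0.10, 0.80, 0.40])

print(f"cosine similarity: {cosine_similarity(generated, reference):.3f}")
```

A score close to 1 indicates that the generated answer is semantically close to the reference; in practice such scores are averaged over all 101 questions per system.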
| File | Description | Size | Format |
|---|---|---|---|
| 2025_10_Angelini.pdf | 2nd thesis version | 638.92 kB | Adobe PDF |

Openly accessible online from 29/09/2026.
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/243870