Evaluating the reliability of web-based information: a modular system for truth discovery
BETTI, LEONARDO
2024/2025
Abstract
The reliability of online information has become a critical concern due to the open and decentralized nature of the Web, where anyone can publish content regardless of its factual accuracy. This thesis addresses the challenge of distinguishing trustworthy data from conflicting or inconsistent sources by developing an automated system that integrates web scraping, semantic information extraction, and truth discovery algorithms. The methodology distinguishes between two complementary research modes. The first is a free-text search, where the system retrieves information from heterogeneous online sources and evaluates its consistency and trustworthiness. The second is a file-based validation, where users provide structured datasets and the system verifies their accuracy against external references. In both cases, retrieved content is processed through large language models and analyzed using iterative inference algorithms (TruthFinder and a custom graph-based method) to identify the most reliable values and assess the credibility of sources. An archive of evaluations is continuously updated, enabling long-term monitoring of trustworthiness across different domains. The system has been implemented as a modular web application that allows users to query, filter, and explore reliable sources and data. Thanks to its domain-agnostic design, the system can be applied to diverse contexts, supporting safer and more informed decision-making. Test scenarios, including domain-specific case studies, demonstrate the tool’s effectiveness in detecting inconsistencies, validating datasets, and ranking sources according to their reliability. Results confirm that the proposed approach improves access to credible information and mitigates the risks posed by conflicting online content.

| File | Access | Size | Format |
|---|---|---|---|
| 2025_10_Betti_01.pdf | Accessible online only to authorized users | 8.98 MB | Adobe PDF |
| 2025_10_Betti_Executive_Summary_02.pdf | Accessible online only to authorized users | 431.09 kB | Adobe PDF |
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/243775