Broker system for multi-stage table discovery and joinability analysis
de ALMEIDA MIKI, GABRIEL
2024/2025
Abstract
This thesis addresses a pressing challenge in contemporary data management: accurate dataset discovery and reliable joinability assessment in large-scale, heterogeneous data lakes. It proposes a broker system that unifies and extends two state-of-the-art frameworks: D3L, for scalable, multi-evidence dataset retrieval, and FREYJA, for efficient, semantically aware join quality prediction. The work makes three main contributions. First, it consolidates the theoretical foundations of table discovery and joinability, offering an integrated taxonomy of existing approaches and clarifying the trade-offs among value-based, hash-based, and profile-based techniques. Second, it extends FREYJA to support numeric and temporal attributes, introducing nine new categories of profiling features that capture statistical, distributional, and semantic characteristics often neglected by text-centric methods. Third, it implements and evaluates both hybrid (D3L + FREYJA) and homogeneous (FREYJA + FREYJA) architectures on a meteorological data lake case study. Experimental results show that the homogeneous FREYJA + FREYJA configuration consistently outperforms the hybrid baseline, achieving substantial improvements in precision, recall, and F1-score, while its GPU-accelerated, containerized deployment also reduces memory usage and execution time. A proximity-informed ranking enhancement further improves the interpretability and relevance of the returned joins. These findings indicate that integrated, type-aware discovery pipelines can reconcile semantic accuracy with computational scalability, providing a robust foundation for future data integration and analysis systems.

| File | Description | Access | Size | Format |
|---|---|---|---|---|
| 2025_10_de_Almeida_Miki.pdf | Thesis text | Openly accessible online | 4.49 MB | Adobe PDF |
Documents in POLITesi are protected by copyright, and all rights are reserved unless otherwise indicated.
https://hdl.handle.net/10589/243893