IcedHops: reducing read and write latency in an Iceberg-backed offline feature store

The growing need for efficient data management in Machine Learning (ML) workflows has led to the widespread adoption of feature stores, centralized data platforms that supports feature engineering, model training and prediction inference. The Hopsworks' feature store has demonstrated outperformance compared to its alternatives, leveraging Apache Hudi and Spark for offline data storage, but suffers from high write and read latency, even for small quantities of data (1GB or less). This thesis explores the potential of Apache Iceberg as an alternative table format to reduce latencies, developing "IcedHops", an integration of HopsFS (Hopsworks HDFS distribution) and PyIceberg Python library. The research begin with an evaluation of potential system integration alternatives, documenting the advantages and limitations of each approach. Then, IcedHops is implemented and evaluated, benchmarking it against the existing Spark-based solution and an alternative Delta Lake implementation (delta-rs). Extensive experiments were conducted across varying table sizes and CPU configurations to assess write and read performance. Results show that IcedHops significantly reduces write latency -- from 40 to 140 times lower than the legacy system -- and read latency -- from 55\% to 60 times lower than the legacy system. Compared to delta-rs, IcedHops demonstrates reduced write latency for large tables -- up to 7 times lower -- and equal read latency, but exhibits lower scaling benefits with additional CPU cores -- 20\% less than delta-rs. These findings confirm that alternatives to Spark-based pipelines in small-scale scenarios are possible and are worth of further investigations, and the system implemented will be included in the Hopsworks feature store. Furthermore, this thesis work and results finally provides a baseline for future work about additional open table formats, alternative languages to mitigate Python's performance overhead, and strategies to improve resource utilization in data management platforms.

La crescente necessità di piattaforme per una efficiente gestione dei dati per applicazioni di Machine Learning (ML) ha portato a un'ampia diffusione dei feature store, piattaforme dati centralizzate che supportano feature engineering, addestramento di modelli e inferenza. Il feature store di Hopsworks ha dimostrato prestazioni superiori rispetto alle sue alternative, sfruttando Apache Hudi e Spark per il suo offline feature store. Tuttavia, questo sistema soffre di un'elevata latenza di scrittura e lettura, anche per piccole quantità di dati (1GB o inferiori). Questa tesi esplora l'uso di Apache Iceberg come table format alternativo per ridurre tali latenze, sviluppando "IcedHops", una integrazione tra HopsFS (distribuzione Hopsworks di HDFS) e la libreria Python PyIceberg. Questa ricerca inizia con un'analisi delle possibili strategie di integrazione, documentando i vantaggi e i limiti di ciascun approccio. Successivamente, IcedHops viene sviluppato ed evaluato, confrontandolo con la soluzione esistente basata su Spark e con un'alternativa basata su Delta Lake (delta-rs). Esperimenti approfonditi sono stati condotti su tabelle di diverse dimensioni e diverse configurazioni CPU per misurare le prestazioni di scrittura e lettura. I risultati mostrano che IcedHops riduce significativamente la latenza di scrittura -- da 40 a 140 volte inferiore rispetto al sistema legacy -- e la latenza di lettura -- dal 55% a 60 volte inferiore. Rispetto a delta-rs, IcedHops offre una latenza di scrittura inferiore per le tabelle più grandi -- fino a 7 volte minore -- e prestazioni di lettura equivalenti, ma mostra minori vantaggi di scalabilità con l'aumento dei CPU cores. Questi risultati confermano che alternative ad architetture basate su Spark, per gestire dati su piccola scala, esistono e sono più efficienti, e dunque il sistema sviluppato verrà integrato nel feature store di Hopsworks. Inoltre, questa tesi fornisce una base per futuri studi su nuovi open table format e strategie da adottare per ottimizzare l'uso delle risorse nelle piattaforme di gestione dei dati.