Lightweight pipelines: good enough is sometimes better
OGGIONI, SIMONE
2024/2025
Abstract
The amount of data collected by companies across virtually every productive sector has grown exponentially in recent years, at a rate that organizations struggle to manage. The primary cause of this data explosion is ongoing digitalization: while it has improved the quality of the essential and non-essential services we rely on every day, it has also introduced serious challenges in terms of environmental impact and computational cost. Data pre-processing and cleaning tasks are often overlooked, or ignored altogether, in software sustainability assessments, even though they require substantial computational resources and contribute significantly to global energy consumption and carbon emissions. Recognizing that these problems are becoming more prevalent and will be harder to resolve the longer they are left unaddressed, the scientific community has begun to propose approaches and recommendations for improving the sustainability of data processing workflows, with greater emphasis on the impact of each operation. This work is situated in that context and focuses on DIANA, a data-centric framework designed to automatically suggest efficient data processing pipelines based on the characteristics of the input data. DIANA lets the user achieve competitive results by selecting suitable combinations of cleaning techniques, sparing them the effort of testing many options and building pipelines manually. The primary objective of this work is to evaluate the trade-offs between model performance and the environmental cost of data processing operations.
Specifically, this work verifies whether the pipeline recommendations generated by DIANA, originally optimized for predictive performance, are also effective from an environmental sustainability perspective, and it explores potential enhancements to the system's recommendations as a further step forward in this line of research.

| File | Description | Size | Format | Access |
|---|---|---|---|---|
| 2026_03_Oggioni_Executive Summary.pdf | Executive Summary | 647.98 kB | Adobe PDF | Not accessible |
| 2026_03_Oggioni_Tesi.pdf | Thesis text | 6.25 MB | Adobe PDF | Not accessible |
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/251628