In recent years, a growing number of applications, IoT sensors, and websites have produced endless streams of data. These streams are not only unbounded: their characteristics also change over time, a phenomenon called concept drift. Standard machine learning models do not work properly in this setting, and new techniques have been developed to address these challenges. The purpose of my thesis is twofold. On the one hand, it seeks the best approach to apply in an industrial context; on the other, it searches for an original methodology that improves on the state of the art under specific conditions. To this end, it studies the Kalman filter combined with Naïve Bayes for incremental learning and concept drift management. It also investigates when this new approach, which directly tracks the values of the data's attributes, outperforms the standard strategy, which monitors the model's performance in order to detect a drift. The proposed method is evaluated on both artificial and real datasets with concept drift, and its performance is compared with that of state-of-the-art approaches. Several models are examined to determine which performs best overall and which is most robust across different situations.
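The idea of following attribute values directly, rather than monitoring model error, can be illustrated with a minimal sketch. This is not the algorithm from the thesis: it assumes numeric attributes, a scalar Kalman filter with a random-walk state model for each (class, attribute) mean, and Gaussian likelihoods. The class names `ScalarKalman` and `KalmanNaiveBayes`, the methods `learn_one` and `predict_one`, and the noise parameters `q` and `r` are all hypothetical choices made for illustration.

```python
import math
from collections import defaultdict


class ScalarKalman:
    """1-D Kalman filter; the random-walk state model is an assumption."""

    def __init__(self, q=0.01, r=1.0):
        self.q = q        # process noise: how fast the mean is allowed to drift
        self.r = r        # measurement noise of a single observation
        self.x = None     # current estimate of the attribute mean
        self.p = 1.0      # variance of that estimate

    def update(self, z):
        if self.x is None:              # first observation initialises the state
            self.x, self.p = z, self.r
            return
        self.p += self.q                # predict step (random walk)
        k = self.p / (self.p + self.r)  # Kalman gain
        self.x += k * (z - self.x)      # correct the estimate with the innovation
        self.p *= 1.0 - k

    def log_likelihood(self, z):
        var = self.p + self.q + self.r  # predictive variance of the next observation
        return -0.5 * (math.log(2 * math.pi * var) + (z - self.x) ** 2 / var)


class KalmanNaiveBayes:
    """Incremental Gaussian naive Bayes whose means are tracked by Kalman filters."""

    def __init__(self, q=0.01, r=1.0):
        self.q, self.r = q, r
        self.filters = {}               # (label, attribute index) -> ScalarKalman
        self.counts = defaultdict(int)  # label -> number of examples seen

    def learn_one(self, x, y):
        self.counts[y] += 1
        for i, value in enumerate(x):
            key = (y, i)
            if key not in self.filters:
                self.filters[key] = ScalarKalman(self.q, self.r)
            self.filters[key].update(value)

    def predict_one(self, x):
        total = sum(self.counts.values())
        best, best_score = None, -math.inf
        for y, n in self.counts.items():
            # log prior plus the sum of per-attribute log likelihoods
            score = math.log(n / total)
            score += sum(self.filters[(y, i)].log_likelihood(v)
                         for i, v in enumerate(x))
            if score > best_score:
                best, best_score = y, score
        return best
```

Under these assumptions, a larger `q` lets the tracked means react faster to drift at the cost of noisier estimates, while a larger `r` smooths the estimates and slows adaptation. No explicit drift detector is involved, because the filters adjust continuously as new values arrive.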
Incremental naive Bayes with Kalman filtering for learning with concept drift
ZIFFER, GIACOMO
2019/2020
| File | Description | Availability | Size | Format |
|---|---|---|---|---|
| 2020_10_Ziffer.pdf | Thesis | Open Access since 10/09/2023 | 2.95 MB | Adobe PDF |
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/166774