Automated machine learning performance monitoring

Automated Machine Learning makes the most advanced techniques available to non-expert users, without requiring them to select the right algorithm or set the hyperparameters manually. However, once a model is in production, the data and its context can change over time, degrading the model's predictive performance. In the literature this problem goes by the general term Data Drift. The goal of this thesis is to identify a tool that reveals when and how a change occurs in the data and in the relationship between the data and the variable to predict. Such a change is called Concept Drift in the literature. Our work focuses on the detection of Concept Drift, since only in this case does the change in the data degrade the performance of the model; we nevertheless refer to the problem by the general term Data Drift, of which Concept Drift is a special case. To this end, we divided our study into two main activities: a search for drift detection methodologies, and a systematic comparison of the identified methods using an approach that simulates as closely as possible a change in the data during production. In the first activity we concentrated on methods that do not actively interfere with the model they monitor and that are unsupervised, meaning that they do not need the true values of the predicted variable, since the goal is to use such methods in production and in real time. The second activity was necessary because no comparison exists of the identified methods used in the way we require. We therefore developed a testing framework that compares Concept Drift detection methods systematically, by artificially adding a Concept Drift to the data sets under examination and computing descriptive performance measures of each method's ability to identify changes in the data.
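The testing idea described above (inject a drift at a known point, run an unsupervised detector, measure how well it reacts) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the framework developed in the thesis: the drift is simplified to a mean shift in a single synthetic feature, the detector is a windowed two-sample Kolmogorov-Smirnov test against a reference sample, and the names `inject_shift`, `monitor`, and `evaluate` are hypothetical.

```python
import math
import random


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x, then compare the CDFs at x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_threshold(n, m, alpha):
    """Asymptotic critical value of the two-sample KS statistic at level alpha."""
    c = math.sqrt(-0.5 * math.log(alpha / 2))
    return c * math.sqrt((n + m) / (n * m))


def inject_shift(stream, drift_at, shift):
    """Hypothetical drift injection: add `shift` to every sample from index `drift_at` on."""
    return [x + shift if t >= drift_at else x for t, x in enumerate(stream)]


def monitor(reference, stream, window=100, alpha=1e-4):
    """Scan non-overlapping windows; return the index of the last sample of the
    first window flagged as drifted, or None if no window is flagged.
    No labels are used, so the detector is unsupervised."""
    limit = ks_threshold(len(reference), window, alpha)
    for start in range(0, len(stream) - window + 1, window):
        batch = stream[start:start + window]
        if ks_statistic(reference, batch) > limit:
            return start + window - 1
    return None


def evaluate(drift_at=1000, shift=1.5, seed=0):
    """Run one simulated experiment and report simple descriptive measures."""
    rng = random.Random(seed)
    reference = [rng.gauss(0, 1) for _ in range(500)]  # data seen before deployment
    clean = [rng.gauss(0, 1) for _ in range(2000)]     # simulated production stream
    stream = inject_shift(clean, drift_at, shift)
    alarm = monitor(reference, stream)
    false_alarm = alarm is not None and alarm < drift_at
    delay = None if alarm is None or false_alarm else alarm - drift_at
    return {"alarm_at": alarm, "false_alarm": false_alarm, "delay": delay}


if __name__ == "__main__":
    print(evaluate())
```

Because the drift position is known by construction, measures such as detection delay and false alarms can be computed directly; repeating `evaluate` over several seeds, shift sizes, and detectors would give the kind of systematic comparison the abstract describes. A shift in one feature is only a stand-in here, since a true Concept Drift changes the relationship between the features and the variable to predict.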
