End-to-end anomaly detection system in the CERN OpenStack Cloud infrastructure

The CERN OpenStack private Data Center offers Cloud resources, services and tools to a scientific community of about 3,000 CERN users. About 14,000 Virtual Machines are deployed in the infrastructure, covering several use cases, from web front-ends to databases and analytics platforms. Cloud service managers have to make sure that the desired computational power is delivered to all users, and to accomplish this task, spotting anomalous servers in time is crucial. The previously adopted solution consists of monitoring the performance metrics of the machines with a threshold-based alarm system. In this thesis, we present the new Anomaly Detection (AD) system that currently runs in the CERN Cloud Infrastructure. On these multivariate time series metrics, we run three different unsupervised Machine Learning models: Isolation Forest, LSTM-AutoEncoder and GRU-AutoEncoder. Then, using an ensemble approach, we automatically report to the CERN Cloud managers, every day, the most anomalous servers of the previous day. We show the related end-to-end pipeline going from the data sources to the detected anomalies, the details of the system architecture, the pre-processing steps implemented, and the design choices behind our solution. Furthermore, we present a new labelled evaluation dataset for the CERN Cloud case study, and the results, with respect to this dataset, of our experiments comparing the three models we use in the system. In particular, we show that the three adopted models, despite their very different nature, all achieve high performance in terms of AUC-ROC (AUC-ROC > 0.95), and that they all outperform the previous threshold-based system in terms of true positive rate at the false positive rate required by the Data Center's operators. In addition, we compare the time performance of the models, and we show that training is robust to the selection and size of the training data.
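
The ensemble step described above can be illustrated with a minimal sketch. The abstract does not state how the three models' outputs are combined, so the rank-averaging rule used here is an assumption chosen for illustration, not the system's documented method. The host names, the score values and the function names rank_normalize and ensemble_top_k are likewise hypothetical.

```python
from statistics import mean


def rank_normalize(scores):
    """Map raw anomaly scores to ranks in [0, 1]; higher means more anomalous.

    Ranks make scores from different models comparable even when their raw
    scales differ. Ties are broken by input order in this simplified sketch.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    denom = max(len(scores) - 1, 1)
    ranks = [0.0] * len(scores)
    for position, index in enumerate(order):
        ranks[index] = position / denom
    return ranks


def ensemble_top_k(hosts, scores_by_model, k):
    """Average the per-model normalized ranks and return the k highest hosts.

    ASSUMPTION: rank averaging stands in for the unspecified ensemble rule.
    """
    normalized = [rank_normalize(s) for s in scores_by_model.values()]
    combined = [mean(column) for column in zip(*normalized)]
    ranked = sorted(zip(hosts, combined), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical daily scores for four servers, one list per model.
hosts = ["vm-a", "vm-b", "vm-c", "vm-d"]
scores_by_model = {
    "isolation_forest": [0.10, 0.80, 0.30, 0.20],
    "lstm_autoencoder": [0.02, 0.90, 0.05, 0.40],
    "gru_autoencoder": [0.01, 0.70, 0.03, 0.50],
}
top = ensemble_top_k(hosts, scores_by_model, k=2)
print(top)
```

Averaging ranks rather than raw scores avoids letting one model dominate simply because its score scale is wider, for example reconstruction errors versus isolation depths.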

The OpenStack-based private Data Center at CERN offers Cloud resources, services and tools to a scientific community of about 3,000 CERN users. About 14,000 Virtual Machines run in the infrastructure, covering several use cases, from web front-end services to databases and analytics platforms. Cloud managers must ensure that the desired computational power is delivered to all users and, to accomplish this task, it is crucial to spot anomalous servers in time. The previously adopted solution consists of monitoring the performance metrics of the machines through an alarm system based on static thresholds. In this thesis, we present the new Anomaly Detection (AD) system currently running in the CERN Cloud Infrastructure. On these multivariate time series metrics, three different unsupervised Machine Learning models are run: Isolation Forest, LSTM-AutoEncoder and GRU-AutoEncoder. Using an ensemble approach, the system reports to the CERN Cloud managers, daily and automatically, the most anomalous servers of the previous day. We show the related end-to-end pipeline that runs from the source data to the detected anomalies, the details of the system architecture, the pre-processing steps implemented, and the design choices behind our solution. Furthermore, we present a new labelled evaluation dataset for the CERN Cloud case study and the results, with respect to this dataset, of our experiments comparing the three models we use in the system. In particular, we show that all the models used, despite their different nature, achieve high performance in terms of AUC-ROC (AUC-ROC > 0.95), and that, compared with the previous threshold-based system, they all achieve a higher True Positive Rate for a given False Positive Rate required by the Cloud operators. In addition, we compare the time performance of the models and show that training is robust with respect to the selection and size of the training data.
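
The two evaluation criteria mentioned above, AUC-ROC and the True Positive Rate at a required False Positive Rate, can be made concrete with a short sketch. It uses the standard definitions of both metrics on a small hypothetical set of labels and scores; the data, the function names auc_roc and tpr_at_fpr, and the FPR budgets are illustrative assumptions and do not reproduce the thesis's dataset or results.

```python
def auc_roc(labels, scores):
    """AUC-ROC as the probability that a random anomaly outscores a random
    normal sample (pairwise form; ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def tpr_at_fpr(labels, scores, max_fpr):
    """Best true positive rate over all thresholds whose false positive
    rate does not exceed max_fpr. A sample is flagged if score >= threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = 0.0
    for threshold in sorted(set(scores)):
        flagged = [s >= threshold for s in scores]
        tp = sum(1 for f, y in zip(flagged, labels) if f and y == 1)
        fp = sum(1 for f, y in zip(flagged, labels) if f and y == 0)
        if fp / n_neg <= max_fpr:
            best = max(best, tp / n_pos)
    return best


# Hypothetical labelled scores: 1 = anomalous server, 0 = normal server.
labels = [0, 0, 0, 0, 1, 1, 0, 1]
scores = [0.1, 0.2, 0.15, 0.6, 0.9, 0.7, 0.3, 0.5]

print("AUC-ROC:", auc_roc(labels, scores))
print("TPR at FPR <= 0.00:", tpr_at_fpr(labels, scores, 0.0))
print("TPR at FPR <= 0.25:", tpr_at_fpr(labels, scores, 0.25))
```

Fixing the False Positive Rate first reflects the operational constraint: operators can only review a limited number of false alarms, so detectors are compared on how many true anomalies they recover within that budget.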