Time series anomaly detection for CERN large-scale computing infrastructure

Anomaly Detection in the CERN Data Center is a challenging task because of the scale of the computing infrastructure and the volume of data to monitor. At CERN, the current solution for spotting anomalous servers in the computing infrastructure relies on threshold-based alarm systems that system managers carefully configure on the performance time series metrics of each infrastructure component. The goal of this work is to relieve the burden of this complex task by exploring fully automated machine learning solutions in the Anomaly Detection field. Moreover, in virtually every real industrial scenario, labeled data for training supervised machine learning methods are unavailable because they are costly or difficult to collect. We therefore focus on fully unsupervised Anomaly Detection methods and explore the current state of the art, including both traditional Anomaly Detection methods and recent, successful Deep Anomaly Detection approaches. We propose novel formulations of time-series-specific approaches (CNN Forecaster, VAR Forecaster) and adaptations that reuse traditional machine learning methods (LOF, OCSVM, IFOREST, KNN, PCA) and Deep Learning methods (Autoencoder Fully Connected, CNN Autoencoder, LSTM Autoencoder) with time series data. In addition, we explore six ensemble strategies that combine the strengths of the individual algorithms. We then present a comparative study of these 10 individual methods and 6 ensemble strategies on the CERN use case, in order to identify the approach best suited to the characteristics of the CERN large-scale computing infrastructure. Furthermore, given the absence of labeled data, we put in place an annotation system that made it possible to collect two new Anomaly Detection datasets for time series, representing two different CERN user categories.
The results of this study, in terms of ROC-AUC and training time, make a strong case for the traditional methods, which work extremely well on the specific problem at hand; on the other hand, we also observed that they tend to be outperformed by deep methods whenever the time series patterns of normal instances become less trivial. In parallel with the comparative study, we also produced an Open Source Proof-of-Concept Anomaly Detection system.
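
The abstract does not say how the traditional methods are adapted to time series, so the following is only an illustrative sketch of one common option: cut the series into fixed-length sliding windows, score each window by its mean distance to the k nearest windows seen during training (a KNN-style detector), and evaluate with ROC-AUC. The window width, the value of k, the synthetic sine data, the injected spike, and all function names are assumptions made for this example; none of them come from the study.

```python
import math

def sliding_windows(series, width):
    """Turn a 1-D series into overlapping fixed-length vectors."""
    return [series[i:i + width] for i in range(len(series) - width + 1)]

def knn_scores(train, test, k=3):
    """Score each test window by its mean Euclidean distance
    to the k nearest training windows (higher = more anomalous)."""
    scores = []
    for w in test:
        dists = sorted(math.dist(w, t) for t in train)
        scores.append(sum(dists[:k]) / k)
    return scores

def roc_auc(labels, scores):
    """Probability that a random anomalous window receives a higher
    score than a random normal one (ties count as one half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Synthetic data, invented for illustration only (not CERN metrics).
normal = [math.sin(0.3 * t) for t in range(200)]      # training series
test = [math.sin(0.3 * t) for t in range(200, 260)]   # test series
test[30] += 3.0                                       # injected spike

width = 8  # assumed window length
train_w = sliding_windows(normal, width)
test_w = sliding_windows(test, width)

# A test window is labeled anomalous if it contains the spike.
labels = [1 if i <= 30 < i + width else 0 for i in range(len(test_w))]

scores = knn_scores(train_w, test_w)
auc = roc_auc(labels, scores)
print(f"ROC-AUC on the synthetic example: {auc:.3f}")
```

The labels are used only to compute the metric; the detector itself is fitted on unlabeled windows, which matches the unsupervised setting described above.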
