Unsupervised anomaly detection on multiprocess event time-series

Establishing whether observed data are anomalous is an important task that has been widely investigated in the literature, and it becomes even more complex when combined with high-dimensional representations and multiple sources independently generating the patterns to be analyzed. The work presented in this master's thesis employs a data-driven pipeline to define a recurrent auto-encoder architecture that analyzes, in an unsupervised fashion, high-dimensional event time-series generated by multiple, variable processes interacting with a system. The analysis of log files that record interaction events between users and the radio network infrastructure serves as a real-world case study for this problem. The proposed pipeline handles the complex representation of the data source as well as the definition and tuning of the anomaly detection model; it relies on no domain-specific knowledge and can thus be adapted to different problem settings. The model has been implemented in four variants, which have been evaluated on both normal and anomalous data, gathered partly from real network cells and partly from the simulation of anomalous behaviours. The results show the applicability of the model to the detection of anomalous sequences and events in the described setting, and a closer interpretation of them gives insight into the differences between the variants of the model and thus into their limitations and strengths.

Establishing whether observed data are anomalous is an important task that has been widely studied in the literature, and it becomes an even more complex problem when combined with high-dimensional data and a multitude of independent processes generating the patterns to be analyzed. The work presented in this thesis uses a data-driven pipeline to identify a recurrent auto-encoder for the unsupervised analysis of high-dimensional event series generated by multiple, variable processes interacting with a system. The analysis of log files that record interaction events between users and the radio access network infrastructure was used as a real-world case study for the identified problem. The work proposes a pipeline that handles the complex representation of the data and the definition and tuning of the anomaly detection model without relying on any domain-specific knowledge, and that is therefore potentially adaptable to different contexts. The model was implemented in four different variants, which were evaluated on both normal and anomalous data, partly collected from real cells and partly generated by simulating anomalous behaviours. The results show the applicability of the model to the classification of anomalous events and sequences in the described context, and their more detailed analysis provides interesting interpretations of the differences between the models, their limitations, and their strengths.