Machine-learning-assisted failure management in microwave networks

A large share of next-generation (5G) services and applications, such as cloud computing, video streaming, and smart working platforms, have strict availability constraints and must be reachable at all times from any point of the network. The growing use of these applications pushes internet service providers (ISPs) to research new solutions for Ultra-Reliable Low-Latency Communication (URLLC). Because failures can reduce network availability, failure identification is a crucial element of service maintenance and must be carried out quickly. In this work, we consider failure management in microwave networks, focusing mainly on the failure identification problem. Today, failure identification is carried out by human experts, who typically analyse the radio power measurements and equipment alarms related to a failure event and, based on experience, identify its possible root causes and the proper countermeasures to restore the service. The amount of failure-related data that must be analysed is usually huge, and URLLC imposes a stringent time constraint on restoring the service after a failure. This constraint can be met with data analysis techniques that can process large quantities of data in a short time. We therefore adopt Machine Learning techniques to address three main issues in failure management in microwave networks. The first task of this thesis is to develop a failure classifier that takes as input a new network measurement associated with a failure event and outputs the root cause that led to that event. Because supervised learning relies on labelled data, a costly labelling process is typically required, in which human experts analyse the data (in our case, the radio performance metrics of microwave links) and identify failure root causes "manually".
In contrast, collecting unlabelled data can be much easier, as it can leverage the measurements that several network monitors provide through network management systems. An accurate automated labelling of data, based on the partial knowledge provided by the human-labelled data, is therefore crucial. As a second task, we propose an automated labelling procedure that exploits the unlabelled data in the development of the failure-cause classifier and improves classification performance with respect to the case in which only a small labelled dataset is used. Finally, as the third task of the thesis, we consider a specific subset of the available data. This subset contains labelled data for which the failure root cause cannot be clearly identified by evaluating the radio performance metrics, but is mainly related to a hardware failure on the radio link. We provide an unsupervised learning method that divides this subset into multiple sub-groups, with the objective of finding common patterns within the resulting clusters. In all three tasks, we evaluate the effectiveness of various machine learning algorithms, such as ANN, SVM, KNN, and K-means, in terms of classification or clustering performance and time complexity.
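
The automated labelling idea in the second task can be illustrated with a generic self-training (pseudo-labelling) loop. This is only a minimal sketch under stated assumptions, not the procedure developed in the thesis: a KNN classifier trained on the small labelled set labels the unlabelled samples, and only samples whose neighbours agree unanimously are added to the training set. The two-dimensional feature vectors, the label names "propagation" and "hardware", and the function names `knn_predict` and `self_training` are all hypothetical and chosen for illustration.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Return (label, confidence) for `query` by majority vote of its k nearest neighbours.

    `train` is a list of (feature_tuple, label) pairs; confidence is the
    fraction of the k neighbours that voted for the winning label.
    """
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    label, count = votes.most_common(1)[0]
    return label, count / len(neighbours)


def self_training(labelled, unlabelled, k=3, threshold=1.0, max_rounds=10):
    """Grow the labelled set with confidently pseudo-labelled samples.

    Each round labels every remaining sample with the current training set
    and keeps only predictions whose confidence reaches `threshold`.
    Stops when a round accepts nothing or after `max_rounds`.
    """
    labelled = list(labelled)
    pool = list(unlabelled)
    for _ in range(max_rounds):
        accepted, remaining = [], []
        for x in pool:
            label, confidence = knn_predict(labelled, x, k)
            if confidence >= threshold:
                accepted.append((x, label))
            else:
                remaining.append(x)
        if not accepted:
            break
        labelled.extend(accepted)
        pool = remaining
    return labelled, pool


# Toy data (hypothetical, already-normalised features; not from the thesis).
labelled = [
    ((0.0, 0.0), "propagation"), ((0.0, 1.0), "propagation"), ((1.0, 0.0), "propagation"),
    ((10.0, 10.0), "hardware"), ((10.0, 11.0), "hardware"), ((11.0, 10.0), "hardware"),
]
unlabelled = [(0.2, 0.1), (5.0, 5.3), (10.8, 10.9)]

augmented, still_unlabelled = self_training(labelled, unlabelled)
print(len(augmented), still_unlabelled)
```

In this toy run the two samples close to a labelled group are pseudo-labelled, while the sample lying between the groups stays unlabelled because its neighbours disagree. The unanimity threshold is a design choice that trades coverage for label quality: lowering it labels more data but increases the risk of propagating wrong labels into the classifier.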

Most new-generation (5G) services and applications, such as cloud computing, video streaming, and smart working platforms, have strict availability constraints and must be accessible from every point of the network. The growth in the use of these applications pushes internet service providers (ISPs) to look for new solutions to provide Ultra-Reliable Low-Latency Communication (URLLC). Since failures can affect network availability, identifying them is a crucial element of service maintenance and must be done quickly. In our work we consider failure management in microwave networks and focus mainly on the identification problem. Today the failure identification procedure is performed by experts, who typically analyse the radio power measurements and the equipment alarms related to the failure event and, relying on experience, identify its possible causes and the possible countermeasures to restore the service. The amount of data on failure events to be analysed is generally enormous, and URLLC imposes a stringent time constraint for restoring the service after a failure, which can be met using data analysis techniques capable of handling enormous quantities of data in little time. In our work, we opted for techniques from the field of Machine Learning to solve three fundamental problems in failure management in microwave networks. The first objective of the thesis is the development of a failure classifier able to receive as input network measurements associated with a failure event and to output the cause that led to that specific event.
Since supervised learning is based on labelled data, a costly labelling process is typically required: experts analyse the data (in our case, the radio performance metrics of microwave links) and identify the causes of the failure "manually". In contrast, collecting unlabelled data can be much simpler, since one can exploit the measurements of several network monitors retrieved from the network management systems. An automatic labelling method based on partial knowledge of the labelled data, that is, on the data labelled through the manual process, is therefore crucial. To reach the second objective, we propose an automatic labelling procedure that takes advantage of the unlabelled data in the development of the failure-cause classifier and improves classification performance with respect to the case in which only a small set of labelled data is used. Finally, as the third objective of the thesis, we consider a specific subset of the available data, which contains labelled data for which the failure cause is not easily determined by evaluating the radio performance metrics, but is mainly linked to hardware failures on the radio link. We present an unsupervised learning method to divide this subset of data into several sub-groups, with the objective of finding similar behaviours within the resulting clusters. In all three analyses we evaluated the effectiveness of various machine learning algorithms, such as ANN, SVM, KNN, and K-means, with respect to classification or clustering performance and time complexity.