An architecture for the preprocessing and analysis of event logs in cybersecurity

The steady rise in cyber attacks over the last decade has highlighted how important cybersecurity is for companies and organizations. In recent years, attackers have focused on the IT systems of government bodies and municipalities, which provide essential services to citizens and store sensitive data, yet often lack state-of-the-art security systems and are therefore vulnerable. Several attack detection systems have been developed that identify an attack in real time through log analysis, but most of them rely on recognizing patterns already known to be malicious. Known threats are thus easy to identify, but the technique remains vulnerable to new attack methods. A possible alternative is to accumulate and analyze large groups of logs so as to obtain an overall view of the data flow, but this approach cannot react to attacks in real time. The work described in this thesis aims to define an architecture for analyzing event logs to detect cyber attacks that combines the two methods just described, following the concept of the Lambda Architecture. This type of architecture splits the computation into two layers that run in parallel: one layer (Speed) performs calculations on the data as it arrives in real time, while the other layer (Batch) performs calculations on the entire set of collected data. In the case presented in this work, when a new log is recorded, it is forwarded in parallel to both layers: in the Speed layer, the log is processed with anomaly detection algorithms to check whether it corresponds to an attack; in the Batch layer, the log is processed and inserted into a Data Warehouse for broader analysis of the entire log flow.
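The two-layer flow described above can be illustrated with a minimal sketch. It is not the implementation developed in the thesis: the class names, the `bytes` and `host` log fields, the z-score rule standing in for the anomaly detection algorithms, and the in-memory list standing in for the Data Warehouse are all assumptions made for illustration. For simplicity, the two layers are called one after the other rather than truly in parallel.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev


@dataclass
class SpeedLayer:
    """Real-time path: flags a log whose value deviates strongly from recent ones."""
    window: int = 50        # hypothetical: how many recent values are kept
    threshold: float = 3.0  # hypothetical: z-score cutoff for an alert
    history: list = field(default_factory=list)

    def process(self, log: dict) -> bool:
        value = log["bytes"]  # hypothetical numeric feature of the log
        anomalous = False
        if len(self.history) >= 5:  # wait for a few samples before judging
            mu, sigma = mean(self.history), pstdev(self.history)
            anomalous = sigma > 0 and abs(value - mu) / sigma > self.threshold
        # Keep only the most recent `window` values.
        self.history = (self.history + [value])[-self.window:]
        return anomalous


@dataclass
class BatchLayer:
    """Batch path: stores every log for later analysis over the whole set."""
    warehouse: list = field(default_factory=list)  # stand-in for a Data Warehouse

    def process(self, log: dict) -> None:
        self.warehouse.append(log)

    def total_bytes_by_host(self) -> dict:
        # Example of a query over the entire collected data set.
        totals: dict = {}
        for log in self.warehouse:
            totals[log["host"]] = totals.get(log["host"], 0) + log["bytes"]
        return totals


def ingest(log: dict, speed: SpeedLayer, batch: BatchLayer) -> bool:
    """Forward one log to both layers; return True if the Speed layer raises an alert."""
    batch.process(log)
    return speed.process(log)
```

In this sketch, `ingest` returns the Speed layer's verdict immediately, while the Batch layer accumulates logs so that aggregate queries such as `total_bytes_by_host` can later be run over everything collected.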
