Cerberus: from proof of concept to real system

Botnets are networks of infected machines controlled by an external entity, the botmaster, who uses this infrastructure for malicious activities (e.g. spamming and Distributed Denial of Service attacks). The botmaster employs a machine, the Command and Control Server (C&C), to send commands to the bots and gather information from them. The communication between the bots and the C&C is established through a variety of protocols that change from botnet to botnet. In the case of DGA-based botnets, the rendezvous point between the bots and the botmaster is found by means of a Domain Generation Algorithm (DGA). Botnet mitigation is widely covered in the literature, but several proposed systems suffer from one of two major shortcomings: they either use a supervised approach, which means the system needs some a priori knowledge, or rely on DNS data that contain information about the infected machines, which raises user privacy issues. We have concentrated on CERBERUS, an automated system based on machine learning that overcomes these shortcomings: its approach is unsupervised, so it needs no a priori knowledge, and it analyzes passive DNS data, which are free of privacy issues. CERBERUS has proven to be an effective system for discovering botnets. We have turned CERBERUS, so far only a proof of concept, into a real working system able to operate on its own, and we have deployed it in the real world. To do so, we added a new module that feeds the system with data collected from a passive DNS sensor, and we partially reorganized the program flow to meet our goals in terms of performance and robustness. The new system is multiprocess, so it can maximize the use of computing resources, and it is scalable. Moreover, it can recover from errors without compromising the integrity of the system. CERBERUS is now working around the clock, analyzing DNS data and classifying malicious domains to discover new threats.
We monitored CERBERUS continuously for two weeks, during which it analyzed 13,506,000 domains and classified 144 of them as malicious. We then let CERBERUS run on its own for two more weeks and checked that the system was still working, which allowed us to show that the system is autonomous.

Botnets are networks of infected machines controlled by a botmaster, a person able to control these machines, usually for malicious, profit-driven activities (for example spamming and DDoS, Distributed Denial of Service, attacks). The botmaster uses a machine, called the Command and Control Server (C&C Server), to send commands to the bots and gather information from them. The communication between the bots and the C&C Server takes place through different protocols that vary from botnet to botnet. In the case of centralized botnets (the most widespread), the protocol most often used to establish communication between bots and botmaster is the Domain Generation Algorithm (DGA). This protocol automatically generates many random domains, which the bots try to reach through HTTP requests; the botmaster registers only one of these domains and waits for the bots to contact it to establish the connection. The generated domains change after a certain time, for example one day, and the botmaster registers a different domain each time, which makes it harder to locate the C&C Server, the single point of failure of the infrastructure. Botnet mitigation is widely covered in the literature, but many of the proposed systems either use a supervised approach, meaning the system needs labeled data and a base of knowledge from which to start classifying, or use information about the IP addresses of infected users, which raises user privacy issues and makes it difficult to deploy a monitoring system at the lower levels of the DNS hierarchy. We have concentrated on Cerberus, an automated system based on machine learning that overcomes these limitations: its approach is unsupervised, so it needs no a priori knowledge, and it analyzes passive DNS data, which are free of privacy issues.
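The DGA rendezvous mechanism described above can be illustrated with a minimal sketch. It is a toy example under stated assumptions, not the algorithm of any real botnet and not part of Cerberus: the names `generate_domains` and `find_rendezvous`, the seed string, the SHA-256 derivation, the `.example` suffix and the set-based "registry" are all hypothetical and chosen only for illustration.

```python
import hashlib
from datetime import date


def generate_domains(seed, day, count=5, tld=".example"):
    """Derive `count` pseudo-random domain names from a shared seed and a date.

    Hypothetical toy scheme: bots and botmaster share `seed`, so both sides
    compute the same candidate list for a given day without communicating.
    """
    domains = []
    for i in range(count):
        material = f"{seed}|{day.isoformat()}|{i}".encode()
        digest = hashlib.sha256(material).hexdigest()
        # Map the first 12 hex digits to the letters a-p to form the label.
        label = "".join(chr(ord("a") + int(c, 16)) for c in digest[:12])
        domains.append(label + tld)
    return domains


def find_rendezvous(candidates, registered):
    """Return the first candidate found in `registered`, or None.

    `registered` is a plain set standing in for DNS (an assumption of this
    sketch); a real bot would attempt to resolve and contact each candidate.
    """
    for domain in candidates:
        if domain in registered:
            return domain
    return None


# Both sides derive the same list for a given day.
today = generate_domains("shared-secret", date(2014, 1, 1))
tomorrow = generate_domains("shared-secret", date(2014, 1, 2))

# The botmaster registers only one of today's candidates.
registered = {today[3]}
rendezvous = find_rendezvous(today, registered)
```

In this sketch every candidate that is not registered would produce a failed lookup, and the candidate list changes completely from one day to the next.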
Cerberus has proven to be an effective system for discovering new botnets. Our work turned Cerberus, until now only a proof of concept, into a real system able to operate on its own in the real world, analyzing data in real time. To reach this goal, we added a new module that supplies the system with data from a passive DNS sensor, and we partially reorganized the program flow to meet our requirements in terms of performance and robustness. The new system is multiprocess, so that it maximizes the use of computing resources, and it is scalable. It can also handle the errors that may occur without compromising the integrity of the system. Cerberus is now working full time, analyzing data and classifying malicious domains to discover new threats. We monitored Cerberus continuously for two weeks, during which the system analyzed 13,506,000 domains and classified 144 of them as malicious. We then let Cerberus run on its own for two more weeks and verified that the system had operated without problems and was still working, which allowed us to show that the system is autonomous.