Automating cyber threat analysis with LLMs: a methodology for building and serving knowledge graphs

As cyber threats grow more frequent and sophisticated, organizations must adopt effective strategies to secure their systems. Cyber Threat Intelligence (CTI) plays an important role here: it comprises proactive approaches to identifying, analyzing, and mitigating cyber threats. However, much of the available CTI is contained in unstructured textual reports, which makes large-scale analysis complex to organize and costly in both time and money. This thesis presents an approach that uses Large Language Models (LLMs) to automate CTI processing. The proposed system has two main components: a pipeline that extracts structured information from unstructured CTI reports, using Named Entity Recognition (NER) and Relation Extraction (RE) to construct a Knowledge Graph (KG); and a conversational model based on Retrieval-Augmented Generation (RAG) that combines Global Knowledge, drawn from the constructed KG and a CVE database, with Local Knowledge about the organization's specific software configuration to generate cybersecurity assessments. The experimental evaluation was conducted on two fronts. First, several BERT-based models for the NER and RE tasks were compared against the state-of-the-art LADDER framework and achieved competitive results. Second, the RAG approach was assessed with both qualitative and quantitative metrics and compared with existing solutions such as LocalIntel. The results indicate that the proposed system improves the structuring and contextualization of CTI data, enhancing the accuracy and relevance of the generated responses. This work contributes to the field by introducing an efficient pipeline for automated threat intelligence processing and by advancing the integration of knowledge graphs in cybersecurity applications.

In today's landscape of increasingly frequent and sophisticated cyber threats, organizations are called upon to adopt advanced strategies to strengthen the security of their systems. Cyber Threat Intelligence (CTI) plays a crucial role in this scenario, relying on proactive approaches to identify, analyze, and counter cyber threats. However, a significant share of CTI information is contained in unstructured textual reports, which makes large-scale analysis complex, resource-intensive, and time-consuming. This thesis presents an innovative approach that integrates Large Language Models to automate threat intelligence. The proposed system consists of two main components: a pipeline for extracting structured information from unstructured CTI reports, which uses Named Entity Recognition (NER) and Relation Extraction (RE) techniques to build a Knowledge Graph (KG); and a RAG-based conversational model that draws on both Global Knowledge, from the constructed graph and a database of known vulnerabilities, and Local Knowledge about the software specified by the user, to generate responses in the cybersecurity domain. The experimental evaluation was conducted on two fronts. First, the performance of several BERT-based models used for the NER and RE tasks was compared with that of LADDER, a state-of-the-art system, showing competitive results. In addition, the effectiveness of the RAG approach was evaluated using both qualitative and quantitative metrics, comparing its performance with that of LocalIntel, a state-of-the-art reference system. The results showed that the proposed system makes effective use of contextual data to provide precise and relevant guidance on threats. This work contributes to the field by introducing an efficient pipeline for automated threat intelligence processing and by advancing the integration of Knowledge Graphs in cybersecurity applications.
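
To make the two-component architecture more concrete, the following is a minimal sketch and not the implementation used in this thesis. It assumes the NER and RE stages have already produced (head, relation, tail) triples, and it uses exact string matching as a stand-in for retrieval. All entity names, CVE identifiers, and function names (`build_graph`, `retrieve_context`, `build_prompt`) are hypothetical placeholders introduced for illustration only.

```python
from collections import defaultdict

# Hypothetical output of the NER and RE stages: (head, relation, tail) triples.
# Every name below is a placeholder, not data from the thesis.
TRIPLES = [
    ("ExampleMalware", "targets", "ExampleServer"),
    ("ExampleMalware", "exploits", "CVE-0000-0001"),
    ("ExampleGroup", "uses", "ExampleMalware"),
]

# Hypothetical stand-in for a CVE database, keyed by placeholder identifiers.
CVE_DB = {
    "CVE-0000-0001": {"product": "ExampleServer", "summary": "placeholder description"},
    "CVE-0000-0002": {"product": "OtherProduct", "summary": "placeholder description"},
}


def build_graph(triples):
    """Index each triple under both of its entities so either end can be looked up."""
    graph = defaultdict(list)
    for head, relation, tail in triples:
        graph[head].append((head, relation, tail))
        graph[tail].append((head, relation, tail))
    return graph


def retrieve_context(graph, cve_db, local_software):
    """Collect Global Knowledge (graph facts and CVE entries) that mentions
    any product listed in the Local Knowledge (the software inventory)."""
    facts = []
    for product in local_software:
        for head, relation, tail in graph.get(product, []):
            facts.append(f"{head} {relation} {tail}")
        for cve_id, entry in sorted(cve_db.items()):
            if entry["product"] == product:
                facts.append(f"{cve_id} affects {product}: {entry['summary']}")
    return facts


def build_prompt(question, facts, local_software):
    """Assemble the text that would be passed to a language model."""
    lines = ["Installed software: " + ", ".join(local_software), "Relevant facts:"]
    lines += [f"- {fact}" for fact in facts]
    lines.append("Question: " + question)
    return "\n".join(lines)


if __name__ == "__main__":
    software = ["ExampleServer"]
    context = retrieve_context(build_graph(TRIPLES), CVE_DB, software)
    print(build_prompt("Which threats affect my systems?", context, software))
```

A real system would likely replace the exact-match lookup with a graph database query or embedding-based retrieval, and would send the assembled prompt to an LLM. The sketch only shows how Global Knowledge and Local Knowledge might be joined before generation.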