Improving the extraction of threat actor behavior from cyber threat intelligence reports

The cybersecurity field is a game between defense and attack. The higher the investment in defensive security measures, the more specialized and sophisticated attack methodologies become. Because the cyber threat ecosystem is vast, containing both cyber groups and nation-state-sponsored activities, it is necessary to understand which adversaries we are facing in order to maintain a proper defense posture. To support defense strategies, knowledge about new cyber attacks can be exchanged in the form of cyber threat intelligence (CTI) reports. However, because analyzing these reports manually is demanding, detecting and attributing cyber attacks requires complex CTI tools able to analyze a huge volume of structured and unstructured textual data. CTI tools adopt different methodologies to perform context-aware analysis that captures the complexities of natural language. The strength of a CTI tool in generating high-quality output graphs lies fundamentally in its ability to extract relevant information from the threat description. In this thesis, we investigate whether a new approach for extracting actors and their relationships from unstructured textual sources can enhance the effectiveness of CTI tools in attack detection and attribution. We extend the existing solution Extractor, since it is the most advanced tool for extracting concise attack behaviors as provenance graphs, by introducing two new components: a Named Entity Recognition (NER) component for entity extraction and a new data-driven approach that classifies entity relations based on the meaning of the text. The result of this study is "DynExt", our new CTI framework. DynExt reaches a recall of 83\%, an increase of 16\% with respect to Extractor, confirming its enhanced capacity to extract detailed information from diverse unstructured textual input.
Our proposed approach not only addresses existing challenges in cyber threat detection but also significantly enhances the ability of CTI tools to stay ahead of cyber threats in an ever-changing landscape.

The field of cybersecurity is a game between defense and attack. The higher the investment in defensive security measures, the more specialized and sophisticated the attack procedure becomes. Considering that the cyber threat ecosystem is vast and contains both cyber groups and activities sponsored by nation states, it is necessary to understand which enemies we are facing in order to maintain an adequate defense mechanism. To support defense strategies, knowledge about new cyber attacks can be exchanged in the form of cyber threat intelligence (CTI) reports. However, because of the demanding manual effort involved, detecting and attributing cyber attacks requires complex CTI tools capable of analyzing an enormous volume of structured and unstructured textual data. CTI tools adopt different kinds of methodologies to perform a context-aware analysis that captures the complexities of natural language. The intrinsic strength of a CTI tool in generating high-quality output graphs lies fundamentally in its ability to extract relevant information from the threat description. In this thesis, we want to verify whether implementing a new approach for extracting actors and their relationships from unstructured textual sources makes it possible to improve the effectiveness of CTI tools in attack detection and attribution. We extended the existing solution Extractor, which is the most advanced tool for extracting concise attack behaviors as provenance graphs, by introducing two new components: a Named Entity Recognition (NER) component for entity extraction and a new data-driven approach for classifying the relations between entities based on the meaning of the text. The result of this study is "DynExt", our new CTI framework.
DynExt is able to reach a recall score of 83\%, which represents an increase of 16\% with respect to Extractor, confirming the greater capacity to extract detailed information from diverse unstructured textual inputs. The proposed approach not only addresses the existing challenges in cyber threat detection, but also significantly improves the ability of CTI tools to keep pace with cyber threats in a continuously evolving landscape.