On the use of callgraphs for binary analysis

Reverse engineering is the process of analyzing a system or a binary file in order to identify its components and their interactions, and to obtain a representation of the system at a higher level of abstraction. This process makes it possible to understand the inner workings of software for which no source code is available (e.g., malware, proprietary software, firmware). Despite many efforts from the community, reverse engineering is still very time-consuming: a high degree of manual interaction is usually required, and researchers have a limited set of analyses and tools at their disposal. Building upon current state-of-the-art solutions in binary analysis and graph theory, we developed a novel clustering technique and a scalable graph comparison algorithm. The first contribution enables the automated extraction of libraries embedded in binaries when no prior knowledge is available (i.e., no debug symbols are present). The second is an efficient labeling methodology that relies on callgraphs extracted from binaries. We conclude the study by evaluating these techniques on an extensive dataset: we implemented them and tested them on 60 generic binaries and 1644 static libraries. We evaluated the clustering algorithm on binaries generated from our dataset of static libraries, correctly extracting 70.19% of the static libraries. We also conducted an extensive evaluation of the labeling methodology on callgraphs extracted from the libraries in our dataset: for callgraphs with 10 or more nodes, it returned the correct label among the top 5 results 98.20% of the time. Finally, we evaluated the labeling methodology on generic binaries, where it returned the correct label among the top 3 results almost 100% of the time.

Reverse engineering consists of analyzing systems or binary files in order to identify their constituent components and their interactions, and to obtain a representation of the system at a higher level of abstraction. This process makes it possible to understand the internal operation of software when the source code is not available (e.g., malware, proprietary software, firmware). Despite the community's many efforts, reverse engineering remains a time-consuming process: a high degree of manual interaction is normally required, and researchers have a limited set of tools at their disposal. Starting from the current state of the art in binary analysis and graph theory, we developed a new clustering technique and a scalable algorithm for comparing callgraphs. Our first contribution is the clustering algorithm, which enables the automatic extraction of libraries embedded in binary files when no preliminary information is available (i.e., no debug symbols are present). The second contribution is an efficient labeling algorithm that uses callgraphs extracted from binary files. We concluded our study by evaluating these techniques on an extensive dataset of libraries and binary files. We implemented our analysis techniques and tested them on a dataset of 60 generic binaries and 1644 static libraries. We evaluated the clustering algorithm on binaries generated from our dataset of static libraries, correctly extracting 70.19% of the static libraries. We also carried out an extensive evaluation of the labeling technique on callgraphs extracted from the libraries in the dataset. The labeling technique provided the correct label among the top 5 results in 98.20% of cases for callgraphs with 10 or more nodes. Finally, we evaluated the labeling technique on binary files, always providing the label among the top 3 results.