Towards automated cutting of binaries

Reverse engineering of binary files is the process by which analysts try to reconstruct the purpose of the various parts of a piece of software and the interactions among them; it is a complex, time-consuming task that requires experienced analysts. This is particularly true for embedded firmware: several factors usually force developers to release statically linked, stripped firmware images. In such binaries, which contain many functions and no debug symbols, it is very difficult to identify the linked functions, determine their purpose, and tell whether they come from a specific library. Hints about the presence of known library functions are important, since they can make the reversing process more efficient.

The purpose of this thesis is to develop efficient, automatic techniques for identifying portions of code that come from known libraries. This information gives the analyst hints about the purpose of the functions in a given portion of code, and can support a subsequent, more focused function recognition phase. We identified two sub-problems: first, identifying clusters of functions that share the same source library; then, identifying that library. Accordingly, we organized our tool in two parts:
- A technique for locating the boundaries between portions of code with different origins, and consequently creating clusters of functions with a common origin.
- An algorithm that, starting from the features extracted from the clusters, identifies the source library.

To assess the correctness of the results, we constructed a significant and well-documented dataset of embedded binaries, which we use as our ground truth. We tested the clustering part on our firmware dataset, showing that the resulting clusters have a high degree of internal coherence, with almost 80% of the functions in a cluster sharing the same source library, and a good overall accuracy (about 86%). Similarly, we tested the identification part, with which we were able to correctly recognize the source library of more than half of the functions in the binaries.
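
As an illustration of the internal-coherence figure quoted above, the following minimal sketch computes, for each cluster, the fraction of functions that share the cluster's most common source library. This is only one plausible reading of the metric, not necessarily the exact definition used in the thesis; the function names `cluster_purity` and `mean_purity`, the library labels, and the sample clusters are all hypothetical.

```python
from collections import Counter


def cluster_purity(labels):
    """Fraction of functions in one cluster that carry the cluster's
    most common ground-truth library label (assumed metric)."""
    if not labels:
        raise ValueError("a cluster must contain at least one function")
    majority_count = Counter(labels).most_common(1)[0][1]
    return majority_count / len(labels)


def mean_purity(clusters):
    """Size-weighted average purity over all clusters."""
    total = sum(len(labels) for labels in clusters)
    if total == 0:
        raise ValueError("no functions to evaluate")
    majority = sum(Counter(labels).most_common(1)[0][1] for labels in clusters)
    return majority / total


# Hypothetical ground-truth labels, one inner list per cluster,
# one label per function in that cluster.
clusters = [
    ["libc", "libc", "libc", "libm"],
    ["zlib", "zlib", "app", "zlib", "zlib"],
]

print([cluster_purity(labels) for labels in clusters])  # [0.75, 0.8]
print(mean_purity(clusters))
```

Weighting by cluster size keeps a few very small, perfectly pure clusters from inflating the overall value; an unweighted mean over clusters would be an equally valid alternative under a different definition.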
