In this thesis, we focus on the reverse engineering task of identifying functions that the compiler has inlined as a result of optimizations. We build on BINO and MemRec, two frameworks that address this challenge by capturing features of assembly functions such as Control-Flow Graph (CFG) structure, syntactic and semantic features, and memory accesses. We integrated the two frameworks and adopted two strategies to assess the BINO and MemRec results. The main goal is to discard false positives, making the final result noticeably more reliable. The first strategy computes the Graph Edit Distance (GED) between the CFG formed by the basic blocks that MemRec identifies as containing inlined code and the CFGs of the corresponding recognized inlined functions. The GED measures the similarity of two graphs as the number of edit operations required to make one graph isomorphic to the other. The second strategy uses two frameworks, Asm2Vec and Palmtree, for the clone search task, relying on vector representations of assembly code. These frameworks generate vector representations of the basic blocks claimed to be inlined by BINO or MemRec and of the corresponding inlined function. The two vectors are then compared using cosine similarity. We built a set of models that classify a match as a true or a false positive given the Asm2Vec, Palmtree, and GED values. We trained the framework on one hundred projects taken from a large GitHub C++ dataset. INReco, our framework, and the models were evaluated on 100 C++ binaries. Overall, we achieved a recall of 0.90 with a precision of 0.77.
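To make the GED-based check concrete, the following is a minimal sketch, not the thesis implementation. It assumes unlabeled directed graphs, unit costs for every node and edge insertion or deletion, and a brute-force search that is only practical for very small CFGs. The function name `graph_edit_distance` and the example graphs are hypothetical.

```python
from itertools import permutations


def graph_edit_distance(nodes_a, edges_a, nodes_b, edges_b):
    """Exact GED for small unlabeled directed graphs, unit costs (assumption)."""
    # Make A the smaller graph; with unit costs the distance is symmetric.
    if len(nodes_a) > len(nodes_b):
        nodes_a, edges_a, nodes_b, edges_b = nodes_b, edges_b, nodes_a, edges_a
    edges_b = set(edges_b)
    # Every unmatched node of the larger graph costs one insertion/deletion.
    node_cost = len(nodes_b) - len(nodes_a)
    best = None
    # Try every injective mapping of A's nodes onto B's nodes.
    for image in permutations(nodes_b, len(nodes_a)):
        mapping = dict(zip(nodes_a, image))
        mapped = {(mapping[u], mapping[v]) for u, v in edges_a}
        # Edges present in only one of the two graphs must be added or removed.
        cost = node_cost + len(mapped ^ edges_b)
        if best is None or cost < best:
            best = cost
    return best


# Hypothetical CFG of a candidate region: an if/else diamond.
region_nodes = ["entry", "then", "else", "exit"]
region_edges = [("entry", "then"), ("entry", "else"),
                ("then", "exit"), ("else", "exit")]

# Hypothetical CFG of a function with the same shape but different block names.
same_nodes = ["b0", "b1", "b2", "b3"]
same_edges = [("b0", "b1"), ("b0", "b2"), ("b1", "b3"), ("b2", "b3")]

# Hypothetical CFG of a straight-line function with three blocks.
chain_nodes = ["a", "b", "c"]
chain_edges = [("a", "b"), ("b", "c")]

ged_same = graph_edit_distance(region_nodes, region_edges, same_nodes, same_edges)
ged_chain = graph_edit_distance(region_nodes, region_edges, chain_nodes, chain_edges)
print(ged_same, ged_chain)
```

A distance of zero means the two CFGs are isomorphic, while larger values indicate a candidate match that is structurally less plausible. Real CFGs would need an approximate or bounded search, since the number of mappings grows factorially with the number of blocks.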
This project focuses on identifying calls to methods inlined by the compiler as a result of optimizations. Starting from BINO and MemRec, two frameworks that address this challenge by capturing several features of assembly functions, such as the structure of the Control-Flow Graph, syntactic and semantic features, and memory accesses, we merged the two approaches with the goal of reducing false positives and noticeably improving the final precision. We implemented two different methods to identify incorrect results reported by BINO and MemRec. The first computes the Graph Edit Distance (GED) between the Control-Flow Graph (CFG) formed by the basic blocks that may contain the code of an inlined method and the CFGs of the corresponding recognized inlined methods. The evaluation is based on the number of edits needed to transform the CFG formed by the basic blocks identified by MemRec into the CFG of the inlined method. The second uses two frameworks, Asm2Vec and Palmtree, for clone search, based on the vector representation of assembly code. These frameworks generate a vector representing the code of the basic blocks containing inlined code, as reported by BINO or MemRec, and of the corresponding inlined method. We then compared the generated vectors using cosine similarity. Our framework uses a set of models to classify a match as a true or false positive, based on the values obtained from the Palmtree, Asm2Vec, and GED analyses. We trained the framework on 100 projects drawn from a dataset of C++ projects on GitHub and then evaluated it by analyzing 100 binaries. Overall, INReco achieved a recall of 0.90 and a precision of 0.77.
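The vector comparison step can be illustrated with a short sketch of cosine similarity. The embeddings below are made-up toy vectors, not Asm2Vec or Palmtree output, and the function name `cosine_similarity` is hypothetical. Returning 0.0 for a zero vector is a convention chosen here for convenience.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Assumption: treat a zero vector as dissimilar to everything.
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy embeddings (assumed values) for a candidate region and two functions.
region_vec = [1.0, 2.0, 3.0]
similar_vec = [2.0, 4.0, 6.0]     # same direction as region_vec
unrelated_vec = [3.0, 0.0, -1.0]  # orthogonal to region_vec

sim_similar = cosine_similarity(region_vec, similar_vec)
sim_unrelated = cosine_similarity(region_vec, unrelated_vec)
print(sim_similar, sim_unrelated)
```

Values near 1 indicate that the candidate basic blocks and the function have similar embeddings, and values near 0 indicate little similarity. In the workflow described above, such scores are inputs to the classification models together with the GED value.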
INReco: INlined functions RECOgnizer
Cappellozza, Francesco
2023/2024
File | Size | Format | |
---|---|---|---|
2024_07_Cappellozza_Executive_Summary.pdf (Executive summary; openly accessible online) | 899.21 kB | Adobe PDF | |
2024_07_Cappellozza_Tesi.pdf (Thesis; openly accessible online) | 7.77 MB | Adobe PDF | |
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/222741