In binary code analysis, identifying which libraries have been compiled into a statically linked binary is valuable for tasks such as bug searching and reverse engineering. However, this identification is challenging because different compilers, compilation settings (such as optimization levels) and software architectures introduce differences in the compiled code. Moreover, libraries can exist in many versions, which makes their identification even harder. In this thesis we propose an architecture-independent approach, implemented in a tool called LIBS (Libraries Identification in BinarieS), to identify which libraries have been compiled into a statically linked binary and to recognize their exact versions. Our approach performs library function identification between a statically linked binary and one or more static libraries to find similarities between their functions. It uses a word embedding model to extract a semantic feature from each function; this feature is combined with structural and syntactic features to measure the similarity between two functions. To support different software architectures, the approach operates on an intermediate representation of the code, LLVM IR. Experimental validation on two software architectures, MIPSel and x86_64, shows that the approach identifies the libraries compiled into a statically linked binary with a precision of 87% for MIPSel and 93% for x86_64. Moreover, the results show that LIBS can identify libraries even across different versions and report their exact version with an accuracy of 72% for MIPSel and 84% for x86_64.
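The abstract does not specify which structural and syntactic features LIBS uses or how they are weighted, so the following is only an illustrative sketch of the general idea: an embedding-based semantic score combined with two simple count-based scores. The class `FunctionFeatures`, its fields, the helper names and the weights are all hypothetical and are not taken from the thesis.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass
class FunctionFeatures:
    """Hypothetical per-function features (field names are assumptions)."""
    embedding: list[float]  # semantic feature, e.g. pooled token embeddings of the IR
    num_blocks: int         # structural feature: number of basic blocks
    num_calls: int          # syntactic feature: number of call instructions


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def ratio(a: int, b: int) -> float:
    """Similarity of two non-negative counts, in [0, 1]."""
    return 1.0 if a == b else min(a, b) / max(a, b)


def similarity(f: FunctionFeatures, g: FunctionFeatures,
               w_sem: float = 0.6, w_struct: float = 0.2,
               w_syn: float = 0.2) -> float:
    """Weighted sum of the three scores; the weights are arbitrary placeholders."""
    return (w_sem * cosine(f.embedding, g.embedding)
            + w_struct * ratio(f.num_blocks, g.num_blocks)
            + w_syn * ratio(f.num_calls, g.num_calls))
```

Under these assumptions, a function from the binary would be scored against every candidate library function, and the highest-scoring pairs would be treated as matches.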
LIBS: libraries identification in statically linked binaries using word embeddings
RIZZI, MATTEO
2022/2023
File | Description | Access | Size | Format |
---|---|---|---|---|
2024_04_Rizzi_Executive_Summary_02.pdf | Executive Summary | Openly accessible on the internet | 452.71 kB | Adobe PDF |
2024_04_Rizzi_Thesis_01.pdf | Thesis | Openly accessible on the internet | 603.33 kB | Adobe PDF |
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/219657