Binary function vulnerability discovery through LLVM IR

The existence of security vulnerabilities in programs is one of the defining problems of the modern computing era. As a consequence, software analysis has become one of the most relevant fields in computer security, and a growing number of researchers are developing tools that discover security vulnerabilities in the shortest possible time. One branch of software analysis, binary analysis, works directly on program executables in binary format; it is particularly useful because it can be used to find vulnerabilities in software that has already been released.

Recently, intermediate representation languages, which are designed to simplify and enhance the analysis of program executables, have attracted increasing interest in binary analysis research. We adopted one such language in our project in order to cover different machine architectures and to make our tool easily extensible.

In this thesis we focus on a specific type of binary analysis, static binary analysis, which analyzes the executable without executing or emulating it and instead builds an intermediate representation of it. We chose this approach because it ensures coverage of all the binary code inside the executable.

The goal of our project is to design a security tool capable of detecting a particular type of vulnerability, known as buffer overflow, in compiled software (binary executables). In particular, we are interested in a specific kind of buffer overflow, the loop-based buffer overflow: this vulnerability occurs when the program contains a loop that, at each iteration, stores an element of a source buffer into a destination buffer without checking the destination size.
This loop is typically controlled by user input, which allows an attacker to overwrite variables stored in program memory next to the destination buffer and, in the worst case, to execute injected malicious code. Moreover, this type of vulnerability is very common in strcpy-like functions.

To identify such vulnerabilities in binaries for different architectures, our tool translates the input binary into an intermediate language designed for program analysis and scans this intermediate representation for vulnerabilities. This design also makes the tool modular, so it can be easily extended and enhanced. The tool then scans the function call chain of the program executable and tracks the propagation of user input from specific source functions to all the other functions, implementing a simple taint analysis. This step filters out functions that contain buffer overflows but are not controlled by user input.

We designed three experiments to demonstrate the ability of our tool to detect buffer overflow vulnerabilities in different types of binaries. The first experiment tests the tool against both dynamically and statically linked binaries taken from public CVE lists of vulnerable programs. The second experiment tests the tool against binaries extracted from the embedded firmware of an ARM-based router that had never been analyzed before. The last experiment tests the tool against the example binaries of the DARPA Cyber Grand Challenge, which are built on top of a custom operating system.

Our tests show that the tool is able to identify 11 of the 15 vulnerabilities chosen for the first experiment, as well as a previously undiscovered vulnerability inside one of the firmware binaries in the second experiment. The results of the third experiment show that the tool marks at least one function as vulnerable in every binary that is known to contain a buffer overflow vulnerability.
Overall, the results demonstrate that our tool can be used effectively to simplify vulnerability detection in binaries, but its detection precision still needs improvement.
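To make the loop-based buffer overflow pattern concrete, the following is a minimal sketch and not part of the tool itself. Python is memory safe, so a flat bytearray is assumed to stand in for program memory; the layout, the constants DEST_SIZE and FLAG_OFFSET, and the function names are all hypothetical and chosen only for illustration.

```python
# Toy model of a loop-based buffer overflow.
# Assumption: a flat bytearray plays the role of program memory, with a
# hypothetical 8-byte destination buffer followed by a 1-byte variable.

DEST_SIZE = 8            # hypothetical size of the destination buffer
FLAG_OFFSET = DEST_SIZE  # hypothetical variable stored right after the buffer


def make_memory() -> bytearray:
    """Return zeroed memory laid out as [dest: 8 bytes][flag: 1 byte][rest]."""
    return bytearray(32)


def unchecked_copy(memory: bytearray, source: bytes) -> int:
    """strcpy-like loop: copies up to the NUL byte and never checks DEST_SIZE."""
    i = 0
    while source[i] != 0:      # the loop bound depends only on the source
        memory[i] = source[i]  # may write past the destination buffer
        i += 1
    memory[i] = 0              # terminating NUL
    return i


def checked_copy(memory: bytearray, source: bytes) -> int:
    """The same loop with a bound check on the destination size."""
    i = 0
    while source[i] != 0 and i < DEST_SIZE - 1:
        memory[i] = source[i]
        i += 1
    memory[i] = 0
    return i


if __name__ == "__main__":
    user_input = b"A" * 12 + b"\x00"  # longer than the destination buffer

    vulnerable = make_memory()
    unchecked_copy(vulnerable, user_input)
    print("flag after unchecked copy:", vulnerable[FLAG_OFFSET])

    patched = make_memory()
    checked_copy(patched, user_input)
    print("flag after checked copy:", patched[FLAG_OFFSET])
```

With the 12-byte input, the unchecked loop overwrites the adjacent variable with 0x41 (the byte value of "A"), while the checked loop stops after 7 bytes and leaves it at 0. The input must be NUL-terminated, mirroring the C string convention that strcpy-like functions rely on.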

The existence of security vulnerabilities in programs is one of the problems that most characterizes the modern computing era, so much so that software analysis has become one of the most discussed topics, and an ever-growing number of researchers are racing against time in search of an effective solution for finding vulnerabilities in programs as quickly as possible. In particular, there is a specific kind of program analysis based on analyzing program executables in binary format, called binary analysis, which is particularly useful because it can be used to find vulnerabilities in software that has already been released.

Recently, the use of intermediate representation languages, which are languages designed to simplify and improve the analysis of executable software, has received particular attention in binary analysis research. For this reason, we too decided to adopt one of these languages in our software, so that we can easily handle different architectures and make our tool modular.

In our project we focus mainly on static binary analysis because it guarantees complete coverage of the code inside the binary, whereas dynamic binary analysis cannot test all program inputs because the input domain is generally too large.

In this project we present a computer security tool capable of finding a particular type of software vulnerability, called buffer overflow, inside already compiled programs. In particular, our project focuses on a specific kind of buffer overflow, the loop-based buffer overflow: this type of vulnerability is present in programs containing loops that, at each iteration, copy an element from a source buffer to a destination buffer without checking the size of the destination. Generally, the condition of such loops is controlled by user input, and this allows a possible attacker to overwrite variables in memory adjacent to the destination buffer and, in the worst case, to execute malicious code.

Our tool translates the input binaries into an intermediate representation using an intermediate representation language, then analyzes that representation in search of vulnerabilities. This solution allowed us to create a modular tool that can be easily extended and improved. In addition, the tool can track the propagation of user input through function calls, starting from specific source functions. This allows our tool to remove from the results those functions that contain buffer overflows not controlled by the user.

We outlined three experiments that demonstrate the ability of our tool to detect buffer overflow vulnerabilities in binaries of different kinds. The first experiment tests the tool on binaries linked both dynamically and statically to external libraries, taken from public lists of vulnerable programs. The second experiment tests it on binaries extracted from the firmware of a router based on the ARM architecture. The third and last experiment tests the tool on the example binaries provided by the DARPA Cyber Grand Challenge, built on a particular operating system used for these competitions.

Our tests show that the tool is able to identify 11 of the 15 binaries publicly known to be vulnerable in the first experiment, and it also allowed us to identify a previously unknown vulnerability inside one of the binaries extracted from the firmware in the second experiment. The results of the third experiment further show that the tool was able to flag as vulnerable all the binaries containing at least one buffer overflow. Overall, the results demonstrate that our tool can be used effectively to simplify the detection of this type of vulnerability, although the precision of its results still needs improvement.
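The filtering step that tracks user input from source functions can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual implementation: the flows_to graph (an edge from f to g meaning that data handled by f is passed on to g), the function names, and the candidate set are all hypothetical, and the real tool derives this information from the call chain of the binary.

```python
from collections import deque


def tainted_functions(flows_to, sources):
    """Return every function that user input can reach from the sources.

    flows_to is a hypothetical data-flow graph: flows_to[f] lists the
    functions that receive data handled by f.
    """
    reached = set(sources)
    queue = deque(sources)
    while queue:
        current = queue.popleft()
        for receiver in flows_to.get(current, ()):
            if receiver not in reached:
                reached.add(receiver)
                queue.append(receiver)
    return reached


def filter_candidates(candidates, flows_to, sources):
    """Keep only the overflow candidates that user input can reach."""
    tainted = tainted_functions(flows_to, sources)
    return sorted(name for name in candidates if name in tainted)


if __name__ == "__main__":
    # Hypothetical program: recv is the only source of user input.
    flows_to = {
        "recv": ["handle_request"],
        "handle_request": ["parse_header", "log_event"],
        "parse_header": ["copy_field"],
        "init_defaults": ["copy_default"],
    }
    # Hypothetical functions flagged as containing a loop-based overflow.
    candidates = {"copy_field", "copy_default"}
    print(filter_candidates(candidates, flows_to, {"recv"}))
```

In this toy program only copy_field is kept, because copy_default is never reached by data originating from recv. A breadth-first traversal is used here for simplicity; it over-approximates real data flow, since it treats every edge as carrying user input.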