Detecting Android malware campaigns via application similarity analysis

Given the growing number of detected Android malware samples and the steadily increasing worldwide market share of Google's mobile operating system, which reached 84.7% in Q3 2015, the tools used to identify malicious applications must be kept up to date and must offer a fast, reliable overview of the malware scene. As part of the effort to mitigate this threat, we present an approach and a tool that allow security analysts to perform accurate searches over the analysed applications and to find correlations between them, grouping malicious applications into clusters of similarity. Our work is based on the notion that the most valuable asset a piece of malware can steal is the user's data and that, in order to retrieve it, the application must connect to an endpoint (e.g., a server, a phone) controlled by the criminal. By performing a static analysis of the application's source code and then comparing the information gathered with that from third-party dynamic tools, we obtain the list of endpoints contained in the application; if some of them are categorized as malicious, we conclude that the application is a threat. To achieve this goal, we collect general information about the application (e.g., the package name, the Android version required, the usage of Google Cloud Messaging); the components of the application (i.e., its activities, services, broadcast receivers and content providers); and textual information contained in the application (e.g., the strings, the URLs, the phone numbers). We enrich the collected static data by classifying the endpoints according to their maliciousness and by identifying the geographical area targeted by the malware through a deeper investigation of the strings.
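To make the endpoint-based reasoning concrete, the following is a minimal sketch in Python of how endpoints might be pulled out of an application's strings and checked against a list of known malicious endpoints. It is an illustration under stated assumptions, not the tool's actual implementation: the function names (extract_endpoints, is_threat), the two regular expressions, and the idea of a plain blacklist set are all hypothetical, and a real analysis would need far more robust parsing.

```python
import re

# Both patterns are deliberately simple and are assumptions made for this sketch.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")
# A phone number is approximated as 7 to 15 digits with an optional leading "+".
PHONE_RE = re.compile(r"(?<![\w.])\+?\d{7,15}(?![\w.])")


def extract_endpoints(strings):
    """Collect candidate endpoints (URLs and phone numbers) from string constants."""
    urls, phones = set(), set()
    for s in strings:
        urls.update(URL_RE.findall(s))
        # Remove URLs first so digits inside a URL are not read as phone numbers.
        phones.update(PHONE_RE.findall(URL_RE.sub(" ", s)))
    return {"urls": sorted(urls), "phones": sorted(phones)}


def is_threat(endpoints, blacklist):
    """Flag the application if any extracted endpoint is known to be malicious."""
    return any(e in blacklist for e in endpoints["urls"] + endpoints["phones"])


# Hypothetical strings, as they might appear in a decompiled application.
sample_strings = [
    "Loading http://evil.example/gate.php now",
    "send to +393331234567",
    "version 1.2.3",
]
found = extract_endpoints(sample_strings)
flagged = is_threat(found, {"http://evil.example/gate.php"})
```

In this sketch the blacklist stands in for whatever endpoint classification is available; the abstract only states that endpoints are categorized by maliciousness, not how that categorization is stored.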
The comparison between applications, aimed at finding correlations, is performed both on the information gathered and on the similarity of the package names: if two applications connect to the same malicious endpoint, they are under the control of the same threat agent; similarly, if two applications follow the suggested guidelines and have similar package names, they have likely been developed by the same company or generated by the same crimeware kit. The final result of our work is a web application based on Elasticsearch, a service that offers the functionality of both an online database and a full-text search engine. Its flexibility allows us to query the stored samples and to discover, in near real time, the clusters of similarity to which a malware sample belongs.
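The two correlation criteria described above (a shared malicious endpoint, or similar package names) can be sketched as a small clustering routine. This is only an illustrative sketch under several assumptions: the sample layout, the function names, the use of difflib's SequenceMatcher as the similarity measure, the 0.8 threshold, and the transitive merging of clusters via union-find are all choices made here for the example and are not taken from the tool itself.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def package_similarity(a, b):
    """Similarity ratio in [0, 1] between two package names (assumed measure)."""
    return SequenceMatcher(None, a, b).ratio()


def cluster_samples(samples, threshold=0.8):
    """Group samples that share a malicious endpoint or have similar package names.

    samples maps a sample id to {"package": str, "malicious_endpoints": set}.
    The threshold value is an assumption for this sketch.
    """
    parent = {sid: sid for sid in samples}

    def find(x):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    ids = list(samples)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            shared = samples[a]["malicious_endpoints"] & samples[b]["malicious_endpoints"]
            similar = package_similarity(samples[a]["package"], samples[b]["package"]) >= threshold
            if shared or similar:
                parent[find(a)] = find(b)  # merge the two clusters

    clusters = defaultdict(set)
    for sid in ids:
        clusters[find(sid)].add(sid)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical samples: a and b share an endpoint, a and c have similar names.
samples = {
    "a": {"package": "com.bank.secure.update", "malicious_endpoints": {"http://evil.example/gate.php"}},
    "b": {"package": "org.free.game", "malicious_endpoints": {"http://evil.example/gate.php"}},
    "c": {"package": "com.bank.secure.updater", "malicious_endpoints": set()},
    "d": {"package": "net.weather.app", "malicious_endpoints": {"http://other.example"}},
}
clusters = cluster_samples(samples)
```

Because clusters are merged transitively, b and c end up together even though they are only linked through a. Whether that is desirable depends on how strictly a cluster of similarity is meant to be read. The pairwise comparison is quadratic in the number of samples, which is acceptable for a sketch but would likely be replaced by indexed queries in a search engine such as Elasticsearch.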
