Abstract
This work studies the Twitter world, aiming at the identification of bots, namely the algorithmically driven accounts that populate this social network. The thesis focuses on bots that mimic human behavior through the interactions they can perform on the platform, identifying three main actions performed by bots (retweet, mention, hashtag) together with the harms they can bring. The dissimilar behaviors of genuine users and content polluters can be spotted through differences in the tweets they post. The first step is therefore the collection of users and tweets through the Twitter streaming API, approaching the problem in an unsupervised way by gathering data in real time. In particular, the data were collected by saving the tweets and the users posting on Twitter from throughout the United States within a specific time window. This is followed by the construction of graphs for every collected user, based on the actions performed: one graph for retweets, one for mentions, and one for hashtags, for each user. A node embedding algorithm was then applied to obtain numerical features, with the goal of finding distinct clusters. This approach did not yield the desired results, probably because the context of application was not optimal. The second phase of the thesis gathers data from previous works in the literature in order to assemble a labeled dataset, so that a supervised approach can be adopted. Starting from these data, an accurate feature extraction pays particular attention to the features based on the aforementioned interactions. The next phase goes deeper into the analysis of the bot category. The thesis proposes a finer level of granularity, departing from many works in the literature that treat the problem of telling bots and humans apart as a binary one (0-1).
It identifies three particular types of bot behavior: bots that massively retweet users in order to endorse someone or something (RETWEETER), bots that reply to and interact with humans in a deceiving way in order to expand their network (MENTIONER), and bots that spam hashtags related to their field in order to promote themselves or the products they sell (PROMOTER). Once these categories are identified, the resulting multi-class dataset is split into training and test sets, and several classification algorithms are trained on it: the best performing is the Random Forest, with good numerical results (average F1 score of 0.868). The final phase of the thesis tests the trained algorithm "in the wild", that is, on an unbiased dataset completely different from the one used in the training/testing phase, in particular the one collected in the first step through the streaming API. An estimation of the Twitter population follows, together with a manual estimate of the real precision of the algorithm on these random data.
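The supervised phase described above (multi-class split, training, and evaluation of the best performing model) can be sketched as follows. This is a minimal illustration assuming scikit-learn; the features and the four classes (human plus the three bot types) are synthetic stand-ins, not the thesis's actual extracted features or dataset.

```python
# Hedged sketch of the supervised phase: train a Random Forest on a
# synthetic 4-class dataset (human, RETWEETER, MENTIONER, PROMOTER)
# and report the macro-averaged F1 score, as the thesis does.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the extracted interaction features.
X, y = make_classification(n_samples=600, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)

# Split the multi-class dataset into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Average (macro) F1 over the classes, the metric reported in the abstract.
macro_f1 = f1_score(y_test, clf.predict(X_test), average="macro")
print(f"macro F1: {macro_f1:.3f}")
```

The macro average weighs every class equally, which is the sensible choice here since the bot sub-classes are likely smaller than the human class.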
Sommario
This work aims to study the Twitter world, targeting the identification of bots, the so-called algorithm-driven accounts that populate this social network. The thesis focuses on bots that imitate human behavior through the interactions the platform allows them, identifying three main actions that bots perform (retweet, mention, hashtag) together with the harms they can cause. The dissimilar behaviors of genuine users and bots can be found in the differences between their tweets. The first step is therefore to collect users and tweets through the streaming API that Twitter provides, so as to approach the problem in an unsupervised way by gathering data in real time. In particular, the data were collected by saving the tweets and the users posting on Twitter from the United States within a given time interval. This is followed by the construction of graphs from the interactions: specifically, for each user one graph is built for retweets, one for mentions, and one for hashtags. A node embedding algorithm is then run to obtain numerical data, with the goal of identifying distinct groups. This approach does not bring the hoped-for results, probably because it is applied in a non-optimal context. The second phase of the thesis therefore gathers data from previous works in the literature, in order to obtain a labeled dataset and adopt a supervised approach. Starting from these data, an accurate feature extraction pays particular attention to those features that consider the aforementioned interactions. The next phase goes deeper into the study of the collected bot category. The thesis proposes a finer level of granularity and departs from many works in the literature that instead treat the problem of recognizing a bot in a binary way (0-1).
It identifies three particular types of bot behavior: bots that massively retweet other users to support someone or something (RETWEETER), bots that reply to and interact with humans in a deceptive way to expand their network of friends (MENTIONER), and bots that spam hashtags to promote themselves or the products they sell (PROMOTER). Once these categories are identified, the resulting multi-class dataset is split into a training set and a test set, and several classification algorithms are trained on it: the best turns out to be the Random Forest, with an average F1 score of 0.868. The final phase of the thesis tests the previously trained algorithm "in the wild", that is, on unbiased data completely different from those used during the training/validation phase, specifically the first dataset obtained through the streaming API. An estimate of the Twitter population then follows, together with a manual estimate of the true precision of the algorithm on these random data.
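The per-user interaction graphs built in the first phase (one for retweets, one for mentions, one for hashtags) can be sketched as edge sets over parsed tweets. This is a minimal illustration; the field names ("retweeted_user", "mentions", "hashtags") are assumptions for the example, not the thesis's actual tweet schema.

```python
# Minimal sketch: for one user, build the three interaction graphs
# (retweet, mention, hashtag) as sets of edges from that user's tweets.

def build_interaction_graphs(user, tweets):
    """Return one edge set per interaction type for the given user."""
    graphs = {"retweet": set(), "mention": set(), "hashtag": set()}
    for t in tweets:
        # Edge from the user to the account they retweeted, if any.
        if t.get("retweeted_user"):
            graphs["retweet"].add((user, t["retweeted_user"]))
        # One edge per mentioned account.
        for m in t.get("mentions", []):
            graphs["mention"].add((user, m))
        # One edge per hashtag used (user-to-hashtag, a bipartite view).
        for h in t.get("hashtags", []):
            graphs["hashtag"].add((user, h))
    return graphs

# Illustrative tweets for a single collected user.
tweets = [
    {"retweeted_user": "alice", "mentions": ["bob"], "hashtags": ["ai"]},
    {"mentions": ["bob", "carol"], "hashtags": ["ai", "ml"]},
]
g = build_interaction_graphs("u1", tweets)
print({k: len(v) for k, v in g.items()})
```

In the thesis these graphs feed a node embedding step; here the edge sets could equally be loaded into a graph library before embedding.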
Analysis and detection of social bots on Twitter mimicking human interests in people or contents
VITALI, LORENZO
2018/2019
https://hdl.handle.net/10589/154497