Modelling, analysis and prediction of information diffusion with graph databases and Apache Spark

Social media have opened a new way to understand how people interact with one another, generating a massive amount of information every day. In recent years, Big Data technologies have given scientists new tools to handle data at this scale, and new approaches have been proposed to address the resulting problems. Information diffusion between communities in social networks has been a key factor in understanding the outcome of events such as elections or referendums in a country; more generally, it can explain how certain clothes or places become fashionable very quickly. Two of the technologies proposed for dealing with this kind of data are Apache Spark and graph databases. Apache Spark can process massive amounts of data in a scalable way, and graph databases make it possible to build complex databases in which information is stored both in nodes and in relationships. The goal of this thesis is to build and test models able to predict information diffusion in social networks. The first phase of the thesis will consist of building the framework used to analyse different models. In this phase, several data collectors will be developed and connected to a graph database, which will allow more complex models to be built. Next, different models for characterizing information diffusion will be analysed and, based on the results, some new models will be proposed. In the final phase, a connection between the graph database and Apache Spark will be implemented in order to run analytics and predictions, using the models and the collected data, on the diffusion and influence of a campaign in social networks.
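To make the notion of a diffusion model on a graph more concrete, the following is a minimal sketch of the independent cascade model, a standard model from the literature. The abstract does not name the models studied in the thesis, so the choice of model is an assumption made here for illustration only. The graph, the user names and the activation probabilities are all hypothetical; in the setting described above, the probabilities would be stored on the relationships of the graph database.

```python
import random


def independent_cascade(edges, seeds, rng=None):
    """Run one simulation of the independent cascade model.

    edges: dict mapping a node to a list of (neighbor, probability) pairs,
           where probability is the chance that the node activates the
           neighbor (hypothetical values, e.g. read from relationships).
    seeds: nodes that hold the information at the start.
    Returns the set of nodes reached by the information.
    """
    rng = rng or random.Random()
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for node in frontier:
            for neighbor, prob in edges.get(node, []):
                # Each newly active node gets a single chance to activate
                # each of its inactive neighbors.
                if neighbor not in active and rng.random() < prob:
                    active.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return active


def expected_spread(edges, seeds, runs=1000, seed=0):
    """Estimate the mean number of reached nodes by Monte Carlo simulation."""
    rng = random.Random(seed)
    total = sum(len(independent_cascade(edges, seeds, rng)) for _ in range(runs))
    return total / runs


# Hypothetical toy network: probabilities of 1.0 and 0.0 make the
# outcome deterministic, so the result is easy to check by hand.
toy_edges = {
    "alice": [("bob", 1.0), ("carol", 0.0)],
    "bob": [("dave", 1.0)],
    "carol": [("erin", 1.0)],
}
reached = independent_cascade(toy_edges, ["alice"])
```

In this toy network the information starting at `alice` reaches `bob` and `dave` but never `carol` or `erin`. With real data the probabilities would lie strictly between 0 and 1, and `expected_spread` would give an estimate of how far a campaign started from a given set of seed users might spread.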
