Community analysis using graph representation learning on social networks

In an increasingly connected world, there is an opportunity to model the dense web of relationships among people in order to discover new and more complex patterns. A graph is the mathematical model best suited to the Web 2.0 era, the era of social networks. It is a natural fit for the interactions on platforms such as Facebook or Instagram: users interact in many different ways, creating posts, giving "likes" and mentioning other users, and in doing so they incrementally build an enormous graph. This graph becomes much bigger when a set of interacting users, a community, is considered as a whole. This work deals with the problem of exploiting the network structure in order to map similarities between users inside communities detected on online social networks. The objective is to define a method that handles the intrinsic heterogeneity of a social network efficiently, so that all the required data can be encoded in a simpler graph model. The method presented makes it possible to extract a "classical" social network, containing only user nodes, from a much more complex network, without losing the information needed to capture user behaviour effectively. Using this approach, two weighted, homogeneous networks are defined: the hashtags network and the mentions network. In the first network, the weights are given by the number of hashtags each pair of users has in common, while in the second they are given by the number of mentions made by each user. A recent technique, known as representation learning, is then applied to these networks in order to describe each user node as a continuous feature vector, which is used to perform classification and clustering.
In the first experiment, the resulting graphs are used to generate features that describe the users inside the network as richly as possible. These features are combined to train a model for a classification task that discriminates between "consumer" and "non-consumer" users. Since the network reduction minimizes the number of nodes, it is also possible to evaluate the influence of a broader set of users on the same classification task: unlabelled users contribute to defining relationships with the labelled ones, increasing the descriptive information carried by the features. In all the tests performed, the baseline, defined using features extracted from the user accounts, is outperformed. In the second experiment, a similar process is developed using an unsupervised method. The objective is to discover sub-communities inside the principal ones, extending the classical problem of community detection. The extracted features are used as input for the K-means algorithm, and the output defines a set of sub-communities that are validated with the help of domain experts. Their feedback, combined with a set of labels extracted from the networks, shows that the users can be divided into meaningful categories, thus confirming the ability of the method to discover hidden patterns.
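The reduction to the two homogeneous user networks can be illustrated with a minimal sketch. It is not the implementation used in this work: the post records, their field names (`user`, `hashtags`, `mentions`) and the function names are hypothetical, hashtag weights are assumed to count distinct shared hashtags, and mention weights are assumed to be directed counts of how often one user mentions another.

```python
from collections import Counter, defaultdict
from itertools import combinations


def hashtag_network(posts):
    """Undirected user-user edges weighted by the number of distinct shared hashtags."""
    tags_by_user = defaultdict(set)
    for post in posts:
        tags_by_user[post["user"]].update(post["hashtags"])
    edges = {}
    # Compare every pair of users once; keep only pairs that share a hashtag.
    for u, v in combinations(sorted(tags_by_user), 2):
        shared = len(tags_by_user[u] & tags_by_user[v])
        if shared:
            edges[(u, v)] = shared
    return edges


def mention_network(posts):
    """Directed user-user edges weighted by how often the source mentions the target."""
    edges = Counter()
    for post in posts:
        for target in post["mentions"]:
            if target != post["user"]:  # assumption: self-mentions are ignored
                edges[(post["user"], target)] += 1
    return dict(edges)


# Hypothetical toy data: post, hashtag and mention nodes collapse into user-user edges.
posts = [
    {"user": "alice", "hashtags": ["coffee", "espresso"], "mentions": ["bob"]},
    {"user": "bob", "hashtags": ["coffee", "latte"], "mentions": ["alice", "carol"]},
    {"user": "carol", "hashtags": ["espresso", "coffee"], "mentions": []},
    {"user": "alice", "hashtags": ["latte"], "mentions": ["bob"]},
]

hashtag_edges = hashtag_network(posts)
mention_edges = mention_network(posts)
```

Both results contain only user nodes, so they can be passed to any method that expects a homogeneous weighted graph. The all-pairs comparison in `hashtag_network` is quadratic in the number of users; a real data set would likely need an inverted index from hashtag to users instead.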

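The clustering step of the second experiment can be sketched in the same spirit. The example below assumes the user feature vectors have already been produced by representation learning; the vectors, the user names and the choice of K = 2 are hypothetical, and the function is a plain Lloyd-style K-means written out for illustration rather than the tool actually used in this work.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    # Component-wise mean of a non-empty list of equal-length vectors.
    return [sum(column) / len(points) for column in zip(*points)]


def kmeans(vectors, k, iterations=100, seed=0):
    """Plain K-means: alternate nearest-centroid assignment and centroid update."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = []
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments are stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = mean(members)
    return labels, centroids


# Hypothetical two-dimensional user embeddings.
embeddings = {
    "alice": [0.9, 0.1],
    "bob": [1.0, 0.2],
    "carol": [0.1, 0.9],
    "dave": [0.0, 1.0],
}
users = list(embeddings)
labels, _ = kmeans([embeddings[u] for u in users], k=2)
sub_communities = dict(zip(users, labels))
```

Each cluster label identifies one candidate sub-community. The labels carry no meaning by themselves, which is why the abstract relies on domain experts and network-derived labels to interpret the groups.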