Social networks-based extraction of relationships between concepts. From social content to knowledge

Social data analysis is now one of the most active fields of interest in Data Science. Social data appear in large volumes and can be easily obtained through the download interfaces offered by most social networks. They contain knowledge, but understanding such knowledge is very difficult due to the noisy nature of social content. A significant amount of work has been dedicated so far to entity extraction, i.e. the discovery of new entities from social content; in contrast, less work has been dedicated to the extraction of relationships between entities. The objective of this work is to extract knowledge, in the form of relationships between concepts, from the plain text of tweets and posts. This cannot be done for an arbitrary knowledge graph, as extraction methods need a focus; therefore, we apply our approach to three specific domains (fashion, TV series, rugby). The method is nonetheless general and can be replicated in other comparable (i.e., small-size) knowledge domains. In the thesis, we show how raw social data are extracted from Twitter and Facebook and turned into simplified texts with identified tokens (hashtags, keywords). Then, such texts are processed using Natural Language Processing (NLP) techniques; tweets and posts are transformed into triples "entity1-relationship-entity2". From these triples, relationships are collected and classified according to different linguistic techniques. The result of the whole process is a domain description with a classification of relationships, represented with frequency histograms. From the collected results, we are able to create a graphical representation of the domain in terms of classified entities and relationships. The whole process neither assumes nor requires any a priori knowledge of the domain of interest, apart from the domain-specific hashtags used for knowledge extraction.

The analysis of social data coming from social networks is one of the fields of greatest interest in Data Science. Social data are available in enormous quantities and are fairly easy to obtain. They are rich in knowledge and information, but these are hard to understand because of the very nature of the data. A considerable amount of work has so far been devoted to entity extraction, that is, the discovery of new entities from the content of social data. Much less work, by contrast, has been done on extracting the relationships that hold between entities. The objective of this work is to extract knowledge, in the form of relationships between concepts, from the text of tweets and posts. This procedure cannot be applied to an arbitrary knowledge graph, since extraction methods require a focus. Consequently, our approach was applied to three specific domains: fashion, TV series, and rugby. The method is nonetheless general and can be replicated in other comparable (small-size) knowledge domains. In the thesis, starting from raw social data extracted from Twitter and Facebook, we produce simplified texts annotated with identifying terms (hashtags, keywords). These texts are then processed with Natural Language Processing (NLP) techniques; tweets and posts are transformed into triples "entity1-relationship-entity2". From these triples, the relationships are then collected and classified according to different linguistic techniques. The result of the whole process is a description of the domain with a classification of relationships, represented with frequency histograms. With these results, it is possible to create a graphical representation of the domain through classes of entities and relationships. The whole process neither assumes nor requires any a priori knowledge of the domain of interest, apart from the use of domain-specific search terms for the initial data extraction.
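
The pipeline summarized above (simplified text, triples, relationship frequencies) can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names, the regular expressions, and the small `RELATION_WORDS` lexicon are all assumptions made for illustration, and the lexicon lookup merely stands in for the NLP techniques that the thesis actually applies.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon of relationship words; a real system would
# obtain relationships from an NLP parser rather than a fixed list.
RELATION_WORDS = {"wears", "plays", "beats", "loves", "stars"}


def simplify(text):
    """Drop URLs and @mentions, keep hashtags as plain words, and tokenize."""
    text = re.sub(r"https?://\S+", " ", text)  # remove links
    text = re.sub(r"@\w+", " ", text)          # remove user mentions
    text = text.replace("#", "")               # '#rugby' becomes 'rugby'
    return re.findall(r"[A-Za-z0-9']+", text)


def extract_triples(tokens):
    """Build (entity1, relationship, entity2) from the neighbours of each relation word."""
    triples = []
    for i, token in enumerate(tokens):
        if token.lower() in RELATION_WORDS and 0 < i < len(tokens) - 1:
            triples.append((tokens[i - 1], token.lower(), tokens[i + 1]))
    return triples


def relationship_histogram(posts):
    """Count how often each relationship occurs across all posts."""
    counts = Counter()
    for post in posts:
        for _, relationship, _ in extract_triples(simplify(post)):
            counts[relationship] += 1
    return counts


if __name__ == "__main__":
    # Invented example posts, not data from the thesis.
    sample_posts = [
        "#Wales beats #England http://t.co/x",
        "@fan Kate wears Prada",
        "Italy beats Scotland",
    ]
    for relationship, count in relationship_histogram(sample_posts).most_common():
        print(relationship, count)
```

Taking only the immediate neighbours of a relationship word as entities is a deliberate simplification. It keeps the sketch dependency-free, at the cost of missing multi-word entities and relationships that a parser would recover.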