Methodology for the continuous updating of a semantic network oriented to sentiment analysis in social media

Blogs, forums and social networks are by now an established part of the Internet. Tens of millions of users rely on these tools to express their opinions and feelings, producing huge volumes of unstructured textual data. These data are an invaluable source of information for anyone who wants to know the Web reputation of a brand of any kind, from products to services, from cities to people. Extracting useful information from this mass of raw data requires sophisticated Natural Language Processing applications; foremost among them is Sentiment Analysis, which aims to recognize the positive or negative sentiment contained in a message written in natural language. Identifying the polarity of the sentiment in a document must also be accompanied by the ability to understand which subject the sentiment refers to and, more interestingly still, which components and qualities of that subject the author's judgment applies to. An effective analysis of a brand's Web reputation requires the sentiment analysis tool to capture the users' average judgment on the different features that describe the brand and that, taken together, form its attractivity model. This thesis outlines a methodology for defining the attractivity model of a brand; the model must satisfy requirements of completeness and accuracy and must always be kept up to date, in line with its own dynamism and that of the domain it belongs to. Identifying the brand's features requires both the domain knowledge of an expert and the automatic work of the tool which, starting from the volume of online discussion, must recognize new features of the brand and suggest to the user where to place them in the model, so as to improve subsequent analyses.
These two tasks are also characteristic of another kind of analysis addressed in this thesis, namely topic trend detection. It consists of two steps: first, detecting words that have recently shown a burst in occurrences; then, attempting to offer the user an interpretation of the topic behind these words and of the event that may have triggered the burst. This work also aims to define a set of tools that help the user understand this kind of information.
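
The first step of topic trend detection, spotting words whose occurrences have recently spiked, can be illustrated with a minimal sketch. The abstract does not say which burst measure the thesis adopts, so the z-score test below is only an assumed stand-in, and the function name `bursty_words`, its parameters `min_count` and `threshold`, and the representation of the data as a list of tokenized time windows are all hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def bursty_words(windows, min_count=3, threshold=2.0):
    """Return (word, score) pairs for words that burst in the latest window.

    `windows` is a chronologically ordered list of token lists, one per time
    window. A word bursts when its count in the last window exceeds its
    historical mean by more than `threshold` standard deviations.
    This scoring rule is an illustrative assumption, not the thesis method.
    """
    if len(windows) < 2:
        return []  # no history to compare against

    counts = [Counter(tokens) for tokens in windows]
    history, latest = counts[:-1], counts[-1]

    bursts = []
    for word, current in latest.items():
        if current < min_count:
            continue  # ignore words that are too rare to be meaningful
        past = [c[word] for c in history]  # Counter yields 0 for unseen words
        mu = mean(past)
        sigma = pstdev(past)
        # Floor the deviation at 1.0 so a flat history does not divide by zero.
        score = (current - mu) / max(sigma, 1.0)
        if score > threshold:
            bursts.append((word, score))

    # Strongest bursts first.
    return sorted(bursts, key=lambda pair: pair[1], reverse=True)
```

A sketch like this covers detection only; the second step described above, suggesting an interpretation of the topic and of the triggering event, would still need the surrounding context of each bursting word and the judgment of the user.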
