Static fuzzy bag-of-words: a new methodology for calculating sentence embeddings

Natural Language Processing (NLP) has attracted considerable attention from the scientific community in recent years thanks to the striking results it has achieved, for example in conversational agents and machine translation systems. A keystone of research in this field has been the introduction of textual embeddings: vectors that represent the semantic meaning of words or sentences in a Euclidean space and that are obtained through unsupervised algorithms taking a huge text corpus as input. The idea underlying embeddings is that words or sentences with similar meanings are mapped to "close" vectors. Measuring the distance between these vectors is therefore as relevant a problem as creating the representations themselves, yet it is often neglected in the literature. In this thesis we first highlight the limits of cosine similarity (the most commonly used similarity measure in this field) and propose alternative measures; we then present a model for sentence embedding based on fuzzy set theory. This algorithm, called Static Fuzzy Bag-of-Words, is evaluated on the Semantic Textual Similarity task and proves competitive with other sentence embedding algorithms. In this context we also show that the similarity measure adopted by Static Fuzzy Bag-of-Words, namely the Fuzzy Jaccard index, achieves better results than cosine similarity when comparing max-pooled vectors, highlighting once again that the choice of similarity measure is a fundamental aspect of evaluating the algorithm itself.
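To make the comparison between the two similarity measures concrete, the following is a minimal sketch rather than the implementation used in the thesis. It assumes the common definition of the fuzzy Jaccard index for non-negative vectors, sum(min(a, b)) / sum(max(a, b)), and it assumes that max-pooled vectors are clipped at zero so that each coordinate can be read as a membership degree. The function names and the toy word vectors are hypothetical.

```python
import numpy as np


def max_pool(word_vectors):
    """Element-wise maximum over a sentence's word vectors.

    Assumption: values are clipped at zero so every coordinate can be
    interpreted as a fuzzy membership degree.
    """
    pooled = np.max(np.asarray(word_vectors, dtype=float), axis=0)
    return np.maximum(pooled, 0.0)


def cosine_similarity(a, b):
    """Cosine of the angle between a and b (0.0 if either vector is zero)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def fuzzy_jaccard(a, b):
    """Fuzzy Jaccard index: size of the fuzzy intersection (element-wise min)
    divided by the size of the fuzzy union (element-wise max)."""
    union = np.maximum(a, b).sum()
    return float(np.minimum(a, b).sum() / union) if union > 0 else 0.0


# Toy 3-dimensional "word vectors" for two short sentences (made-up numbers).
sentence_a = [[0.9, 0.1, 0.0], [0.2, 0.8, -0.3]]
sentence_b = [[0.7, 0.0, 0.1], [0.1, 0.6, 0.4]]

a = max_pool(sentence_a)
b = max_pool(sentence_b)

print("cosine similarity:", cosine_similarity(a, b))
print("fuzzy Jaccard    :", fuzzy_jaccard(a, b))
```

On the same pair of pooled vectors the two measures can return quite different scores: cosine similarity depends only on the angle between the vectors, whereas the fuzzy Jaccard index compares them coordinate by coordinate as overlapping fuzzy sets.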
