This thesis explores the application of clustering techniques to Reddit posts from two Italian subreddits, "Universitaly" and "TeenagerItaly", with the aim of identifying the main themes and analyzing the emotional tone associated with each cluster. To this end, both traditional text processing methods, such as TF-IDF, and advanced machine learning techniques, such as BERT embeddings, were employed. After preprocessing, the data were analyzed with the K-Means and DBSCAN clustering algorithms, whose performance was evaluated using standard metrics such as the Silhouette Score and the Davies-Bouldin Index. The clustering results showed significant differences between the preprocessing methods: TF-IDF produced more general clusters, while BERT generated more specific and coherent clusters, albeit with greater granularity. The sentiment analysis, carried out on each cluster, revealed variations in emotional tone: posts from "Universitaly" showed a predominantly neutral and positive tone, while those from "TeenagerItaly" showed a greater prevalence of negative sentiment. This research contributes to social media analysis by demonstrating the effectiveness of combined clustering and sentiment analysis techniques in extracting meaningful insights from online discussions. Practical implications include applications for community moderation, improving the user experience, and analyzing social dynamics.
Clustering Reddit posts using traditional text processing and BERT embeddings: a comparative analysis and sentiment evaluation
Lima, Simone
2024/2025
Abstract
This thesis explores the application of clustering techniques to Reddit posts from two Italian subreddits, "Universitaly" and "TeenagerItaly," with the goal of identifying prevalent themes and analyzing the sentiment associated with each cluster. The study employs both traditional text processing methods and advanced machine learning techniques, such as BERT embeddings, to preprocess the data before applying the K-Means and DBSCAN clustering algorithms. The performance of these clustering models is evaluated using metrics such as the Silhouette Score and the Davies-Bouldin Index, allowing for a comparative analysis of the different preprocessing approaches. The research also includes a sentiment analysis to better understand the emotional tone within each cluster. The results demonstrate that traditional text processing and BERT embeddings lead to different clustering outcomes, with BERT embeddings typically resulting in fewer, more coherent clusters. The sentiment analysis provides additional insight into the emotional characteristics of each cluster, contributing to a deeper understanding of the discussions within the subreddits analyzed. This study aims to contribute to the field of social media analytics by offering a comprehensive analysis that combines thematic clustering with sentiment evaluation.

File | Size | Format
---|---|---
2025_04_Lima.pdf (accessible online to everyone) | 1.22 MB | Adobe PDF
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/234533