Thousands of news articles are published online every day, both on the official websites of newspapers and on social networks. The ability to determine the similarity between two news articles underpins many news-related studies, because it enables downstream tasks such as grouping news by similarity, building personalized news recommendation systems, or monitoring media coverage of particular topics. In this thesis, we propose new models to determine the similarity between multilingual news articles using state-of-the-art technology based on Transformers. These models were developed during our participation in Task 8 of the SemEval 2022 competition, in which we ranked 4th on the leaderboard that reports the best performance per team. To assess the quality of our models on real data, we focus on a specific case study. We consider five of the main Italian newspapers (Ansa, Repubblica, Il Giornale, La Stampa, Corriere) and collect the first five news items shown each day on their homepages. We use the developed models together with two custom clustering algorithms to determine the similarity among the newspapers throughout 2021. This similarity is computed from the labels of the clusters to which the first five daily news items of each newspaper belong. Over the period examined, all pairs of newspapers show considerable daily variation in their similarity. It is nevertheless possible to get an idea of how similar the editorial lines of the various newspapers are by looking at the mean and median of the daily distances computed over the entire period. Il Giornale shows the greatest distance from the other newspapers, suggesting that it follows an editorial line that differs from the others. In general, the language models developed proved able to preserve the significant semantic differences present in the articles.
Development of new algorithms for computing news similarity and their application in a case study considering five Italian newspapers
Tasca, Thomas
2021/2022
File | Size | Format
---|---|---
Development_of_news_similarity_algorithms_and_their_application_in_determining_the_similarity_between_five_Italian_newspapers.pdf (authorized users only, starting from 04/07/2025) | 1.23 MB | Adobe PDF
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/189652