Identifying fake news by learning to predict whether textual evidence supports or refutes its claims

This thesis addresses the problem of "seeking the truth", a broad problem with far-reaching implications in a technology-driven world. The problem is examined in different contexts, from a data quality perspective and from a journalistic one, with a constant focus on how recent technologies such as machine learning can be used to address it. The root problem is analyzed through a review of the state of the art and of the approaches proposed over the years, while my own work focuses on the challenge of automating fact checking. Fact checking is an important journalistic practice that aims to assess the factuality of claims through investigation. It is particularly crucial in political discourse, where incorrect opinions expressed as facts may lead to a distorted perception of the truth. Manual fact checking is laborious and is usually considered difficult for non-professionals; in general, a team of fact checkers is in charge of providing a definitive judgment on a given claim. In this document I investigate to what extent the manual process can be automated. In particular, my goal is to show that it is possible to build a system that can be used in a real-world context to provide useful metrics to fact checkers in one or more of the subtasks into which the problem can be split. In this work I compare several word and sentence embeddings, as well as language models, and apply them to the problem. This document gives a thorough account of the results I achieved while working on this task.

This thesis addresses the problem of "seeking the truth" applied in a digital context, a very broad problem with various implications and a large impact in the modern era. The topic has been studied differently depending on the context in which it is posed: it has been analyzed as a data quality problem in a Big Data setting and as a problem within a journalistic process, namely fact checking. The root of the problem has been studied by analyzing the state of the art and the evolution of the solutions over the years. My work in this thesis consists of attempting to solve the challenge of automating the fact-checking process. Fact checking is an important journalistic practice that investigates the correctness of public claims. It is crucial in several contexts, such as politics, where incorrect opinions expressed as facts can lead to a mistaken perception of reality. When carried out manually, this process is slow and complex; it is usually performed by a team of experts tasked with delivering a definitive judgment on each claim in doubt. In this document I analyze to what extent such verification can be automated, trying to show whether it is possible to build a system that can be applied in a real-world context to support the fact-checking team with useful metrics that make their work easier. With this thesis I had the opportunity to compare several techniques in the area of "Deep Learning for NLP", such as embeddings and language models, and to apply them to a real problem. This document provides a detailed explanation of all the results I obtained while working on this task.