Towards multi-granular explainable AI: increasing the explainability level for deepfake detectors
Alongi, Francesco
2020/2021
Abstract
Deepfake videos are synthetic AI-generated media in which the facial traits of a target person are superimposed onto the face of a source person. These videos are mostly generated using Deep Learning algorithms, and consequently the output resolution of these generators depends on the quality of the data used to train those models. Nowadays it is easy to find many high-quality pictures of celebrities and politicians, making them a convenient target for this kind of forgery. Powerful deepfake detectors have been developed alongside deepfake generators, but the rationale that leads a detector to a specific prediction is not yet entirely clear. In this work, we propose a multi-granular approach that increases the explainability level of a deepfake detector. In particular, we show that the interpretation of a binary deepfake detector prediction can be eased by employing three approaches with different output granularity: image segmentation, multiclass classification, and image captioning. We also propose architectural enhancements to the existing CoAttention captioning model, making it more scalable and efficient while improving its performance as measured by BLEU scores. We also design different workflows to create effective datasets used to show the feasibility of our approach. Finally, we present the crowdsourcing approach we used to collect a dataset of human-generated text, containing statements explaining why a human might consider a certain video frame to be a deepfake.
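As an illustration of how caption quality can be "measured by BLEU scores", the minimal sketch below compares a hypothetical model-generated explanation against hypothetical crowdsourced reference statements. It assumes NLTK is available and is not the evaluation code from the thesis; the example sentences are invented for demonstration only.

```python
# Illustrative sketch only: scoring a generated caption against human references
# with BLEU, assuming the NLTK library. Not the thesis's actual evaluation code.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hypothetical human-written explanations of why a frame looks deepfake
references = [
    "the skin texture around the cheeks looks unnaturally smooth".split(),
    "the blending boundary near the jawline is visibly blurred".split(),
]

# Hypothetical caption produced by a captioning model
candidate = "the skin around the cheeks looks too smooth".split()

# Smoothing avoids zero scores when some n-gram orders have no overlap
smoothing = SmoothingFunction().method1
score = sentence_bleu(references, candidate, smoothing_function=smoothing)
print(f"BLEU: {score:.3f}")
```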
File | Size | Format
---|---|---
2021_10_Alongi.pdf (Final version of the thesis; openly accessible on the internet) | 11.21 MB | Adobe PDF
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/180027