Synthetic speech detection through convolutional neural networks in noisy environments

Deepfakes are now familiar to a wide audience, from forensic analysts to teenagers on social media. Social networks play a major role in spreading this artificially generated media, and for a reason: video, image, and speech deepfakes are entertaining and easily attract attention because of their remarkable resemblance to the real people they mimic. Alongside the genuine interest in this technology for creating entertainment content, there is the interest driven by the malicious purposes it may serve. Because of these malicious uses, phenomena such as impersonation are likely to spread worldwide at increasing speed, in step with the rapid creation of new media synthesis algorithms. Continuous advances in Machine Learning further support these purposes. Understandably, the rise of deepfake techniques has been accompanied by much deeper investigation of synthetic media detection. Video deepfake detection has the richest literature in this field. By contrast, audio deepfake detection has been investigated less and needs more attention. In this thesis we propose a method based on Convolutional Neural Networks (CNNs) for classifying synthetic speech excerpts in noisy environments. The proposed system consists of a DnCNN that performs denoising as a pre-processing step, followed by a VGGish convolutional network that acts as the classifier. The two networks are jointly trained in an end-to-end framework. Our results confirm our expectations, showing that the end-to-end approach outperforms solutions in which denoising and classification are handled separately.

Nowadays, deepfakes are very well known to many people, from forensic analysis experts to young people on social media. Social networks play an important role in spreading this multimedia content, and for a reason. Video, image, and audio deepfakes are highly entertaining and easily arouse curiosity because of their striking resemblance to the people they imitate. To the genuine interest in these files for creating entertainment content, we must add the interest driven by the malicious purposes for which this technology can be used. Because of these illegal aims, phenomena such as identity swapping have the potential to spread very rapidly around the world, with a fast increase in the creation of algorithms for synthesizing multimedia content. Moreover, continuous developments in the field of Machine Learning are a powerful tool for these purposes. Understandably, the growth of techniques for generating deepfakes coincides with more investigation into the detection of artificially created multimedia files. The literature on video deepfake detection is the richest in this area. By contrast, the detection of audio deepfakes has received less coverage and needs more attention. In this thesis we propose a method for classifying speech tracks in noisy environments based on convolutional neural networks (CNNs). The proposed system consists of a DnCNN used as a preliminary noise reducer, followed by a VGGish convolutional network that acts as the classifier. The two networks are trained jointly in an end-to-end structure. The results confirm our expectations, showing that the end-to-end approach outperforms by a wide margin solutions based on noise reduction performed separately from classification.
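
The end-to-end idea summarized above, a denoiser feeding a classifier with both trained against one objective, can be illustrated with a minimal PyTorch sketch. This is not the thesis implementation: the class names (`ResidualDenoiser`, `SmallVGGClassifier`, `EndToEndDetector`), the layer sizes, the spectrogram-like input shape, and the use of a classification loss alone are all assumptions made for illustration. The denoiser only imitates the residual-learning style of DnCNN, and the classifier is a small VGG-style stack rather than the actual VGGish network.

```python
import torch
import torch.nn as nn


class ResidualDenoiser(nn.Module):
    """DnCNN-style denoiser (hypothetical sizes): predicts the noise, then subtracts it."""

    def __init__(self, channels: int = 16, depth: int = 4):
        super().__init__()
        layers = [nn.Conv2d(1, channels, 3, padding=1), nn.ReLU()]
        for _ in range(depth - 2):
            layers += [
                nn.Conv2d(channels, channels, 3, padding=1),
                nn.BatchNorm2d(channels),
                nn.ReLU(),
            ]
        layers.append(nn.Conv2d(channels, 1, 3, padding=1))
        self.body = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual learning: the network estimates the noise component.
        return x - self.body(x)


class SmallVGGClassifier(nn.Module):
    """VGG-style conv + max-pool blocks with a linear head (stand-in for VGGish)."""

    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x).flatten(1))


class EndToEndDetector(nn.Module):
    """Chains denoiser and classifier so one loss updates both networks."""

    def __init__(self):
        super().__init__()
        self.denoiser = ResidualDenoiser()
        self.classifier = SmallVGGClassifier()

    def forward(self, noisy_spec: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.denoiser(noisy_spec))


def train_step(model, optimizer, noisy_spec, labels) -> float:
    """One joint update: gradients of the classification loss reach the denoiser too."""
    model.train()
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(noisy_spec), labels)
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    torch.manual_seed(0)
    model = EndToEndDetector()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    # Assumed input: a batch of 4 single-channel time-frequency patches.
    batch = torch.randn(4, 1, 64, 96)
    labels = torch.tensor([0, 1, 0, 1])  # assumed encoding: 0 = real, 1 = synthetic
    print(train_step(model, optimizer, batch, labels))
```

In a disjoint setup, by contrast, the denoiser would be trained on its own reconstruction objective and then frozen before the classifier is trained. In the joint setup sketched here, a single optimizer covers the parameters of both networks, so the denoiser is shaped by what helps classification.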