Closed and open set classification of real and AI synthesised speech

The last decade has seen major advances in artificial intelligence techniques, which have led to the spread of computer-generated multimedia content (images, audio and video) so realistic that it is difficult to tell apart from original content of the same kind. While AI-generated content has interesting applications, it can also be used in dangerous and deceptive ways, for example as evidence in a court of law. It is therefore increasingly urgent to find automatic ways to distinguish AI-synthesised content from original content. In this paper we consider audio content, and in particular speech, for which forgery is an especially sensitive issue. We extend previous research on bispectral analysis in order to create more general automatic methods for distinguishing real speakers from AI-synthesised speech. The dataset of voices we used is large and heterogeneous, consisting of both real voices and voices synthesised with a variety of methods. We extracted the bicoherence from all the speech recordings and performed several classifications (both multi-label classifications, which consist in distinguishing each class of voices from all the others, and binary classifications between real and fake voices) using various machine learning techniques, such as support vector machines, logistic regression and random forests. In particular, once the bicoherences had been computed from the audio files, we performed the following tests.
First, we replicated the tests from previous works by extracting from the bicoherences a set of features consisting of the mean, variance, skewness and kurtosis of both the magnitudes and the phases of the bicoherences, and classifying them with simple multi-label and binary classifications using a support vector machine, a series of logistic regressors and a random forest. Then we simulated an open set environment using a series of support vector machines, in order to test the model with data not seen in the training phase. Moreover, we used a series of U-Nets to extract a new set of features and classified them with simple multi-label and binary classifications. Finally, we concatenated the two sets of features above and performed further classifications with them (in this case also in an open set environment), and we show that this method gave the best results. We hope that these results can better clarify the role of bispectral analysis in distinguishing between real and fake speech recordings, and lead to more research in the field of multimedia forensics.
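As an illustration of the first set of features (mean, variance, skewness and kurtosis of the bicoherence magnitudes and phases), a minimal sketch is given below. It is not the implementation used in this work: the segment length, hop size, Hann window, normalisation of the bicoherence, and the function names `bicoherence`, `four_moments` and `bicoherence_features` are all assumptions made for the example.

```python
import numpy as np


def bicoherence(signal, seg_len=64, hop=32):
    """Estimate the bicoherence of a 1-D signal by averaging over segments.

    seg_len, hop and the Hann window are illustrative assumptions.
    """
    signal = np.asarray(signal, dtype=float)
    if len(signal) < seg_len:
        raise ValueError("signal is shorter than one segment")
    window = np.hanning(seg_len)
    starts = range(0, len(signal) - seg_len + 1, hop)
    # Short-time spectra, one row per segment: shape (K, seg_len)
    spectra = np.array([np.fft.fft(window * signal[s:s + seg_len]) for s in starts])
    half = seg_len // 2
    idx = np.arange(half)
    # Frequency index grids; w1 + w2 never exceeds seg_len - 2
    w1, w2 = np.meshgrid(idx, idx, indexing="ij")
    x1 = spectra[:, w1]        # X_k(w1), shape (K, half, half)
    x2 = spectra[:, w2]        # X_k(w2)
    x12 = spectra[:, w1 + w2]  # X_k(w1 + w2)
    num = np.sum(x1 * x2 * np.conj(x12), axis=0)
    den = np.sqrt(np.sum(np.abs(x1 * x2) ** 2, axis=0)
                  * np.sum(np.abs(x12) ** 2, axis=0))
    # Small constant avoids division by zero; magnitude stays in [0, 1]
    return num / (den + 1e-12)


def four_moments(values):
    """Mean, variance, skewness and (non-excess) kurtosis of an array."""
    values = np.ravel(values)
    mean = values.mean()
    centered = values - mean
    var = np.mean(centered ** 2)
    skew = np.mean(centered ** 3) / (var ** 1.5 + 1e-12)
    kurt = np.mean(centered ** 4) / (var ** 2 + 1e-12)
    return np.array([mean, var, skew, kurt])


def bicoherence_features(signal):
    """8-dimensional descriptor: four moments of magnitude, four of phase."""
    b = bicoherence(signal)
    return np.concatenate([four_moments(np.abs(b)), four_moments(np.angle(b))])
```

Each recording would then be represented by one such 8-dimensional vector, which can be passed to any of the classifiers mentioned above.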

In the last decade, great strides have been made in the development of artificial intelligence techniques, enabling the large-scale spread of artificially generated multimedia content, such as audio, images and video, that is so realistic as to be practically impossible to distinguish from original multimedia content of the same type. Although this algorithmically generated material has many interesting applications, it can also be used in improper and dangerous ways, for example as evidence in a court of law. For this reason, it is increasingly urgent to find automatic ways to distinguish multimedia content generated by artificial intelligence from original content. In this thesis we consider audio content, and in particular speech, which is a very sensitive matter when it is “forged”. We extend the research carried out in previous works on bispectral analysis in order to create automatic methods, as general as possible, for distinguishing synthetic speech from speech recorded by real speakers. The dataset we used is large and heterogeneous, and consists of both “real” voices and voices generated algorithmically with various synthesis methods. We extracted the bicoherences from all the speech recordings and attempted several classifications (both multi-label classifications, which consist in distinguishing each class from all the others, and binary classifications, to distinguish real voices from fake ones) using various machine learning methods such as support vector machine, logistic regression and random forest.
In particular, once the bicoherences had been computed, we performed the following tests. First, we replicated the experiments from previous works by extracting descriptors (consisting of the first four statistical moments) from both the magnitudes and the phases of the bicoherences and classifying them with simple classifications (both multi-label and binary) using a support vector machine, a series of logistic regressions and a random forest. Next, we simulated an open set environment using a series of support vector machines, in order to test the model with data never seen in the training phase. We also used neural networks called U-Nets to extract a new set of descriptors, which we tried to classify with simple multi-label and binary classifications. Finally, we concatenated the two sets of descriptors above and tried to classify them, this time also simulating an open set environment, and we show that this approach gives the best results. We hope that this work can better clarify the role of bispectral analysis in distinguishing real speech from speech synthesised by artificial intelligence, and can pave the way for further research in the field of audio forensics.
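One possible way to simulate an open set environment with a series of support vector machines is sketched below. This is an assumption about how such a setup could look, not a description of the procedure used in this work: it trains one binary SVM per known class and labels a sample as unknown when no SVM scores it above a threshold. The function names, the RBF kernel, the threshold value and the use of `-1` as the unknown label are all hypothetical, and class labels are assumed to be integers.

```python
import numpy as np
from sklearn.svm import SVC


def fit_one_vs_rest(X, y):
    """Train one binary SVM per known class (that class vs. all the others)."""
    models = {}
    for label in np.unique(y):
        clf = SVC(kernel="rbf", gamma="scale")  # kernel choice is an assumption
        clf.fit(X, (y == label).astype(int))
        models[label] = clf
    return models


def predict_open_set(models, X, threshold=0.0, unknown=-1):
    """Assign the best-scoring known class, or `unknown` if every score is low."""
    labels = np.array(list(models.keys()))
    # One column of signed decision scores per known class
    scores = np.column_stack([models[l].decision_function(X) for l in labels])
    pred = labels[scores.argmax(axis=1)]
    # Reject samples that no per-class SVM claims with enough confidence
    pred[scores.max(axis=1) < threshold] = unknown
    return pred
```

To mimic unseen data, one could fit the models on all but one synthesis method and then evaluate on recordings from the held-out method, counting how many of them are rejected as unknown.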