Fusion of speech separation time/frequency masks using deep neural networks

Speech can be separated from a mixture signal in the time/frequency domain, for example, by determining an appropriate weight for each time/frequency element. These weights, which together form the so-called time/frequency mask, can be derived from various aspects of the signal using various algorithms. However, the performance of traditional separation methods is not satisfactory, especially for single-channel mixtures. Recently, the fusion of time/frequency masks has been introduced to overcome the limitations of each individual mask. The aim of this study is therefore to combine the outputs of traditional speech separation methods using Feedforward Neural Networks (FNN). To improve separation performance over the individual separation methods, we provide a broad investigation of possible FNN models and architectures. Furthermore, we propose new perceptually weighted cost functions to overcome the limitations of classic objective functions. Finally, we evaluate both objective and subjective performance with two different tests. The objective evaluation uses the classic Blind Audio Source Separation (BASS) performance measures, on which some FNN combiners perform better than the individual separation methods, especially regarding the attenuation of background interference. The subjective evaluation, a MUSHRA (MUltiple Stimuli with Hidden Reference and Anchor)-like listening test, shows that fusion with FNN does not perform better than the current baseline fusion, indicating that the perceptual cost function and the FNN architecture are not yet sufficient to yield a relevant improvement in perceived sound quality.

The separation of the speech signal from a mix of sound sources can be carried out in the time/frequency domain by determining a weight for each time/frequency element of the mix. The value of each weight can be defined using various aspects of the signal and various algorithms, yielding the so-called time/frequency mask. Nevertheless, the performance of typical separation algorithms is not satisfactory, particularly in the single-channel scenario. Recently, the fusion of several time/frequency masks has been introduced to overcome the limitations that arise when each mask operates on its own. The aim of this research is therefore to combine the output signals of typical speech separation methods using feedforward neural networks (FNN). To obtain better performance than each of the combined algorithms, a broad investigation of possible FNN architectures and models is carried out. In addition, new cost functions that take the perceptual aspect of sound into account have been introduced, with the goal of overcoming the limitations of commonly used cost functions. To evaluate both objective and subjective measures, we considered two different tests. The objective evaluation was carried out using the classic Blind Audio Source Separation (BASS) performance measures, on which some neural-network-based combiners showed superior results, in particular regarding interference from other sources present in the speech signal. For the subjective evaluation, a listening test based on the MUSHRA (MUltiple Stimuli with Hidden Reference and Anchor) test was conducted, which shows that the results obtained with FNN-based fusion do not surpass those of the current state-of-the-art fusion method.
This result confirms that the perceptual cost function and the use of FNNs are not yet enough to obtain a satisfactory result in terms of perceived sound quality.
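
The idea of fusing time/frequency masks with a feedforward network can be illustrated with a minimal NumPy sketch. Everything specific in it is an assumption made for illustration and is not taken from the study: the network has a single hidden layer with tanh units, it sees only the K mask values of one time/frequency bin at a time, the weights are untrained placeholders, and the "perceptual" cost is simply a mean squared error weighted per frequency bin by a hypothetical vector `freq_weights`.

```python
import numpy as np

def fuse_masks_fnn(masks, W1, b1, W2, b2):
    """Fuse K time/frequency masks with a one-hidden-layer feedforward net.

    masks: array of shape (K, F, T) with values in [0, 1], one mask per
           individual separation method (hypothetical input layout).
    Returns a fused mask of shape (F, T) with values in (0, 1).
    """
    K, F, T = masks.shape
    x = masks.reshape(K, F * T).T        # one row per T/F bin, K features
    h = np.tanh(x @ W1 + b1)             # hidden layer, shape (F*T, H)
    z = h @ W2 + b2                      # output pre-activation, (F*T, 1)
    y = 1.0 / (1.0 + np.exp(-z))         # sigmoid keeps the mask in (0, 1)
    return y.reshape(F, T)

def perceptual_mse(est_mask, target_mask, freq_weights):
    """Frequency-weighted MSE; freq_weights (length F) is an assumed
    stand-in for a perceptual weighting, not the study's cost function."""
    err = (est_mask - target_mask) ** 2
    return np.mean(freq_weights[:, None] * err)

def apply_mask(mixture_stft, mask):
    """Weight each time/frequency element of the mixture by the mask."""
    return mask * mixture_stft

# Toy usage with random data and untrained weights.
rng = np.random.default_rng(0)
K, F, T, H = 3, 5, 4, 8
masks = rng.random((K, F, T))
W1, b1 = rng.standard_normal((K, H)), np.zeros(H)
W2, b2 = rng.standard_normal((H, 1)), np.zeros(1)
fused = fuse_masks_fnn(masks, W1, b1, W2, b2)
mixture = rng.standard_normal((F, T)) + 1j * rng.standard_normal((F, T))
separated = apply_mask(mixture, fused)
```

In practice the weights would be trained by minimizing a cost such as `perceptual_mse` against a reference mask, and the separated spectrogram would be transformed back to the time domain; both steps are omitted here.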