Study of multiple dictionaries in exemplar-based NMF for speech enhancement

Speech enhancement has grown in importance in recent years and has become an important research topic because many everyday applications require it. Speech enhancement and noise reduction aim to improve the speech quality, intelligibility and overall perceptual clarity of a noisy signal by removing the unwanted noise. Traditional noise reduction techniques, such as Wiener filtering or Spectral Subtraction, do not work satisfactorily in the presence of real, non-stationary background noise. To overcome this problem, we use a different technique, Non-Negative Matrix Factorization (NMF), jointly with a sparse representation method. NMF is a class of algorithms in which a matrix V is factorized into two matrices, W and H, with the property that all three have no negative elements. NMF has applications in many fields, such as computer vision, document clustering and recommender systems, but also in spectral analysis, where it has become widely used as a source separation technique. Recently, NMF has also been applied to estimate clean speech from a noisy observation. We use this technique to obtain the activation (weight) matrix H for a matrix W, called the dictionary, such that the product WH approximates the noisy observation V. Once obtained, the activation matrix H can be used, together with a dictionary W, to reconstruct an enhanced version of the noisy input signal. The purpose of this thesis is to give a proof of concept for the later development of a more general real-time sparse NMF algorithm for the enhancement of any speech signal. The algorithm devised is able to reconstruct an enhanced version of any speech signal corrupted by real, non-stationary background noise, using different training signals. In particular, we investigate the importance, in the factorization step, of the training dictionary obtained from the training signal. We use three approaches. 
The first consists in using the noisy and the clean version of the same utterance for the NMF decomposition and for the reconstruction, respectively. The second approach uses the clean speech directly for both the NMF factorization and the reconstruction. The third approach also uses the clean speech for both tasks, but with an additional refinement of the speech dictionary based on vocal features. To do so, we investigate the fundamental aspects to consider for the NMF factorization and the enhanced reconstruction of a noisy observation, such as the dictionary size, the dimension of the bases and the sparsity constraint. Comparing different settings of these parameters shows that the second approach, which uses clean dictionaries, obtains the best results. However, with a more specific study of the vocal feature extraction, the third approach could be faster than the current best approach while being just as good.
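
As an illustration of the factorization step described above, the following minimal sketch estimates the activation matrix H for a fixed dictionary W so that WH approximates V. It assumes the multiplicative update rule for the generalized Kullback-Leibler divergence with an L1 sparsity penalty on H, which is a common choice for sparse NMF but is not confirmed here as the exact cost function used in the thesis. The function name and the parameters `sparsity`, `n_iter` and `eps` are hypothetical.

```python
import numpy as np

def sparse_nmf_activations(V, W, sparsity=0.1, n_iter=200, eps=1e-12, seed=0):
    """Estimate non-negative activations H with V ~= W @ H, keeping W fixed.

    Assumption: generalized KL divergence plus an L1 penalty on H,
    minimized with multiplicative updates (not necessarily the thesis's cost).
    V: (n_features, n_frames) non-negative observation.
    W: (n_features, n_atoms) non-negative dictionary.
    """
    rng = np.random.default_rng(seed)
    # Strictly positive random start so that the updates can move every entry.
    H = rng.random((W.shape[1], V.shape[1])) + eps
    # W^T 1: column sums of the dictionary, one per atom.
    col_sums = W.sum(axis=0)[:, None]
    for _ in range(n_iter):
        ratio = V / (W @ H + eps)
        # The update multiplies by a non-negative factor, so H stays non-negative.
        H *= (W.T @ ratio) / (col_sums + sparsity + eps)
    return H
```

A larger `sparsity` value pushes more activations towards zero, so fewer dictionary atoms are used to explain each frame; with `sparsity=0` the update reduces to the plain KL multiplicative rule.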

In recent years speech enhancement has gradually grown in importance, becoming a major research topic by virtue of its many applications in everyday life. Speech enhancement and noise reduction aim to improve the voice quality, intelligibility and perceived clarity of a noisy signal through a variety of techniques for removing unwanted noise. Traditional noise reduction methods, such as Wiener filtering or Spectral Subtraction, do not give satisfactory results in the presence of real, non-stationary background noise. To overcome this limitation, we decided to use a different technique based on Non-Negative Matrix Factorization (NMF), together with a sparse representation method. NMF is a group of algorithms in which a matrix V is factorized into two matrices W and H, where all three matrices are composed of non-negative elements. The purpose of this work is to provide an adequate theoretical basis for the later development of a more general real-time sparse NMF algorithm able to enhance the speech in a voice signal. An algorithm conceived in this way is able to reconstruct an enhanced version of any speech signal degraded by real, non-stationary background noise, using different recordings in the training phase. In particular, we analyse the importance of the dictionary obtained in this training phase for the subsequent non-negative factorization (NMF). Three distinct approaches are adopted. The first consists in using the same sentence recorded at two different times, first without disturbances and then in a noisy environment, for the reconstruction and for the decomposition, respectively. The second approach uses the clean recording directly for both the decomposition and the reconstruction. 
Finally, the third approach also uses the noise-free recording for both tasks, but additionally relies on the extraction of some vocal features to further refine the dictionary obtained in the training phase. To this end, we investigated the fundamental aspects to consider in order to ensure the best factorization and reconstruction, such as the method used to select the bases and their dimension, the size of the dictionaries used and the sparsity constraints. By combining different configurations of these factors within the three approaches, we conclude that the best solution is the second of the three methods applied. However, through a more specific study of the extraction and classification of vocal features, it will be shown that the third approach can also lead to the same results while being faster than the current method.
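
The first approach pairs a noisy and a clean recording of the same utterance. A minimal sketch of how such parallel exemplar dictionaries might be built is given below. It assumes that both recordings are available as time-aligned magnitude spectrograms of equal shape and that each dictionary atom is a randomly chosen window of consecutive frames stacked into one column. The function name, the parameters `n_atoms` and `frames_per_atom`, and the random sampling strategy are illustrative assumptions, not the selection procedure of the thesis.

```python
import numpy as np

def build_coupled_dictionaries(S_noisy, S_clean, n_atoms, frames_per_atom, seed=0):
    """Sample the same time windows from parallel noisy/clean spectrograms.

    Hypothetical helper: random exemplar selection is an assumption.
    S_noisy, S_clean: (n_bins, n_frames) non-negative, time-aligned spectrograms.
    Returns (W_noisy, W_clean), each of shape (n_bins * frames_per_atom, n_atoms).
    """
    if S_noisy.shape != S_clean.shape:
        raise ValueError("spectrograms must be time-aligned and of equal shape")
    n_frames = S_noisy.shape[1]
    n_windows = n_frames - frames_per_atom + 1
    if n_atoms > n_windows:
        raise ValueError("not enough frames for the requested dictionary size")
    rng = np.random.default_rng(seed)
    # Distinct start frames, shared by both dictionaries so that atom k of
    # W_noisy and atom k of W_clean describe the same stretch of speech.
    starts = rng.choice(n_windows, size=n_atoms, replace=False)

    def stack(S):
        # Flatten each window frame by frame (column-major) into one atom.
        atoms = [S[:, s:s + frames_per_atom].reshape(-1, order="F") for s in starts]
        return np.stack(atoms, axis=1)

    return stack(S_noisy), stack(S_clean)
```

Under these assumptions, `n_atoms` plays the role of the dictionary size and `frames_per_atom` the dimension of the bases. Activations H estimated against `W_noisy` could then be combined with the clean dictionary as `W_clean @ H` to form the enhanced estimate, whereas the second approach would use `W_clean` for both steps.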