Methods for providing input gain robustness to DNN-based real-time speech processing systems
Ozcan, Yilmaz Ugur
2023/2024
Abstract
Input gain variations can drastically affect the output of deep neural network (DNN)-based real-time speech processing systems. These variations may stem from several factors, such as changes in the speaker’s distance from the microphone, varying levels of background noise, and the use of microphones with different sensitivities. Such output variations degrade the reliability of DNN-based solutions; input gain robustness is therefore essential for consistent performance in real-time applications. This thesis aims to improve the robustness of DNN-based real-time speech processing systems against input gain variations by investigating three approaches: Gain-Augmented Training, Differential Features, and Smoothed Frame Normalization. Gain-Augmented Training expands the training dataset with varying gain levels to expose the DNN to a broader range of input amplitudes during training. The Differential Features technique converts the input data into a feature representation that is less affected by gain variations by focusing on relative changes in the signal rather than its absolute amplitude. Smoothed Frame Normalization applies a real-time normalization technique that adjusts the input signal to a consistent level before processing, so that the DNN receives a stably normalized signal. The three approaches are evaluated with a DNN-based Voice Activity Detection system, which is trained on a standard speech training set and evaluated on a different, more challenging test set. The experimental evaluation shows that each method enhances the robustness of DNN-based speech processing systems against input gain variations by keeping the DNN output consistent, thereby improving performance when different input gain levels are introduced.

File | Description | Access | Size | Format
---|---|---|---|---
2024_07_Ozcan_Thesis_01.pdf | Text of the Thesis | not accessible | 8.51 MB | Adobe PDF
2024_07_Ozcan_Executive Summary_02.pdf | Executive Summary | not accessible | 715.37 kB | Adobe PDF
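The three approaches named in the abstract can be illustrated with a minimal NumPy sketch. This is not the thesis implementation, which is not available in this record. Every function name, the frame length, the dB range, the smoothing factor `alpha`, and the choice of log-energy as the base feature are assumptions made only for illustration.

```python
import numpy as np


def apply_random_gain(x, rng, low_db=-20.0, high_db=20.0):
    """Gain augmentation (assumed form): scale a waveform by a random gain in dB."""
    gain_db = rng.uniform(low_db, high_db)
    return x * 10.0 ** (gain_db / 20.0)


def frame_signal(x, frame_len=256):
    """Split a 1-D signal into non-overlapping frames, dropping the remainder."""
    n_frames = len(x) // frame_len
    return x[: n_frames * frame_len].reshape(n_frames, frame_len)


def log_energy(frames, eps=1e-12):
    """Log of the mean squared amplitude of each frame."""
    return np.log(np.mean(frames ** 2, axis=1) + eps)


def differential_features(frames):
    """Differential features (assumed form): frame-to-frame change in log-energy.

    A constant input gain adds the same offset to every log-energy value,
    so the difference between consecutive frames cancels it.
    """
    return np.diff(log_energy(frames))


def smoothed_frame_normalize(frames, alpha=0.9, eps=1e-12):
    """Smoothed frame normalization (assumed form): divide each frame by a
    causal, exponentially smoothed estimate of the frame RMS level."""
    out = np.empty_like(frames)
    level = None
    for i, frame in enumerate(frames):
        rms = np.sqrt(np.mean(frame ** 2))
        # Initialise from the first frame, then smooth recursively.
        level = rms if level is None else alpha * level + (1.0 - alpha) * rms
        out[i] = frame / (level + eps)
    return out
```

Under these assumptions, multiplying the input by a constant gain leaves both `differential_features` and `smoothed_frame_normalize` essentially unchanged, whereas `apply_random_gain` deliberately changes the level so that a model sees many amplitudes during training.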
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/223344