Fine-tuning Wav2Vec 2.0 for Italian Speech
BONAZZA, ELIA
2020/2021
Abstract
This thesis aimed to develop an Automatic Speech Recognition (ASR) system for the Italian language. Traditional ASR systems need aligned input-output pairs to train the acoustic model; this applies to both Gaussian mixture model/hidden Markov model (GMM-HMM) and deep neural network/hidden Markov model (DNN-HMM) systems, and it requires a time-consuming pre-processing phase. More recently, end-to-end neural networks have combined the language model and the acoustic model in a single network; this approach removes the need for an aligned corpus and thus allows faster and more flexible training, including on different languages, without modifying the model's structure. In our experiments, we used a model similar to Facebook's wav2vec; its peculiarity is the absence of recurrent neural networks in favour of convolutional ones. Our trained model achieved a Word Error Rate (WER) of 22.1% on our test set, which consists of about 315 hours of audio.

File | Size | Format
---|---|---
Tesi magistrale Bonazza Elia.pdf (accessible online only to authorized users) | 5.4 MB | Adobe PDF
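
The Word Error Rate quoted in the abstract is conventionally defined as the word-level edit distance (substitutions, deletions and insertions) between the reference transcript and the system output, divided by the number of reference words. The sketch below illustrates that standard definition for a single utterance; it is not taken from the thesis. The function name `word_error_rate`, the whitespace tokenization and the Italian example sentence are assumptions made for illustration, and the thesis may have applied different text normalization or aggregated errors over the whole test set.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Hypothetical helper: tokenization is a plain whitespace split, with no
    case folding or punctuation handling.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1        # reference word missing from output
            insertion = curr[j - 1] + 1   # extra word in the output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One of five reference words ("sul") is missing from the hypothesis.
    print(word_error_rate("il gatto dorme sul divano", "il gatto dorme divano"))
```

Because the denominator counts only reference words, a hypothesis with many inserted words can produce a WER above 100%.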
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/189027