Phoenix : deep speech based automatic speech recognition system for Italian language

Libraries and Archives
POLITesi - Digital archive of degree and doctoral theses

CHEN, WEIBIN

2018/2019

Abstract

This thesis aimed to develop a system that transcribes Italian audio containing human speech into text. A fundamental part of developing an automatic speech recognition (ASR) system is collecting a large and heterogeneous corpus of audio recordings and their transcriptions, so that the system can build a general representation of the language. Because the few existing systems tailored to Italian were developed on small corpora, it is necessary to build a new system specifically for Italian and to train existing solutions on a larger corpus. Deep Speech is a state-of-the-art speech recognition system that uses end-to-end deep learning. Its architecture differs from that of traditional speech systems, which perform poorly in noisy environments. By contrast, Deep Speech needs no hand-designed components to model background noise, reverberation, or speaker changes; it directly learns a function that is robust to such effects. Furthermore, training does not require a phoneme lexicon, as traditional approaches do. The key to Deep Speech is an optimized RNN training system that uses GPUs, together with a set of novel data synthesis techniques that efficiently produce a large amount of varied training data. Our system, called Phoenix, reaches a word error rate (WER) of 13.8% and outperforms the previous prototype, which was based on the Kaldi toolkit. The result shows that Phoenix has good accuracy and confirms that the neural-network-based approach is better than the traditional one.
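The abstract reports accuracy as a word error rate (WER). As a minimal sketch of how that metric is commonly defined (word-level edit distance divided by the number of reference words), the following may help; the function name and the example sentences are illustrative assumptions and are not taken from the thesis, which may also apply text normalization before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER: word-level edit distance / number of reference words.

    Hypothetical helper for illustration; not the thesis's evaluation code.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One of five reference words is dropped by the (made-up) transcript.
print(word_error_rate("il gatto dorme sul divano", "il gatto dorme divano"))  # 0.2
```

Under this definition, a WER of 13.8% corresponds to roughly 14 word-level errors (substitutions, deletions, or insertions) per 100 reference words.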

Advisor	SBATTELLA, LICIA
Co-advisor(s)	TEDESCO, ROBERTO
School / Dept.	ING - Scuola di Ingegneria Industriale e dell'Informazione
Date	18-Dec-2019
Academic year	2018/2019
Abstract (translated from Italian)
This thesis aimed to develop an automatic speech recognition (ASR) system for the Italian language. A fundamental part of developing an ASR system is collecting a large and heterogeneous corpus of audio recordings, with their transcriptions, so that the system can build a general representation of the language. At present, no open-source ASR systems based on large corpora and vocabularies exist; hence the need to develop a new tool. Deep Speech is a state-of-the-art speech recognition system that uses end-to-end deep learning techniques. Its architecture differs from that of traditional speech systems, which perform poorly in noisy environments. By contrast, Deep Speech can model background noise, reverberation, or speaker changes. Moreover, training does not require a phoneme lexicon, as traditional approaches do. Deep Speech is based on recurrent neural networks (RNNs) and is optimized to exploit GPUs. Our system, called Phoenix, reaches a WER of 13.8% and outperforms the previous prototype, which was based on the Kaldi toolkit. The results show that Phoenix has good accuracy and confirm that the neural-network-based approach is superior to the traditional one.
Document type	Master's thesis (Tesi di laurea Magistrale)
Appears in types:	Master's thesis (Tesi di laurea Magistrale)

Attached files

File	Size	Format
Tesi.pdf (openly accessible online; description: Thesis)	1.42 MB	Adobe PDF

Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10589/152319