A tool for automatic text-to-audio alignment: KIMERA
SCANDROGLIO, STEFANO
2016/2017
Abstract
This thesis develops KIMERA, a system that performs text-to-audio alignment for the Italian language. Given an audio recording of human speech and its transcription, the alignment specifies the instants at which each word, and each of its phonemes, is spoken in the recording. The major issue is that large amounts of speech recordings with matching transcriptions are rarely available, since transcribing hours of speech is a long and expensive process. A possible solution is to use Automatic Speech Recognition (ASR) systems, which can both produce those transcriptions and align them to the audio, using the audio recordings alone as input. A fundamental part of developing an ASR system is collecting a large and heterogeneous corpus of speech recordings, together with their transcriptions, so that the system can build a general representation of a language. Because the few existing systems tailored to Italian are developed on small corpora, it was necessary to build a new system, specific to Italian and trained on a larger corpus than existing solutions. The ASR system developed for this thesis is built within the Kaldi speech recognition toolkit and trained on a heterogeneous corpus of speech recordings and transcriptions drawn from several different sources, so that its representation of the Italian language is as general as possible. The corpus is further processed to increase the number of words the system can recognize and to improve its robustness to noise. The experimental analysis compares the alignments produced by the proposed system with those provided by SPPAS, the only existing open-source toolkit for the Italian language, which is likewise based on an ASR system but built on a smaller corpus than the one used to develop KIMERA.
Results show that the system developed in this thesis, besides being the only publicly available large-vocabulary ASR system for the Italian language, outperforms SPPAS, suggesting that further work in this direction could yield even better results.

File | Description | Size | Format
---|---|---|---
2018_04_Scandroglio.pdf (not accessible) | Thesis text | 2.33 MB | Adobe PDF
Documents in POLITesi are protected by copyright, and all rights are reserved unless otherwise indicated.
https://hdl.handle.net/10589/140205