PESInet: prosody extraction by sound interpreting network

The work presented here is part of the LYV project, whose purpose is to help children with prosodic pronunciation disorders recover their prosodic capabilities so that they can express themselves correctly. Our study focuses on technical and implementation matters: in particular, we carried out a preliminary development of a system capable of automatically classifying sentences by their prosody. This can be considered the first attempt to build an automatic classifier of this kind, especially given the technique we decided to use. We based our approach on the rapidly growing neural network technology and, after a long period of study, chose its recurrent variant, the recurrent neural network (RNN). Neural networks are a paradigm widely used in Machine Learning and Artificial Intelligence tasks, by means of which data scientists try to simulate the inner workings of an actual brain. Generally speaking, these mathematical models let us solve a range of problems that require an approach different from the usual programming paradigms. Among these tasks is classification and, in particular, sequence labelling. The motivations behind the choice of RNNs are mostly practical. First, it lets us sidestep the problem of coming up with a precise set of high-level features to identify prosody, delegating this task to the network. Second, these models are showing encouraging developments and results in sequence analysis. By modelling our sentences as sequences of features, we hope to leverage these achievements. One of the main problems we had to deal with immediately was choosing a classification model. The issue is that, although many models are described in the literature, none of them can be considered either a cornerstone or a reference for prosody. 
These models are mostly defined case by case, and the classification follows the linguists' knowledge; as a result, they are tightly bound to the researchers' judgement. Given this and many other issues we faced during our research, we finally opted for a very basic classification that distinguishes among Questions, Exclamations and Statements. Over a long period we tested several different RNN architectures, which led to a wide range of results. These models are analysed throughout the chapters of this thesis. Among the related works, we also describe the tortuous path we followed to build our dataset and all the decisions that led us to the chosen classification model.

This work is part of the LYV project, whose purpose is to help children with pronunciation disorders improve their expressive abilities. The study focuses mainly on the technical and implementation aspects of a model able to automatically classify spoken sentences according to their prosody. It can be considered the first attempt to use neural network technology to build a classifier in this field. Neural networks are widely used in Machine Learning and Artificial Intelligence; they are algorithms that try, in some way, to simulate the inner workings of a human brain. These mathematical models make it possible to solve a set of problems that require an approach different from the usual programming paradigms. Among these problems are classification and, more specifically, sequence classification. The motivation for choosing RNNs was purely practical: first, it removed the need to define a precise set of features characterising prosody, leaving this task to the network. Second, RNNs were chosen for the encouraging results these models are showing in sequence analysis. By transforming sentences into sequences of features, we hope to exploit the results achieved in this field. One of the first problems addressed was the choice of the classification model. A large number of architectures are described in the literature, but none can be considered either a reference or a milestone for prosody. These models are often built on specific cases, and the results depend heavily on the judgement of the linguists who perform the classification. That said, we decided to use a set of classes that is very simple to define: Questions, Exclamations and Statements. 
Over a long period of time, many different RNN architectures were tested, yielding a very large number of results. These models are analysed throughout the chapters of this thesis. Among the related works, we report and describe the path followed to build the dataset, as well as all the decisions taken in choosing the final model.
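
To make the idea of classifying a sentence as a sequence of features more concrete, the following is a minimal sketch of a recurrent classifier in plain Python. It is not the architecture developed in this thesis: the class name TinyRNNClassifier, the two per-frame features (imagined here as pitch and energy), the hidden size and the random, untrained weights are all assumptions made purely for illustration. Only the three labels come from the text above.

```python
import math
import random

LABELS = ["Question", "Exclamation", "Statement"]  # classes named in the abstract


def init_matrix(rows, cols, rng, scale=0.1):
    """Return a rows x cols matrix of small random weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class TinyRNNClassifier:
    """Hypothetical Elman-style RNN: h_t = tanh(W_in x_t + W_rec h_{t-1} + b)."""

    def __init__(self, n_features, n_hidden, n_classes=len(LABELS), seed=0):
        rng = random.Random(seed)  # fixed seed so the sketch is reproducible
        self.w_in = init_matrix(n_hidden, n_features, rng)
        self.w_rec = init_matrix(n_hidden, n_hidden, rng)
        self.b_hidden = [0.0] * n_hidden
        self.w_out = init_matrix(n_classes, n_hidden, rng)
        self.b_out = [0.0] * n_classes

    def predict_proba(self, sequence):
        """Read the sequence frame by frame, then classify from the last state."""
        hidden = [0.0] * len(self.b_hidden)
        for frame in sequence:
            from_input = matvec(self.w_in, frame)
            from_past = matvec(self.w_rec, hidden)
            hidden = [
                math.tanh(a + r + b)
                for a, r, b in zip(from_input, from_past, self.b_hidden)
            ]
        scores = [s + b for s, b in zip(matvec(self.w_out, hidden), self.b_out)]
        return softmax(scores)

    def predict(self, sequence):
        """Return the label with the highest probability."""
        probs = self.predict_proba(sequence)
        return LABELS[probs.index(max(probs))]


# Hypothetical utterance: 5 frames, each with 2 made-up features (e.g. pitch, energy).
utterance = [[0.2, 0.5], [0.3, 0.6], [0.5, 0.4], [0.8, 0.3], [0.9, 0.2]]
model = TinyRNNClassifier(n_features=2, n_hidden=4)
probabilities = model.predict_proba(utterance)
print(dict(zip(LABELS, probabilities)))  # untrained weights: values carry no meaning
print(model.predict(utterance))
```

Because the weights are random, the predicted label is meaningless; the sketch only shows how a sentence of any length is reduced to a single hidden state and then mapped to one of the three classes. In a real system the weights would be learned from labelled utterances, and the input features would come from the audio signal.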