Exploring speech emotion recognition out of the laboratory

Interest in the emotional content of speech signals has grown recently, and various methods have been proposed to identify the emotion conveyed by a spoken utterance. Speech Emotion Recognition (SER) is the research field that deals with processing and classifying speech signals to infer the emotional state of a speaker. In this thesis we analyzed how well humans and several machine learning models classify the emotions conveyed in speech samples, in order to address two research questions. First, which machine learning algorithm among KNN, GMM and SVM performs best on the speech emotion recognition task? Second, is a machine learning model for speech emotion recognition trained on clean laboratory data also effective in a wilder, real-world context? We trained and tested three classifiers (KNN, GMM and SVM) and kept track of human performance on two datasets with different characteristics: EMOVO and Emozionalmente. The former was built in a laboratory with professional actors, while the latter was gathered through a crowdsourcing campaign in a real-world, wild context, with ordinary people recording the utterances. We also performed a cross-corpus experiment, training the models on one dataset and testing them on the other, and vice versa. The results are consistent with the literature for EMOVO, while performance drops on Emozionalmente. Observing the data, we concluded that this is probably because this corpus contains much more noise; the acting ability of the speakers is probably another important factor. In the cross-corpus experiment, the classifiers' performance deteriorated even further. Indeed, we observed that the two datasets are almost completely distinct and separable from one another. This explains the poor results and suggests that mixing the datasets in this way requires further analysis.
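
The cross-corpus protocol described above (train on one corpus, test on the other, and vice versa) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two "corpora" are synthetic 2-D points, not features extracted from EMOVO or Emozionalmente, the hand-written KNN stands in for the thesis's classifiers, and names such as `toy_corpus` and `corpus_b` are hypothetical. The second corpus is deliberately shifted and noisier, to mimic two datasets that are separable from one another.

```python
import math
import random
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Label x by majority vote among its k nearest training points."""
    nearest = sorted(zip(train_X, train_y), key=lambda p: math.dist(p[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(train, test, k=3):
    """Fraction of test points whose predicted label matches the true one."""
    train_X, train_y = train
    test_X, test_y = test
    hits = sum(knn_predict(train_X, train_y, x, k) == y
               for x, y in zip(test_X, test_y))
    return hits / len(test_y)


def toy_corpus(rng, shift, noise, n_per_class=40):
    """Synthetic 2-D 'features' for two emotions (illustrative only).

    `shift` moves the whole corpus along the first axis, `noise` is the
    standard deviation of the Gaussian spread around each class centre.
    """
    centers = {"joy": (0.0, 0.0), "anger": (3.0, 0.0)}
    X, y = [], []
    for label, (cx, cy) in centers.items():
        for _ in range(n_per_class):
            X.append((cx + shift + rng.gauss(0, noise), cy + rng.gauss(0, noise)))
            y.append(label)
    return X, y


def split(corpus, rng, train_fraction=0.75):
    """Shuffle a corpus and split it into (train, test) parts."""
    pairs = list(zip(*corpus))
    rng.shuffle(pairs)
    cut = int(len(pairs) * train_fraction)
    train_X, train_y = zip(*pairs[:cut])
    test_X, test_y = zip(*pairs[cut:])
    return (list(train_X), list(train_y)), (list(test_X), list(test_y))


rng = random.Random(0)
corpus_a = toy_corpus(rng, shift=0.0, noise=0.5)  # stand-in for a clean corpus
corpus_b = toy_corpus(rng, shift=3.0, noise=1.0)  # stand-in for a shifted, noisy corpus

train_a, test_a = split(corpus_a, rng)
results = {
    "within A": accuracy(train_a, test_a),
    "train A, test B": accuracy(corpus_a, corpus_b),
    "train B, test A": accuracy(corpus_b, corpus_a),
}
for name, acc in results.items():
    print(f"{name}: {acc:.2f}")
```

In this toy setting the shift places one corpus's "joy" region on top of the other's "anger" region, so a classifier fitted on one corpus mislabels much of the other. It only illustrates why a distribution mismatch between corpora can hurt cross-corpus accuracy; it does not reproduce the thesis's results.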
