Emotional speech data collection via crowdsourcing

Researchers in many disciplines (e.g. psychology, computational linguistics, human-computer interaction, artificial intelligence) have investigated human emotions, their expression and their perception, and have developed theoretical methods, computational techniques and technologies (typically based on Machine Learning, ML) to identify the emotions encoded in bodily manifestations such as facial and vocal expressions. The main theme of this thesis is the identification of emotions expressed in speech, a field known as Speech Emotion Recognition (SER). An intrinsic difficulty in this field is obtaining adequate datasets of vocal expressions on which to define and test appropriate ML techniques. Until now, most research has used datasets created by hiring actors and using professional recording equipment. This approach yields high-quality data in terms of both audio and emotional expressiveness, but it requires a complex and expensive process. This thesis explores the feasibility and effectiveness of creating such datasets through crowdsourcing. For this purpose, "Emozionalmente" was developed: a crowdsourcing web application that lets "ordinary people" record and "tag" their own vocal expressions using their personal mobile devices. ML techniques were used to analyse the data and to recognize the emotions expressed in the spoken sentences. Several models were developed using both the data collected with Emozionalmente and the data available in EMOVO, an existing Italian dataset of sentences recited by actors. The aim was to establish whether the data collected through crowdsourcing are actually usable in an SER process, that is, whether they yield results of a quality similar to those obtainable with the actor-based dataset. The models were compared in terms of accuracy, precision, recall, Matthews correlation coefficient (MCC) and F1 score. Finally, to complete the comparison, human emotion recognition performance was assessed against that of the SER techniques based on the crowdsourced dataset and on the EMOVO dataset, respectively. This document presents the development of the Emozionalmente web application, the data collection method based on it, and the method used to analyse and compare the collected data; it then discusses the results obtained and the prospects for future research.
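As an illustration of the five metrics named above, the following minimal sketch computes them for a multi-class emotion classifier in plain Python. It is not the evaluation code used in the thesis. The function name `evaluate`, the emotion labels and the sample predictions are hypothetical, and macro averaging over classes is an assumption, since the abstract does not state how per-class scores were aggregated. MCC is computed with the standard multi-class generalization based on the confusion-matrix counts.

```python
from collections import Counter
from math import sqrt


def evaluate(y_true, y_pred):
    """Return accuracy, macro precision/recall/F1 and multi-class MCC.

    Hypothetical helper for illustration; macro averaging is an assumption.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    labels = sorted(set(y_true) | set(y_pred))
    n = len(y_true)
    true_counts = Counter(y_true)   # how often each class really occurs
    pred_counts = Counter(y_pred)   # how often each class is predicted
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    correct = sum(hits.values())

    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = hits[label]
        # A class that is never predicted (or never occurs) scores 0 here.
        prec = tp / pred_counts[label] if pred_counts[label] else 0.0
        rec = tp / true_counts[label] if true_counts[label] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    # Multi-class MCC from the total, correct, per-class true and predicted counts.
    cov_tp = correct * n - sum(pred_counts[k] * true_counts[k] for k in labels)
    cov_pp = n * n - sum(pred_counts[k] ** 2 for k in labels)
    cov_tt = n * n - sum(true_counts[k] ** 2 for k in labels)
    denom = sqrt(cov_pp * cov_tt)

    return {
        "accuracy": correct / n,
        "precision": sum(precisions) / len(labels),
        "recall": sum(recalls) / len(labels),
        "f1": sum(f1s) / len(labels),
        "mcc": cov_tp / denom if denom else 0.0,
    }


# Hypothetical labels and predictions, not data from the thesis.
sample_true = ["anger", "anger", "joy", "joy", "sadness", "sadness"]
sample_pred = ["anger", "joy", "joy", "joy", "sadness", "anger"]
for name, value in evaluate(sample_true, sample_pred).items():
    print(f"{name}: {value:.3f}")
```

Running the same function on the predictions of a model trained on crowdsourced data and of a model trained on acted data would give directly comparable scores. MCC is useful alongside accuracy here because it accounts for how predictions are distributed across classes, which matters when some emotions are recorded more often than others.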
