PATHOSnet: parallel, audio-textual, hybrid organization for sentiment network

Voice, facial expressions, and gestures are verbal and non-verbal channels used to communicate meaning and emotional state in a natural but complex way. Speech is one of the main channels for expressing emotions, and for a natural human-machine interface it is important to recognize, interpret, and respond to the emotions communicated in speech. Researchers have studied various machine learning and neural network approaches for recognizing emotion from speech alone or from text alone, but these approaches rarely combine speech with its text transcription. In this work we study emotion recognition that combines speech and text information, classified by a Recurrent Neural Network, which is well suited to this task. The purpose of this thesis is to build a system capable of recognizing emotion by combining the audio and text information of speech, and to show that this approach outperforms systems that consider audio or text information alone. The approach is evaluated on the IEMOCAP corpus, which offers realistic audio recordings and transcriptions of sentences with emotional content. The IEMOCAP corpus has been modified and adapted to consider four emotion classes (Joy, Anger, Sadness, Neutral) in order to meet the needs of the implemented models. First, two distinct models are built and evaluated, treating text and audio information separately; they obtain overall accuracies of 62.3% and 55.5%, respectively. Next, our final model, which combines text and audio information, is built and evaluated; it reaches an overall accuracy of 69.3%, an improvement of 7.0 percentage points over the text-only model and 13.8 percentage points over the audio-only model.

Voice, facial expressions, and gestures are all verbal and non-verbal channels used to communicate meaning and emotional state in natural but complex ways. Language is one of the main channels for expressing emotions and, for a human-machine interface, it is important to recognize, interpret, and respond to the emotions communicated by the user. Various approaches using machine learning and neural networks have been studied to recognize emotions, mainly from speech alone or from text alone, but they rarely combine the information coming from both communication channels. In this work we study the problem of emotion recognition by combining speech and text information classified by a recurrent neural network, a tool well suited to this task. The purpose of this thesis is to build a system able to recognize emotion by combining text and audio information, and to show that this approach outperforms systems that consider audio or text information alone. The approach is evaluated on the IEMOCAP corpus, which offers realistic audio recordings and the corresponding transcriptions of sentences with emotional content. The IEMOCAP corpus has been modified and adapted to consider four emotion classes (Joy, Anger, Sadness, Neutral) in order to meet the needs of the implemented models. The first two models are built and evaluated treating text and audio information separately, obtaining an overall accuracy of 62.3% (text-only model) and 55.5% (audio-only model). The final model, which combines text and audio information, shows a large improvement over the first two models in recognizing emotion, reaching an overall accuracy of 69.3%.