SFERAnet: automatic generation of football highlights

The project presented in this thesis aims to automate the generation of summaries of sports events by identifying highlights through analysis of the commentators' voices. In particular, the analysis exploits the emotion in the commentators' voices, studied through Natural Language Processing techniques, to detect the highlights. To this end, two multimodal neural networks were developed to process sequences of features extracted from the audio and the text of match commentaries. The first network has two distinct goals: to classify sequences and to provide a representation of those sequences for the second network to process. The second network, in turn, uses the intermediate representations produced by the first to process potentially unbounded streams of information. Both networks build on components that have proved particularly effective in this area in recent years; among these, LSTM recurrent layers are a fundamental tool for analysing sequential data. In this work, raw audio and textual data extracted from 369 football matches served as the source for feature extraction. The resulting features were used to train the two networks: for the first, the feature streams were split into sub-sequences at (approximately) sentence level, while for the second the entire streams were used. The final results are very promising; however, the models still require further tuning and improvement before they can be adopted in a real-world video pipeline.

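The difference between the inputs of the two networks (sub-sequences at roughly sentence level for the first, entire streams for the second) can be illustrated with a minimal sketch. The abstract does not state how sentence boundaries were located, so the pause-based rule below, the `Frame` type, the function name `split_at_pauses`, and the `max_gap` threshold are all hypothetical and serve only as an illustration, not as the method used in the thesis.

```python
from typing import List, Sequence, Tuple

# Hypothetical frame type: (timestamp in seconds, feature vector).
Frame = Tuple[float, Sequence[float]]


def split_at_pauses(stream: List[Frame], max_gap: float = 0.8) -> List[List[Frame]]:
    """Split a feature stream wherever two consecutive frames are more than
    `max_gap` seconds apart (an assumed proxy for a sentence boundary)."""
    chunks: List[List[Frame]] = []
    for frame in stream:
        # Extend the current chunk if the gap to its last frame is small enough.
        if chunks and frame[0] - chunks[-1][-1][0] <= max_gap:
            chunks[-1].append(frame)
        else:
            chunks.append([frame])
    return chunks


# Toy stream with one long pause between t=0.6 and t=2.0.
stream = [(0.0, [0.1]), (0.3, [0.2]), (0.6, [0.4]), (2.0, [0.9]), (2.2, [0.8])]

sub_sequences = split_at_pauses(stream)  # would feed the first network
whole_stream = stream                    # would feed the second network

print([len(chunk) for chunk in sub_sequences])  # prints [3, 2]
```

In this toy example the stream is cut once, at the long pause, while the second network would receive all five frames as a single sequence.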