Automatic genre classification of TV programs using L3-based deep audio features

Television programs can often be exhaustively categorized by genre. Retrieving this attribute automatically is particularly beneficial for large multimedia catalogues, where efficient content management is needed. Typically, TV genres are characterized by articulate and subjective definitions, and some genres are more similar to each other than others in various respects. Annotation tools therefore require rich and detailed information to perform the task correctly, and they can exploit several modalities to extract it. Among these modalities, audio is probably the simplest to deal with and, compared to vision and text, also the one that has been studied the least for this task. For this reason, this manuscript investigates audio-based Automatic Genre Classification tools. Since datasets for this task are not publicly available, we started by creating a new dataset of Italian TV programs, called ITTV Dataset, which comprises nearly 700 hours of content. Next, we investigated the performance of end-to-end Neural Networks, which, to the best of our knowledge, have never been studied on this task. The results obtained motivated the need for more complex architectures, like the one presented by Pham et al. (2021), which is currently the state of the art for this task. We noticed, however, that the approach proposed by this architecture is not reliable for genres that show similar acoustic signatures. Therefore, we designed a new multi-stage architecture that leverages visually informed audio features, thanks to the Look, Listen and Learn (L3) embedding extractor introduced by Arandjelovic et al. (2017). This new architecture overcomes the limitations of Pham et al. (2021) and obtains state-of-the-art results on our dataset.

Television programs can often be exhaustively differentiated by their genre. Obtaining this kind of attribute automatically is particularly useful for large multimedia catalogues, where efficient content management is necessary. Typically, television genres are characterized by articulate and subjective definitions, and some of them prove to be similar to one another. For this reason, annotation tools require a rich and detailed level of information to carry out their task correctly. Among the different modalities, audio is probably the simplest to handle and, compared to the visual modality and to text, it is one of the least studied for this problem. For this reason, this manuscript investigates tools for Automatic Genre Classification based on audio alone. Since the datasets available for studying this problem are not publicly accessible, we first created a new dataset of Italian television programs, called ITTV Dataset, which comprises about 700 hours of content. We then studied the performance of end-to-end neural networks, which, to the best of our knowledge, have never been studied for this problem. The results obtained motivated the need for more complex architectures, such as the one proposed by Pham et al. (2021), which is currently the state of the art for this task. However, we noticed that this architecture is not reliable for television genres that show a similar acoustic signature. For this reason, we designed a new architecture that exploits audio features containing visual information, thanks to the feature extractor named Look, Listen and Learn, introduced by Arandjelovic et al. (2017). This new architecture overcomes the limitations of the one proposed by Pham et al. (2021) and obtains state-of-the-art results on our dataset.