Unsupervised domain adaptation for deep learning based acoustic scene classification

Acoustic Scene Classification (ASC) is the task of automatically assigning to an audio recording a label that characterizes the environment in which it was captured, for example “Park”, “Café”, or “Metro station”. This field of study, related in many respects to Automatic Speech Recognition (ASR), Music Information Retrieval (MIR), and Computational Auditory Scene Analysis (CASA), has attracted growing interest from the research community over the past few years, thanks in part to recent advances in deep learning. However, deep learning-based techniques are notoriously sensitive to a problem known as domain shift. This issue, which severely impairs the performance of a classifier, arises from a mismatch between the distributions of the data used to train and to evaluate the model. It is especially acute for environmental recordings. Because acoustic scenes are complex and unstructured superpositions of many diverse sound sources, one might ask whether an ASC system trained to recognize scenes recorded under certain conditions can classify audio clips captured in different locations and at different times. The ever-changing nature of acoustic scenes convinced us of the need to focus on unsupervised techniques, and the novelty of the research topic further justified our study. Indeed, to the best of our knowledge, this work constitutes the second attempt at Unsupervised Domain Adaptation for Acoustic Scene Classification. In this respect, our contribution is threefold: we propose two completely unsupervised methods designed to be independent of the underlying learning model, to work regardless of the number of available target-domain samples, and not to require a new training phase when applied to a different target domain. We show that both proposed methods achieve higher cross-domain classification accuracy than any non-adapted baseline system, and that the second even outperforms the semi-supervised learning technique that we had regarded as an empirical upper bound for the unsupervised adaptation task.

Acoustic Scene Classification (ASC) refers to the task of automatically assigning to an audio recording a label that characterizes the physical environment in which it was made, for example “Park”, “Train”, or “Restaurant”. This field of study, related in many respects to Automatic Speech Recognition (ASR), Music Information Retrieval (MIR), and Computational Auditory Scene Analysis (CASA), has recently attracted the attention of the scientific community, thanks in part to the notable developments in deep learning in recent years. However, these machine learning techniques are notoriously subject to so-called domain shift, a problem that arises from a discrepancy between the distributions of the data used to train and to validate classifiers, and that severely impairs their performance. This is all the more true for environmental recordings. Since these are complex and unstructured superpositions of many sound sources, one might ask whether a system trained to recognize acoustic scenes collected under certain conditions can correctly classify new recordings made in different places and at different times. The changing nature of acoustic scenes led us to examine unsupervised learning techniques, and the novelty of the research area, attested by the extremely limited state of the art, further motivated our investigation. Following a review of the literature, this work appears to be the second study specifically devoted to the problem of Unsupervised Domain Adaptation for acoustic scenes.
In this respect, our contribution is threefold: we propose two unsupervised methods designed so as not to require a new training phase when applied to a different domain, not to be constrained by the number of target-domain samples, and to be independent of the machine learning system adopted. Both proposed systems achieve better results than the non-adapted reference systems, while the second even proves capable of exceeding the performance of the semi-supervised algorithm that, throughout our work, we regarded as a sort of empirical upper bound for the unsupervised approach.