Convolutional neural networks applied to space-time audio processing applications

Soundfield reconstruction has been of great interest to the research community for decades because of its many application domains. The goal of this thesis is to reconstruct the soundfield of a room from signals acquired by single microphones distributed randomly in the environment. The representation we choose for the soundfield is the one given by the Ray Space Transform, which maps the information acquired by microphone arrays onto a domain called the ray space. A point in this domain corresponds to a ray in the geometric space, i.e. an oriented line along which acoustic energy is transported. The limit that the Shannon sampling theorem imposes on the spacing between microphones, needed to avoid spatial aliasing, makes the problem intractable for very large environments, which would require huge microphone setups. We therefore adopt an approach that combines space-time audio processing techniques with deep learning. In particular, we use deep convolutional neural networks to reconstruct the images obtained by computing the magnitude of the Ray Space Transform. We approach the problem by gradually increasing the complexity of the reconstruction task. First, we test the ability of convolutional neural networks to perform acoustic source localization, devising a classification-based approach that uses measurements acquired by microphones randomly scattered in a room. Then, we address the reconstruction of the Ray Space Magnitude when the arrays used for the acquisition have a number of randomly chosen missing microphones. To solve this problem we devise an autoencoder architecture able to reconstruct the information that would have been acquired by arrays in which all the microphones are present.
Finally, we consider the reconstruction of the magnitude of the Ray Space Transform from measurements acquired by randomly distributed single microphones, which we tackle by training a double-input encoder/decoder convolutional architecture. We provide simulations showing the ability of our models to perform the soundfield reconstruction task. This work can be seen as an exploratory step towards combining the ray space representation with machine learning techniques, and it can lead to applications in more complex scenarios.
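
The spacing constraint mentioned above can be made concrete with a small calculation. The sketch below assumes the usual half-wavelength rule for a uniform linear array (spacing no larger than c / (2 f_max)) and a speed of sound of 343 m/s; the function names, the 4 kHz upper frequency and the apertures are illustrative assumptions and are not taken from the thesis.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at about 20 degrees Celsius


def max_mic_spacing(f_max_hz, c=SPEED_OF_SOUND):
    """Largest spacing (m) of a uniform linear array that avoids
    spatial aliasing up to f_max_hz, using the half-wavelength rule."""
    return c / (2.0 * f_max_hz)


def min_mic_count(aperture_m, f_max_hz, c=SPEED_OF_SOUND):
    """Fewest microphones needed to cover aperture_m with a spacing
    no larger than the anti-aliasing limit."""
    spacing = max_mic_spacing(f_max_hz, c)
    # ceil gives the number of intervals; one extra microphone closes the array.
    return math.ceil(aperture_m / spacing) + 1


# Hypothetical apertures (m): the count grows linearly with the array length.
for aperture in (1.0, 10.0, 50.0):
    print(f"{aperture:5.1f} m -> {min_mic_count(aperture, f_max_hz=4000.0)} microphones")
```

Under these assumptions the required microphone count scales linearly with the aperture and with the highest frequency of interest, which illustrates why dense arrays become impractical in large rooms.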

The problem of soundfield reconstruction has held the interest of the scientific community for decades, given its wide range of applications. The aim of this thesis is the reconstruction of the soundfield of an environment by means of randomly distributed microphones. The representation we choose for the soundfield is the one given by the ray space transform, which makes it possible to map the information acquired by microphone arrays onto the domain known as the ray space. In particular, points in this domain correspond to rays in the geometric space, that is, oriented lines along which acoustic energy is transported. Physical limitations imposed by Shannon's theorem on the microphone spacing required to avoid spatial aliasing make the problem intractable for large environments, which would require enormous microphone configurations. For these reasons we choose an approach that combines space-time sound processing with deep learning techniques. In particular, we use deep convolutional neural networks to reconstruct the images obtained by computing the magnitude of the ray space transform. In our approach, we choose to increase the complexity of the reconstruction gradually. First, we test the ability of convolutional networks to tackle acoustic source localization. To this end, we develop a classifier that uses measurements acquired by randomly positioned microphones. Next, we address the reconstruction of the magnitude of the ray space transform when the arrays used to acquire the signals have a random number of missing microphones. To this end, we develop an autoencoder architecture, which allows us to reconstruct the information that would have been obtained using arrays in which all the microphones are present.
Finally, we address the reconstruction of the magnitude of the ray space transform from measurements acquired by randomly distributed microphones. We do this by training a two-input encoder/decoder convolutional architecture. To demonstrate the effectiveness of our models, we provide results obtained through simulations. The methods developed in this thesis can be regarded as a first exploration of the possibilities that could arise from combining the ray space transform with machine learning techniques, and they pave the way for application to more complex scenarios.