Blind source separation (BSS) of multichannel music recordings with long short-term memory (LSTM) recurrent neural networks

This thesis implements Long Short-Term Memory (LSTM) recurrent neural networks to perform Blind Source Separation (BSS) of multichannel, professionally produced music recordings. Music BSS is useful for applications such as music editing, upmixing, music information retrieval and karaoke, and is the subject of ongoing research. Since very good results in the music BSS task have recently been achieved with Feedforward Neural Networks (FNN), this thesis investigates whether the LSTM architecture can benefit from its memory and improve performance with respect to FNN architectures. To this end, four LSTM and two FNN networks were trained in various configurations but under analogous conditions, to allow a fair comparison of their performance. The results are divided into two parts: raw results and post-processed results. The raw results are used to compare the performance of the LSTM and FNN architectures without the influence of any post-processing. They show LSTM networks achieving better separation than FNN networks, in particular for target sources with repetitive patterns or slow variations in the spectrogram, where the LSTM memory can be exploited more extensively. The post-processed results place the performance of the networks in a broader context by comparing them with the most recent methods presented at the Signal Separation Evaluation Campaign (SiSEC) 2015. They show the implemented LSTM and FNN networks performing better than any non-neural-network method proposed at the SiSEC campaign, in line with the results of the other neural networks proposed, thus confirming neural networks as a prominent method for the task.

This thesis implements Long Short-Term Memory (LSTM) recurrent neural networks for multichannel audio Blind Source Separation (BSS), specifically of professional music recordings. BSS of music signals has applications in many areas, e.g. music editing, upmixing, Music Information Retrieval and karaoke, and is still the subject of research today. In recent years, the best results in BSS have been obtained with implementations of Feedforward Neural Networks (FNN). This thesis sets out to explore the potential of LSTM networks and, in particular, whether and how much their internal memory mechanism can benefit the quality of BSS of multichannel music recordings. For this reason, FNN networks (without any memory mechanism) and LSTM networks were implemented, in different configurations but under similar conditions, with the aim of comparing their performance. Two categories of results are presented: direct results and processed results. The direct results come from the direct evaluation of the networks' extractions and are used to compare the performance of the two architectures, LSTM and FNN, directly and free from external influences such as a posteriori filtering of the extractions. The direct results clearly show that the LSTM networks performed better than the FNN networks, in particular for the separation of instruments whose spectrogram is characterized by few variations or by repetitive sections, where LSTM networks can exploit their memory more widely and effectively. The processed results, by contrast, are used to put the performance of the implemented networks into a broader perspective, comparing them with the results of the most recent scientific research in the BSS field presented at the Signal Separation Evaluation Campaign (SiSEC) 2015. These results show the implemented networks obtaining better extractions than all non-neural-network methods presented at SiSEC, in line with the results of the other neural networks proposed, confirming their importance in the field of BSS.