Deep learning methods for sound-matching in semi-modular synthesizer environments
Minneci, Lorenzo
2021/2022
Abstract
Audio synthesizers are devices that can produce and manipulate electronic sounds from scratch. Over the last decades they have defined entirely new forms of music and aesthetics worldwide. However, the complexity arising from their large number of parameters demands knowledge and domain expertise: tuning these parameters manually to match a specific target sound can be a very challenging and time-consuming task. This thesis investigates the sound-matching problem applied to a semi-modular synthesizer environment through a variety of Deep Learning techniques. For this purpose we used Transformers, a novel architecture recently introduced by Vaswani et al. [1], alongside other deep learning models: the Multi-Layer Perceptron (MLP), the Long Short-Term Memory network (LSTM), the Bidirectional Long Short-Term Memory network (LSTM++) and the Convolutional Neural Network (CNN). The sound-matching problem, in the field of Automatic Synthesizer Programming (ASP), aims at predicting the configuration of parameters a synthesizer needs in order to generate a given target sound. In the literature, all studies employing deep learning methods for sound matching have targeted traditional, non-modular synthesizer architectures; moreover, we introduce Transformers for the first time to solve this problem. Semi-modular synthesizers are considered among the most complex electronic instruments, able to produce a very wide range of sonic timbres. Transformers, in turn, represent the state of the art in sequence modeling, usually outperforming Recurrent Neural Network models on Natural Language Processing and Computer Vision tasks. In our research we found that Transformers are a powerful but rather unstable architecture, whose performance can be improved by increasing the amount of training data and by carefully tuning their hyperparameters. The networks' reconstruction accuracy for a given target was evaluated in both the parametric and the spectral domain.
In our study, the LSTM was the best-performing architecture for spectral reconstruction; it was also the best model for parametric inference precision on the most complex dataset, with Transformers as the second-best model on that same dataset. In summary, this research sets a new benchmark in the field of Automatic Synthesizer Programming. We first introduced a completely new dataset generated entirely by a semi-modular synthesizer environment; we then reported sound-matching results for a variety of Deep Learning techniques. We believe future developments of this research could move towards a differentiable model of the synthesis, so that spectral information about the reconstructed audio could also be retained in the backpropagation. Moreover, a listening test could be a valuable metric for validating or improving the numerical results. We also plan to incorporate this thesis work into a Digital Audio Workstation (DAW), allowing many more people to experiment with deep learning models and find innovative directions in sound exploration and design.
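The two evaluation domains mentioned above can be illustrated with a minimal sketch. The function names and the use of a plain log-magnitude FFT distance are illustrative assumptions, not the thesis's exact metrics: parametric error compares predicted and ground-truth synthesizer parameters directly, while spectral error compares the audio rendered from the predicted parameters against the target audio.

```python
import numpy as np

def parametric_error(pred_params, target_params):
    # Mean squared error between predicted and ground-truth synth
    # parameters, both assumed normalized to the same range.
    pred = np.asarray(pred_params, dtype=float)
    target = np.asarray(target_params, dtype=float)
    return float(np.mean((pred - target) ** 2))

def spectral_error(pred_audio, target_audio, eps=1e-8):
    # Log-magnitude spectral distance between the audio rendered from
    # the predicted parameters and the target audio (same length/rate).
    P = np.abs(np.fft.rfft(np.asarray(pred_audio, dtype=float)))
    T = np.abs(np.fft.rfft(np.asarray(target_audio, dtype=float)))
    return float(np.mean((np.log(P + eps) - np.log(T + eps)) ** 2))

# A perfect parameter match gives zero error in both domains.
params = np.array([0.2, 0.7, 0.5])
audio = np.sin(2 * np.pi * 440 * np.arange(1024) / 44100)
print(parametric_error(params, params))  # 0.0
print(spectral_error(audio, audio))      # 0.0
```

Note that a network trained only on parametric error receives no gradient from the spectral domain, which is precisely the gap a differentiable synthesis model, as suggested above, would close.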
File: MINNECI - Executive Summary + Thesis.pdf | 31.72 MB | Adobe PDF | Open Access since 30/11/2022
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/183481