Development of a real-time artistic installation for enhancing music performances by generating evolving visuals from audio input

Nowadays, live performances dominate several aspects of the music industry. It is crucial for musicians to promote and share their art with an audience. The evolution of live shows is largely driven by the advent of new instruments and technologies that expand creative and artistic possibilities. In particular, electronic and digital devices have paved the way for the introduction of new art forms into musical performances, such as light effects or visual content. This thesis aims to visually represent a song played in the context of a live show or an artistic installation where music is involved. The proposed method works in real time, starting from two inputs, the microphone and the instrumental track, and returning visual content that evolves coherently with the input song. The artwork’s content is guided by the lyrics, and the colour palette is set by the instrumental track. To achieve the final result, several Machine Learning techniques are explored and used. The proposed method involves several components designed to solve tasks such as Automatic Speech Recognition, Music Genre Classification and Real-Time Image Synthesis. The system consists of two sub-processes that work in parallel. The microphone input is transcribed in real time by a component based on the Whisper architecture. The instrumental track is passed to a classifier based on Discogs-EffNet, which outputs the five genres most related to the audio under analysis. These genres are mapped to colours in order to create the artwork’s colour palette. The transcribed text and the colour palette are then combined into a prompt to generate the final images in real time using a generative model trained with the Adversarial Diffusion Distillation approach. To keep consecutive frames correlated, text embeddings are extracted and interpolated. The proposed method is evaluated by presenting different inference and optimization techniques, whose performance results are then compared in order to find the best solution for applying the method in a real environment. A prototype of the system was presented to an audience, and the feedback collected is discussed.
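The step from genre scores to a colour palette and then to a prompt, described above, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not give the actual genre-to-colour table, the genre label names, or the prompt template, so `GENRE_COLOURS`, `DEFAULT_COLOUR`, `top_genres`, `build_palette` and `build_prompt` are hypothetical names, and the classifier scores are taken as an already computed dictionary.

```python
from typing import Dict, List

# Hypothetical genre-to-colour table; the actual mapping is not given in the abstract.
GENRE_COLOURS: Dict[str, str] = {
    "rock": "crimson",
    "electronic": "electric blue",
    "jazz": "amber",
    "hip hop": "violet",
    "folk": "forest green",
    "classical": "ivory",
}
# Fallback colour for genres missing from the table (also an assumption).
DEFAULT_COLOUR = "grey"


def top_genres(scores: Dict[str, float], k: int = 5) -> List[str]:
    """Return the k genre labels with the highest classifier scores."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [label for label, _ in ranked[:k]]


def build_palette(genres: List[str]) -> List[str]:
    """Map each genre to a colour, dropping duplicates while keeping order."""
    palette: List[str] = []
    for genre in genres:
        colour = GENRE_COLOURS.get(genre, DEFAULT_COLOUR)
        if colour not in palette:
            palette.append(colour)
    return palette


def build_prompt(lyrics: str, palette: List[str]) -> str:
    """Combine the transcribed lyrics and the palette into one text prompt."""
    return f"{lyrics.strip()}, colour palette: {', '.join(palette)}"


if __name__ == "__main__":
    # Made-up scores standing in for the genre classifier output.
    scores = {"rock": 0.6, "electronic": 0.3, "jazz": 0.1, "folk": 0.05}
    print(build_prompt("a river of stars", build_palette(top_genres(scores))))
```

In a running system the scores would come from the genre classifier and the lyrics from the speech recognition component; the sketch only shows how the two streams could meet in a single prompt.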

Nowadays, live performances dominate several aspects of the music industry. For musicians, it is crucial to promote and share their art with an audience. The evolution of live shows is driven by the advent of new musical instruments and technologies that allow artists to expand their creative possibilities. In particular, the introduction of digital systems has paved the way for the use of new art forms within musical performances, such as light effects or video content. This thesis aims to visually represent a song performed at a live concert or in an artistic installation that includes a musical performance. The proposed method operates in real time, starting from two inputs, the microphone and the instrumental track, and returning visual content that evolves based on the song being performed. The content of the artwork is guided by the sung lyrics; the colour palette is determined by the instrumental track. The result is achieved through several Machine Learning techniques. The proposed method consists of a set of components designed to carry out tasks such as Automatic Speech Recognition, Music Genre Classification and Real-Time Image Synthesis. The system consists of two sub-processes that run in parallel. The microphone input is transcribed in real time by a component based on the architecture of the Whisper model. The instrumental audio is fed to a classifier based on Discogs-EffNet, which returns the 5 genres most closely associated with the audio under analysis. Each genre is then mapped to a colour to obtain a palette. The transcribed text and the colour palette are concatenated into a prompt for real-time image generation with models trained through Adversarial Diffusion Distillation. To obtain correlation between images, the text is encoded and the resulting embeddings are interpolated. The proposed method was evaluated by presenting and comparing different inference and optimization techniques in order to arrive at the best possible solution. Finally, a prototype of the system was presented to an audience, and the results of the feedback are discussed.
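The interpolation of text embeddings mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say whether the interpolation is linear or spherical, so plain linear interpolation is used here, embeddings are represented as flat lists of floats, and the names `lerp` and `interpolate_embeddings` are hypothetical.

```python
from typing import List

Vector = List[float]


def lerp(a: Vector, b: Vector, t: float) -> Vector:
    """Linearly interpolate between two equal-length vectors, with t in [0, 1]."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


def interpolate_embeddings(start: Vector, end: Vector, steps: int) -> List[Vector]:
    """Return `steps` embeddings moving from start to end, endpoints included."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [lerp(start, end, i / (steps - 1)) for i in range(steps)]


if __name__ == "__main__":
    # Toy 2-dimensional vectors stand in for real text embeddings.
    for frame in interpolate_embeddings([0.0, 2.0], [1.0, 0.0], 3):
        print(frame)
```

Each intermediate vector would condition one generated frame, so that the image changes gradually as the prompt moves from one transcribed phrase to the next instead of jumping between unrelated pictures.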