Controllable music generation with neural discrete representations and a multitrack MIDI-to-audio dataset
Gionfriddo, Matteo
2023/2024
Abstract
Recently, a large number of deep-learning models have been proposed for music generation. One particularly effective application is the controllable generation of high-quality audio through the manipulation of audio tokens. However, inadequate attention has been paid to the structural organization that distinguishes musical pieces. This has resulted in a lack of originality and authenticity in generated music, as well as an absence of effective content-based creative frameworks. Moreover, the scarcity of strongly annotated acoustic datasets forces models to rely on very generic embedding extraction with little semantic meaning, or even on MIDI datasets alone. This thesis presents Audio909, a multi-track acoustic dataset of 909 piano performances, and uses a subset of it to train a transformer for music generation that exploits musical structure annotations through a novel structure-informed positional encoding method. The Audio909 dataset was obtained by driving a self-playing upright piano with a pre-existing MIDI dataset. Each performance was captured using five different recording techniques, including monophonic and stereophonic configurations as well as varying degrees of room reverberation. The subset of recordings used for music generation was synchronized with its pre-existing hierarchical structure annotations and discretized with a well-known neural audio codec, EnCodec. The Sine StructureAPE positional encoding was applied to the StructurePE music generation model and was shown to outperform the baselines in the acoustic domain, extending results previously known only in the symbolic domain.
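As a concrete illustration of the discretization step mentioned above, the sketch below tokenizes a recording with the publicly released EnCodec model. This follows EnCodec's documented usage; the file name and the 6 kbps target bandwidth are placeholders, not settings taken from the thesis.

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# Pretrained 24 kHz EnCodec model; 6 kbps is an illustrative bandwidth,
# not necessarily the one used for Audio909.
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)

# Load a recording (hypothetical file name) and resample/remix it
# to the sample rate and channel count the codec expects.
wav, sr = torchaudio.load("performance.wav")
wav = convert_audio(wav, sr, model.sample_rate, model.channels)

# Encode to discrete tokens: one code sequence per residual quantizer.
with torch.no_grad():
    encoded_frames = model.encode(wav.unsqueeze(0))
codes = torch.cat([frame[0] for frame in encoded_frames], dim=-1)
print(codes.shape)  # [batch, n_quantizers, n_timesteps]
```

The resulting integer codes are what a transformer would model in place of raw audio samples.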
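The abstract does not spell out how Sine StructureAPE is computed, so the following is only one plausible reading, assuming each token carries integer labels derived from the hierarchical annotations (here a hypothetical section id and bar-within-section id) and that a standard sinusoidal table is evaluated at those structural indices in addition to the raw token position. All names and the toy A-A-B-A labels are illustrative.

```python
import math
import torch

def sinusoidal_pe(index: torch.Tensor, d_model: int) -> torch.Tensor:
    """Sine/cosine encoding evaluated at given integer indices.

    index: [T] positions; they may repeat, e.g. a section id shared
    by every token of that section. Returns [T, d_model].
    """
    div = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32)
        * (-math.log(10000.0) / d_model)
    )
    angles = index.float().unsqueeze(1) * div      # [T, d_model/2]
    pe = torch.zeros(index.shape[0], d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

d_model = 512
# Hypothetical annotation for a 12-token excerpt in A-A-B-A form:
section_id = torch.tensor([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])
bar_in_section = torch.tensor([0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2])
token_pos = torch.arange(12)

# One sinusoidal component per hierarchy level, summed so that tokens
# sharing a section or bar share part of their positional signal.
structure_pe = (
    sinusoidal_pe(token_pos, d_model)
    + sinusoidal_pe(section_id, d_model)
    + sinusoidal_pe(bar_in_section, d_model)
)
# structure_pe would be added to the token embeddings before the
# transformer layers, where an ordinary absolute PE would go.
```

Summing one component per annotation level gives tokens within the same section a shared positional bias, which is the kind of structural awareness the thesis attributes to its encoding.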
| File | Size | Format | Access |
|---|---|---|---|
| 2024_10_Gionfriddo_Thesis.pdf | 18.72 MB | Adobe PDF | Not accessible |
| 2024_10_Gionfriddo_Executive Summary.pdf | 2.36 MB | Adobe PDF | Not accessible |
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/227430