Music Source Separation (MSS) algorithms are widely used, allowing artists and researchers alike to separate mixed tracks into stems, typically four: vocals, bass, drums, and other. Drums Demixing (DDX) is a subfield of MSS that aims to obtain isolated stems for the individual drum kit parts from a drums-only mixture. At present, extracting both instrumental and drum stems from a song requires first applying an MSS algorithm and then applying a DDX model to the drums stem isolated by the former. The goal of this thesis is to evaluate the performance of a single model that performs MSS and DDX jointly, comparing it to current two-stage approaches. To do so, two state-of-the-art architectures were trained and tested as MSS (4-stem), DDX (5-stem), and MSS+DDX (8-stem) models. Results show that single-stage configurations are less effective than their two-stage counterparts in terms of signal-to-distortion ratio, but require roughly half the inference time on the same hardware, highlighting a trade-off between performance and computational requirements.
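The abstract compares configurations by signal-to-distortion ratio (SDR) but does not say which SDR variant the thesis uses. As a minimal sketch, assuming the plain energy-ratio definition (reference energy over the energy of the estimation error, in dB), the metric can be computed as follows. The function name `sdr_db`, the `eps` stabiliser, and the toy signals are illustrative assumptions, not taken from the thesis.

```python
import math


def sdr_db(reference, estimate, eps=1e-10):
    """Plain SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2).

    Assumption: this is the basic energy-ratio form; the thesis may use
    a different variant (for example a framewise or scale-invariant one).
    """
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    # eps keeps the ratio finite for silent references or perfect estimates
    return 10.0 * math.log10((signal_energy + eps) / (error_energy + eps))


# Toy example: a sine "stem" and two scaled-down estimates of it.
reference = [math.sin(2 * math.pi * 5 * n / 100) for n in range(100)]
close_estimate = [0.9 * x for x in reference]  # error is 10% of the signal
poor_estimate = [0.5 * x for x in reference]   # error is 50% of the signal

print(round(sdr_db(reference, close_estimate), 2))
print(round(sdr_db(reference, poor_estimate), 2))
```

A higher value means the estimated stem is closer to the reference, so under this definition the first estimate scores higher than the second.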
Towards joint music source separation and deep drums demixing
COLOTTI, FRANCESCO
2023/2024
File | Description | Size | Format | Access
---|---|---|---|---
2024_10_Colotti_tesi_01.pdf | Thesis, article format | 1.17 MB | Adobe PDF | Not accessible
2024_10_Colotti_executive_summary_02.pdf | Executive summary | 512.79 kB | Adobe PDF | Not accessible
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/227677