Linking signal and semantic representations of musical content for music information retrieval

BUCCOLI, MICHELE

Abstract

In recent years, advances in technology and connectivity have revolutionized the music industry and given rise to novel content consumption scenarios. People can now easily access millions of songs through multiple browsing, diffusion, and streaming platforms, and therefore need new navigation and browsing tools, music recommendation assistants, virtual clerks, etc. Music Information Retrieval (MIR) is a multidisciplinary research field that addresses the issues raised by the design of search, navigation, and annotation strategies, in order to develop the tools, such as content classification and rich annotation algorithms, that are required to support such scenarios. The design of tools and applications for MIR-related scenarios requires us to model and account for two different levels of abstraction in the description of musical content: the semantic level and the signal-based level. The former concerns how we subjectively perceive and interpret musical properties, which terms we choose to describe them, and how we use such terms to discuss music. The latter concerns the objective properties of musical signals, ranging from those that can be directly computed from the signal to those that can be inferred from it, such as rhythmic and tonal properties. These two description levels, however, are widely separated from each other, which is why MIR research today focuses on bridging this gap. In order to develop MIR applications, modern approaches must take into account all abstraction levels in the description of musical content. This means that, in addition to considering the signal domain and the semantic domain, such approaches focus on the linking function between them. In this work we follow a schema that involves the formalization of the signal and semantic domains, as well as the design of the related linking function.
The thesis begins by discussing possible formalizations of the signal domain based on the extraction of a feature representation of the musical content. This formalization requires deep knowledge of musical properties and of the features that are able to capture them. As part of this formalization process, we present an application of feature-based analysis to the scenario of Networked Music Performance. Through this scenario we show how to estimate some of the musical properties that appear to affect the overall quality of the networked performance experience. We also discuss how deep learning techniques can automatically extract (or learn) an effective feature representation of the musical content. We first describe how to design the linking function by means of rule-based techniques, which follow a manually designed algorithm. More often, however, the link between the two domains is not clear and is hard to design by means of a procedural solution. In such cases, we show how to use machine learning techniques to automatically learn and predict the relation between the two domains. The semantic domain, in turn, can be formalized following two main approaches: the categorical approach, which defines which descriptors can represent a given song; and the dimensional approach, which also specifies how much each of those descriptors represents the given song. The set of semantic descriptors can be extended by including a semantic similarity among them, either manually defined or automatically inferred; the resulting set of semantic descriptors and similarities is referred to as a semantic model. In this regard, we present a study that enriches a manually defined generic dataset of dimensional descriptors with music-specific information automatically inferred from users' annotations.
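As a rough illustration of the distinctions drawn above, the following Python sketch contrasts a rule-based linking function that outputs categorical descriptors with a learned one that outputs dimensional (weighted) descriptors. Everything in it is hypothetical: the two features (tempo and brightness), the thresholds, the descriptor names, and the nearest-centroid learner are assumptions chosen for brevity, not the features or models used in the thesis.

```python
from math import dist


def normalize(features):
    """Scale (tempo in BPM, brightness in [0, 1]) so both axes are comparable.

    The divisor 200 is an arbitrary assumption for this sketch.
    """
    tempo_bpm, brightness = features
    return (tempo_bpm / 200.0, brightness)


def rule_based_link(features):
    """Rule-based linking function: hand-written thresholds yield a
    categorical annotation, i.e. the set of descriptors that apply."""
    tempo_bpm, brightness = features
    labels = set()
    if tempo_bpm >= 140.0:
        labels.add("energetic")
    if brightness <= 0.3:
        labels.add("dark")
    return labels


def fit_centroids(examples):
    """Learn one centroid per descriptor from (features, descriptor) pairs."""
    sums, counts = {}, {}
    for features, label in examples:
        x, y = normalize(features)
        acc = sums.setdefault(label, [0.0, 0.0])
        acc[0] += x
        acc[1] += y
        counts[label] = counts.get(label, 0) + 1
    return {
        label: (acc[0] / counts[label], acc[1] / counts[label])
        for label, acc in sums.items()
    }


def learned_link(features, centroids):
    """Learned linking function: returns a dimensional annotation, i.e. a
    weight per descriptor (weights sum to 1, closer centroid = higher weight)."""
    point = normalize(features)
    closeness = {
        label: 1.0 / (1e-9 + dist(point, centroid))
        for label, centroid in centroids.items()
    }
    total = sum(closeness.values())
    return {label: value / total for label, value in closeness.items()}


# Hypothetical annotated examples: ((tempo, brightness), descriptor).
TRAINING = [
    ((172.0, 0.81), "energetic"),
    ((160.0, 0.70), "energetic"),
    ((68.0, 0.22), "dark"),
    ((80.0, 0.30), "dark"),
]

centroids = fit_centroids(TRAINING)
categorical = rule_based_link((150.0, 0.6))
dimensional = learned_link((150.0, 0.6), centroids)
print(categorical)
print(dimensional)
```

The categorical output only says which descriptors apply, while the dimensional output also says how strongly each one applies, which is the distinction the abstract draws between the two approaches.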
We also consider the structure of songs as a special case of formalization of the semantic domain. After the main components are defined and formalized, we apply the schema to a number of MIR-related application scenarios of gradually increasing complexity. The first application scenario involves the analysis and extraction of a song's structure. We rely on deep learning techniques to automatically extract a feature-based representation of the signal domain. By doing so, we address the uncertainty about which properties are suitable for describing the musical content for the task of musical structure analysis. Since the semantic domain is well formalized, we are able to apply rule-based techniques from the MIR literature to retrieve the structure. In the second application scenario, we address the detection of bootlegs, i.e., unauthorized recordings of live performances. We formalize the semantic domain by following the categorical approach, which consists of describing a given song as a bootleg, an official live performance, or a studio recording. We employ learned features to formalize the signal domain and machine learning techniques to tackle the complexity of defining the linking function. The third application scenario concerns the automatic annotation of violin recordings with their timbral qualities. The timbral properties of violins are described by rather imprecise semantics, which we formalize through a set of six dimensional descriptors. Although the number of descriptors is rather limited, the adoption of a dimensional approach helps us increase the expressiveness of this semantic model. As in the previous application scenarios, we employ learned features to address the uncertainty in the formalization of the signal domain, and machine learning techniques to automatically design the linking function and ultimately assess the timbral qualities of the instrument.
In the last scenario we address the definition of a semantic model by considering the ambiguities that commonly occur in natural language. We investigate a formalization of the semantic domain that is able to handle polysemy, which occurs when descriptors take on different meanings in different semantic contexts. We embed the resulting semantic model (which is based on overlapping semantic contexts and on context-dependent semantic similarities) in a prototype music search engine. This prototype is a novel application that allows users to retrieve musical content by using natural language queries.
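To make the idea of context-dependent semantic similarity more concrete, here is a minimal Python sketch in which the same descriptor ("dark") retrieves different songs depending on the semantic context of the query. The contexts, descriptor names, similarity values, song names, and scoring rule are all hypothetical assumptions made for illustration; they are not taken from the thesis or from its prototype.

```python
# Hypothetical semantic model: for each context, similarities in [0, 1]
# between pairs of descriptors. The same term ("dark") relates to
# different descriptors in different contexts, which models polysemy.
SIMILARITY = {
    "mood": {("dark", "sad"): 0.8, ("dark", "gloomy"): 0.9},
    "timbre": {("dark", "warm"): 0.7, ("dark", "bright"): 0.0},
}

# Hypothetical dimensional annotations: descriptor -> weight for each song.
SONGS = {
    "requiem": {"sad": 0.9, "gloomy": 0.6},
    "cello_suite": {"warm": 0.8},
    "pop_hit": {"bright": 0.9},
}


def similarity(context, a, b):
    """Symmetric, context-dependent similarity; unknown pairs count as 0."""
    if a == b:
        return 1.0
    table = SIMILARITY.get(context, {})
    return table.get((a, b), table.get((b, a), 0.0))


def score(annotations, query_term, context):
    """Best match between the query term and any annotated descriptor,
    scaled by how strongly that descriptor applies to the song."""
    return max(
        (weight * similarity(context, query_term, descriptor)
         for descriptor, weight in annotations.items()),
        default=0.0,
    )


def search(query_term, context):
    """Rank all songs for a query term interpreted within one context."""
    return sorted(
        SONGS,
        key=lambda name: score(SONGS[name], query_term, context),
        reverse=True,
    )


mood_ranking = search("dark", "mood")
timbre_ranking = search("dark", "timbre")
print(mood_ranking)
print(timbre_ranking)
```

A real system would also have to infer the context from the natural language query; this sketch simply takes the context as an explicit argument.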

	Supervisor
				SARTI, AUGUSTO
	Coordinator
				BONARINI, ANDREA
	Tutor
				BERTUCCIO, GIUSEPPE
	Co-supervisor(s)
				ZANONI, MASSIMILIANO
	Date
				22 Feb 2017
	Italian abstract (translated into English)
				In recent years, the evolution of technology and of connection networks has led to a revolution in the ways music is consumed. Within a decade, millions of songs have become available through multiple streaming or diffusion platforms, creating the need for new navigation tools, automatic listening suggestions, virtual shops, etc. Music Information Retrieval (MIR) is a multidisciplinary research field that addresses the problems related to devising new search, navigation, and annotation strategies in order to implement the tools needed to support these scenarios, such as algorithms for the automatic annotation and classification of songs.

Designing tools for these MIR scenarios requires describing music from two different points of view: one tied to semantics and one tied to the signal. The first concerns the way we subjectively perceive and interpret musical characteristics, which terms we choose to describe them, and how we use these terms when we talk about music. The second concerns the objective properties of musical signals, from those that are directly computable to less direct ones, such as rhythmic or tonal properties. These two levels of abstraction, however, are like separate worlds, which MIR research tries to bring closer and connect.

Modern approaches to developing MIR applications must take into account all levels of abstraction of the musical description. This means that, besides considering the signal domain and the semantic domain, these approaches also focus on the linking function. In this thesis, we follow a schema that involves formalizing the signal and semantic domains and designing the related linking function.

The thesis starts from a discussion of the different ways of formalizing the signal domain, beginning with the extraction of a feature-based representation of the musical content. This formalization requires deep knowledge of musical properties and of the features able to capture them. As part of this formalization, therefore, we discuss our research aimed at estimating, through an analysis of audio features, how some musical properties affect the quality of the remote performance experience. We also discuss how deep learning techniques can be used to automatically extract (or learn) an effective representation of the musical content.

We also describe how to design the linking function, starting from procedural techniques, which follow a manually designed algorithm. Often, however, the link between the two domains is not clear, and it is therefore hard to find a procedural solution. In these cases, machine learning techniques can be used to learn the relation between the two domains.

The semantic domain, in turn, can be formalized following two main approaches: the categorical one, which defines which descriptors can represent a given song; and the dimensional one, which also specifies how much the descriptors represent the song. The set of semantic descriptors can also be enriched by including the semantic similarity among descriptors, either manually defined or automatically learned; the set of descriptors and similarities is called a semantic model. In this regard, we also discuss our research aimed at enriching a generic dataset of dimensional descriptors with music-specific information automatically extracted from user annotations. We also consider the structure of songs as a special case of formalization of the semantic domain.

Once the main components are defined and formalized, we use this schema in a series of application scenarios of increasing complexity.

The first application scenario concerns the analysis and extraction of musical structure. We rely on deep learning techniques to extract a representation of the signal domain. In doing so, we resolve the uncertainty about which musical properties are useful for describing the structure of a song. Since the semantic domain is, by contrast, well formalized, we can use procedural techniques common in the literature to extract the structure.

In the second application scenario we address the recognition of bootlegs, i.e., unauthorized recordings of live performances. The semantic domain of this scenario can be formalized by assigning each song to one of the following categories: bootleg, official live recording, or studio recording. We use deep learning techniques to formalize the signal domain and machine learning techniques to ease the design of the linking function.

The third application scenario concerns the automatic annotation of the timbral qualities of violins, starting from audio recordings. Timbral qualities are described by ambiguous semantics, which we formalize through six dimensional descriptors. The dimensional approach allows us to increase the expressiveness of the semantic model, which would otherwise be limited by the small number of descriptors. As in the previous scenarios, the signal domain is formalized with deep learning techniques and linked to the semantic domain through machine learning techniques.

In the last scenario we deal with the definition of a semantic model that accounts for the ambiguities frequent in natural language. We investigate a formalization of the semantic domain able to model polysemy, i.e., terms that take on different meanings when used in different semantic contexts. We use our semantic model, based on overlapping semantic contexts and on semantic similarities that differ across contexts, to implement a prototype music search engine: an innovative application able to process natural language queries to find the desired songs.
	Document type
				Doctoral thesis
	Appears in types:
				Doctoral thesis

Attached files

File	Size	Format
thesis.pdf (openly accessible online; description: thesis text)	16.36 MB	Adobe PDF

Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10589/132057