Transformer networks for the modelling of jazz harmony

Music is a multi-layered art form and a language of its own. It can be considered one of the most ancient forms of communication between humans, and can therefore be studied with the same tools we use to study natural languages. To define a clear linguistic framework and apply it properly to music, we need to separate music into its atomic components, namely melody, harmony and rhythm, and then take their mutual interactions into account. In this work we focus mainly on harmony and on its interaction with rhythm and musical structure, which gives rise to the so-called harmonic rhythm. Harmonic rules and practice are widely recognized to be strongly culture-dependent: a chord progression that sounds very strange to a listener with one cultural background can sound entirely familiar to a listener with another. In this thesis we focused specifically on jazz harmony, a harmonic framework based mostly on the rules of traditional Western harmony. In this context we investigated how to define the perceived “complexity” of a harmonic sequence and tried to relate it to its unpredictability. Predictable sequences are those for which the listener can easily guess the next chord in advance, thanks to the presence of a previously heard common pattern. Within this framework, we define as complex a sequence that is hard to predict and that leaves the listener with a sense of unfulfilled expectation. Conversely, simple sequences are those that completely fulfil the expectation for the next chords and that should be easy to predict, both for a human listener and for a language model. Furthermore, we investigated whether the perceived complexity annotated by a set of listeners correlates with the ability of a deep learning model to predict the next chord in a sequence.
For this work we trained a neural network (NN) model based on the GPT-2 architecture proposed by OpenAI in 2019. This architecture is a state-of-the-art model for natural language processing (NLP), a field of study concerned with the computational modelling of natural languages. The innovative aspect of the underlying Transformer lies in its attention mechanism, which allows it to capture long-term dependencies within the input sequences without the use of recurrence. We trained the model with two versions of a novel dataset containing more than 100 000 chord annotations taken from the well-known Real Book of Jazz, a collection of lead sheets of the so-called standards of jazz music. Furthermore, we validated the model's complexity estimates against perceptual complexity ratings collected by means of a listening test. Although Di Giorgi et al. reported a strong correlation between the cross-entropy computed from the model and the perceptual ratings of a group of listeners, we did not observe this correlation in our experiments. This could be due to the high level of musical sophistication of the repertoire used for training, which makes it difficult for most listeners without musical training to decode the sequences by ear and to evaluate their complexity properly. Furthermore, the jazz vocabulary is not as widespread as that of pop or rock, so listeners on average lack the experience of this genre needed to tell a common jazz sequence from a very uncommon one. As far as chord prediction is concerned, we can confirm that a GPT-based architecture can produce coherent sequences of chords and can also learn harmonic rhythm patterns, a feature that could be put to use in a composition-assistant tool.
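The relation between model cross-entropy and perceived complexity can be illustrated with a minimal sketch. It assumes that the model exposes, for each position in a sequence, the probability it assigned to the chord that actually occurred. The probabilities below are invented for illustration, and the choice of a rank correlation (Spearman, without tie handling) is an assumption rather than the procedure used in the thesis.

```python
import math


def sequence_cross_entropy(step_probs):
    """Mean negative log2-probability (bits per chord) that the model
    assigned to the chords that actually occurred in the sequence."""
    return -sum(math.log2(p) for p in step_probs) / len(step_probs)


def ranks(values):
    """Rank values from 1 to n (assumes no ties, for simplicity)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation for tie-free data."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


# Hypothetical per-chord probabilities for two sequences.
predictable = [0.9, 0.8, 0.95, 0.85]  # model is rarely surprised
surprising = [0.2, 0.05, 0.1, 0.3]    # model is often surprised

ce_predictable = sequence_cross_entropy(predictable)
ce_surprising = sequence_cross_entropy(surprising)
```

Under this reading, a sequence with higher cross-entropy is one the model finds harder to predict. Calling `spearman` on a list of per-sequence cross-entropies and the corresponding mean listener ratings would then quantify how closely the two rankings agree.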

Music is an art form that expresses itself on several levels and constitutes a language in its own right. Indeed, it can be considered one of the most ancient forms of communication between human beings and can therefore be studied with the same tools used to approach natural languages. To define a clear linguistic approach to the study of music, it is necessary to separate it into its fundamental aspects, which we can take to be melody, harmony and rhythm, and to study their interactions. In this work we focus mainly on harmony and on its link with rhythm. The rules and practices of harmony are universally recognized as largely dependent on the culture one belongs to. It is in fact well known that the same chord sequence can sound utterly banal to one individual and, at the same time, completely unexpected to another from a different culture. In this thesis we focused on jazz harmony, a harmonic practice that can broadly be placed within Western musical culture but that nonetheless has some very specific and peculiar characteristics. In particular, we investigated how to define the concept of harmonic complexity associated with a chord sequence and tried to link it to the sequence's unpredictability. Predictable sequences should be perceived as having low complexity, while highly improbable sequences should be perceived as extremely complex. Furthermore, we investigated whether there is a correlation between the perceived complexity of a sequence and the ability of a computational model to predict it. For this thesis we implemented a model based on the GPT-2 architecture proposed by OpenAI in 2019. This model is one of the most recent proposals in the field of NLP, a branch of computer science that studies natural languages.
In the present work we trained the model on an original database of our own, transcribed from the iRealBook application created by Massimo Biolcati in 2010. The database contains more than 100 000 chord sequences in all keys, drawn from the various volumes of the well-known Real Book, a historic archive of transcriptions of the so-called standards of jazz music. We also evaluated the model's ability to predict the perceived complexity of chord sequences by means of a listening test. Although a strong negative correlation between the model's predictive ability and perceived complexity had been demonstrated by Di Giorgi et al., we did not find this correlation in our data. This may be due to several reasons, including the higher degree of sophistication of the repertoire included in the database and the lower popularity of jazz compared with genres such as pop or rock. As for the goal of modelling the rules and practices of jazz harmony, we can confirm that the GPT-2 model produces chord sequences consistent with the database on which it was trained. We also showed that the model effectively learned the concept of harmonic rhythm, a feature that could be usefully exploited as a tool for assisted composition.
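One plausible way to obtain chord sequences in all keys is to transpose each progression through the twelve semitones. The sketch below illustrates that idea only. The chord spelling (a root written with flats, followed by a quality suffix), the `NOTES` table and both function names are assumptions made for the example and do not describe the actual dataset format.

```python
# Hypothetical chord spelling: root (flats only) + quality, e.g. "Bbm7".
NOTES = ["C", "Db", "D", "Eb", "E", "F", "Gb", "G", "Ab", "A", "Bb", "B"]


def transpose_chord(chord, semitones):
    """Shift the root of a chord symbol by a number of semitones,
    leaving the quality suffix (m7, 7, maj7, ...) unchanged."""
    root_len = 2 if len(chord) > 1 and chord[1] == "b" else 1
    root, quality = chord[:root_len], chord[root_len:]
    index = (NOTES.index(root) + semitones) % 12
    return NOTES[index] + quality


def all_keys(progression):
    """Return the progression transposed to each of the twelve keys,
    starting with the original (a shift of zero semitones)."""
    return [[transpose_chord(c, k) for c in progression] for k in range(12)]


# A ii-V-I in C major and its twelve transpositions.
two_five_one = ["Dm7", "G7", "Cmaj7"]
transposed = all_keys(two_five_one)
```

Sharp spellings and slash chords are deliberately left out of this sketch. A real pipeline would need to normalise them before transposing.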