Investigating the task of automated audio captioning: enhancements through data augmentation and multilingual translation
COPPOLA, ANTONINO FABIO
2024/2025
Abstract
This thesis investigates the effectiveness of deep learning techniques for the task of Automated Audio Captioning, which consists of generating natural language descriptions of audio content. We employ a publicly available sequence-to-sequence model as a baseline. The encoder is a convolutional neural network based on ConvNeXt, which extracts audio embeddings, while the decoder is a vanilla Transformer that receives the embeddings and generates the corresponding captions. To maintain semantic consistency while introducing linguistic variation, and to enhance the model's performance, we integrate two data augmentation techniques: back-translation and paraphrasing with the LLaMA large language model. Both techniques double the number of captions available in the Clotho dataset from 5 to 10 per audio sample, thereby enriching the training data. Both methods improve audio captioning performance across standard metrics (BLEU, METEOR, and ROUGE) evaluated on the Clotho dataset. Specifically, the back-translation technique achieves scores of 0.636 in BLEU-1, 0.197 in BLEU-4, 0.417 in ROUGE-L, and 0.201 in METEOR. Furthermore, this thesis introduces the novel capability to translate captions into multiple languages (Italian and Portuguese) using the pre-trained Helsinki-NLP/opus-mt-en-ROMANCE model, paving the way for wider adoption in multilingual contexts.
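The abstract names two caption-level translation steps: back-translation for data augmentation and multilingual caption output via the pre-trained Helsinki-NLP/opus-mt-en-ROMANCE model. The sketch below shows how both could look with Hugging Face MarianMT. Only the en-ROMANCE model name comes from the abstract; the pivot language (Italian), the return model Helsinki-NLP/opus-mt-ROMANCE-en, and the generation settings are illustrative assumptions, not the thesis's exact configuration.

```python
# Minimal sketch: (1) back-translation augmentation (en -> pivot -> en) and
# (2) translating captions into Italian and Portuguese.
# Assumptions: Italian as the back-translation pivot and
# Helsinki-NLP/opus-mt-ROMANCE-en as the return model; the abstract does not
# specify the thesis's actual back-translation setup.
from transformers import MarianMTModel, MarianTokenizer

EN_ROMANCE = "Helsinki-NLP/opus-mt-en-ROMANCE"  # named in the abstract
ROMANCE_EN = "Helsinki-NLP/opus-mt-ROMANCE-en"  # assumed return model


def load(name):
    return MarianTokenizer.from_pretrained(name), MarianMTModel.from_pretrained(name)


def translate(tokenizer, model, sentences):
    batch = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    out = model.generate(**batch, num_beams=4, max_length=128)
    return tokenizer.batch_decode(out, skip_special_tokens=True)


tok_fwd, mdl_fwd = load(EN_ROMANCE)
tok_bwd, mdl_bwd = load(ROMANCE_EN)

captions = ["A dog barks repeatedly while cars pass by in the rain."]

# (1) Back-translation: the multilingual en-ROMANCE model selects its target
# language via a prefix token (e.g. >>it<<); translate to the pivot language,
# then back to English to obtain a paraphrased caption.
pivot = translate(tok_fwd, mdl_fwd, [f">>it<< {c}" for c in captions])
augmented = translate(tok_bwd, mdl_bwd, pivot)

# (2) Multilingual captions: the same forward model produces the Italian and
# Portuguese outputs directly.
italian = translate(tok_fwd, mdl_fwd, [f">>it<< {c}" for c in captions])
portuguese = translate(tok_fwd, mdl_fwd, [f">>pt<< {c}" for c in captions])

print(augmented, italian, portuguese)
```

In this scheme, each original Clotho caption yields one back-translated variant, which is how the per-audio caption count could grow from 5 to 10 as the abstract describes.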
| File | Description | Size | Format | Access |
|---|---|---|---|---|
| 2025_07_Coppola.pdf | Thesis text | 7.09 MB | Adobe PDF | Authorized users only, starting from 25/06/2028 |
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/240020