Exploring the capabilities of generative image captioning models for producing structured output
BRUGNOLI, ALICE
2023/2024
Abstract
The success of CNNs and Transformers has generated interest in multi-modal models, particularly those that combine visual and language inputs. An exploratory analysis is conducted to investigate the effectiveness of this kind of large multi-modal model in extracting scene graphs from an image. In this context, scene graph extraction describes an image by listing object relationships as triplets in a structured format. For this purpose, a specialized scene graph extraction model and two visual question answering (VQA) models were selected. The VQA models were prompted to extract scene graphs in the same format as the first model, employing an approach reminiscent of in-context learning. The VQA prompt includes the required objects and predicates, thereby providing instructions paired with all the elements needed to construct the graph. Multi-modal explainability has also emerged thanks to advances in explainability, which aims to establish confidence in machine learning models and to detect biases and errors. Thanks to multi-modal explainability and to the format of the VQA model prompts, it is possible not only to understand why errors occur or to identify missing information, but also to provide insights in a controlled manner, enabling more precise identification of the model's weaknesses.
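As a purely illustrative sketch of the setup described above, the snippet below shows one way a VQA prompt could be built from the required object and predicate vocabularies and how a triplet-formatted reply could be parsed. The prompt wording, output format, and function names are assumptions for illustration; they are not taken from the thesis.

```python
# Illustrative sketch (not the thesis's actual code): build a prompt that lists
# the allowed objects and predicates, then parse the model's reply into
# (subject, predicate, object) triplets.
import re


def build_prompt(objects, predicates):
    """Compose an instruction-style prompt constrained to the given vocabularies."""
    return (
        "Describe the image as a scene graph. "
        f"Use only these objects: {', '.join(objects)}. "
        f"Use only these predicates: {', '.join(predicates)}. "
        "Answer with one (subject, predicate, object) triplet per line."
    )


def parse_triplets(answer):
    """Extract (subject, predicate, object) triplets from the model's answer."""
    pattern = re.compile(r"\(([^,]+),\s*([^,]+),\s*([^)]+)\)")
    return [tuple(part.strip() for part in m.groups()) for m in pattern.finditer(answer)]


# Hypothetical usage with a made-up model reply.
prompt = build_prompt(["man", "horse", "hat"], ["riding", "wearing"])
reply = "(man, riding, horse)\n(man, wearing, hat)"
print(parse_triplets(reply))  # [('man', 'riding', 'horse'), ('man', 'wearing', 'hat')]
```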
| File | Description | Access | Size | Format |
|---|---|---|---|---|
| 2024_07_BRUGNOLI_TESI.PDF | Thesis | Openly accessible | 27.54 MB | Adobe PDF |
| 2024_07_BRUGNOLI_EXECUTIVESUMMARY.PDF | Executive Summary | Openly accessible | 1.94 MB | Adobe PDF |
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/223438