Explainable multimodal fusion
Alvi, Jaweriah
2020/2021
Abstract
Recently, there has been a lot of interest in explainable predictions, with new explainability approaches being created for specific data modalities such as images and text. However, there is a dearth of understanding and minimal exploration of explainability in the multimodal machine learning domain, where diverse data modalities are fused together in the model. In this thesis project, we look into two multimodal model architectures, namely single-stream and dual-stream, for the Visual Entailment (VE) task, which comprises image and text modalities. The models considered in this project are UNiversal ImageTExt Representation Learning (UNITER), Visual-Linguistic BERT (VL-BERT), Vision-and-Language BERT (ViLBERT), and Learning Cross-Modality Encoder Representations from Transformers (LXMERT). Furthermore, we conduct three different experiments on multimodal explainability by applying the Local Interpretable Model-agnostic Explanations (LIME) technique. Our results show that UNITER achieves the best accuracy among these models on the VE task; however, the explainability of all these models is similar.
Explainable_Multimodal_Fusion_Jaweriah-v2.pdf (Adobe PDF, 3.2 MB, openly accessible)
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/179706