Explainable multimodal fusion
Alvi, Jaweriah
2020/2021
Abstract
Recently, there has been a lot of interest in explainable predictions, with new explainability approaches being created for specific data modalities such as images and text. However, there is a dearth of understanding and minimal exploration of explainability in the multimodal machine learning domain, where diverse data modalities are fused together in the model. In this thesis project, we look into two multimodal model architectures, namely single-stream and dual-stream, for the Visual Entailment (VE) task, which comprises image and text modalities. The models considered in this project are UNiversal ImageTExt Representation Learning (UNITER), Visual-Linguistic BERT (VL-BERT), Vision-and-Language BERT (ViLBERT), and Learning Cross-Modality Encoder Representations from Transformers (LXMERT). Furthermore, we conduct three different experiments on multimodal explainability by applying the Local Interpretable Model-agnostic Explanations (LIME) technique. Our results show that UNITER achieves the best accuracy among these models on the VE task; however, the explainability of all these models is similar.
Explainable_Multimodal_Fusion_Jaweriah-v2.pdf (Adobe PDF, 3.2 MB, openly accessible)
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/179706