OFA multitask: A General-Purpose Multitasking Vision and Language Model

As human beings, we have the innate ability to process multiple modalities: the real world is inherently multimodal. Progress in multimodal learning has the potential to fulfil a long-standing goal of the field, namely to move from the statistical analysis of a single modality (such as images, text or speech) towards a multifaceted understanding of multiple modalities and their interactions. Humans learn through their five senses: sight, hearing, touch, smell and taste. Accordingly, to match human abilities, a general AI will plausibly need to process and reason over different types of input coming from different modalities, such as images, text, audio or video. In recent years, multimodal AI has received considerable attention from both researchers and the general public, thanks to a new generation of models whose performance was simply unimaginable just a few years ago. In this work we focus on a vision and language model called OFA. This model is a successful attempt at a unified architecture that processes images and text and performs a variety of tasks, ranging from image generation to classification and text generation, all with the same architecture and without introducing ad hoc parameters. Despite its remarkable performance, we can single out some weaknesses that one would not expect from a truly "unifying" model. In particular, OFA's ability to understand textual input prompts (the actual input fed to the model to perform a task) is quite limited and, in addition, the performance of the pretrained model on a range of multimodal and unimodal tasks is quite poor without task-specific fine-tuning.
The purpose of this thesis is therefore to overcome these limitations and to assess whether a single model can tackle the vast majority of existing NLP and vision and language tasks with the same architecture and the same set of parameters. To this end, we perform multitask training in which the model is challenged with a total of 13 different tasks, each with varying input prompts. We find that the resulting model, which we call OFA multitask, is robust to prompt variation and achieves good performance on many different tasks without the need for task-specific fine-tuning.

As human beings, we have the innate ability to process information from multiple senses: the real world is inherently multimodal. The progress of artificial intelligence in multimodal learning has the potential to bring about the long-sought shift from the statistical analysis of a single modality (such as images, text or audio) to a multifaceted understanding of multiple modalities and their interaction. Human beings learn using their five senses: sight, hearing, touch, smell and taste. Consequently, to match human abilities, a general AI will plausibly need to process and reason over different types of input coming from different modalities, such as images, text, audio or video. In this regard, in recent years the field of multimodal AI has received great attention from the public and from researchers, thanks to the development of a new generation of models able to carry out tasks with performance so good that it was simply unimaginable only a few years ago. In particular, in this work we focus on a vision and language model called OFA. This model is an attempt at a unifying architecture that can process images and text and perform a variety of tasks, ranging from image generation to classification and text generation, all with the same architecture and without introducing ad hoc parameters. Despite the remarkable performance of this model, we can identify some weaknesses that one would not expect from a truly "unifying" model.
In particular, OFA's ability to understand textual input instructions (the actual input given to the model to perform a task) is rather limited and, moreover, the performance of the pretrained model on a range of multimodal and unimodal tasks is rather poor without task-specific fine-tuning. The purpose of this thesis is therefore to find a way to overcome these limitations and to assess whether it is possible to obtain a model able to tackle the vast majority of existing Natural Language Processing, Computer Vision, and Vision and Language tasks with the same architecture and the same set of parameters. To this end, in this work we carried out multitask training in which the model is tested on a total of 13 different tasks, each with different input prompts, and we found that the resulting model, which we call OFA multitask, achieves notable robustness in understanding input instructions and good performance on many tasks without the need for fine-tuning specific to each of them.