Driving with vision-language models
PETTENON, FRANCESCO
2023/2024
Abstract
In recent years, Vision Language Models (VLMs) have made significant progress in combining computer vision and natural language processing, with promising applications in fields such as autonomous driving. This thesis focuses on two main areas: the DriveLM challenge and driving theory tests. For the DriveLM challenge, which involves Visual Question Answering (VQA) on the NuScenes dataset, we use the pretrained Mantis-8B-Idefics2 model. By applying zero-shot and few-shot learning techniques and by fine-tuning the model for specific question types, we improve its adaptability. We also employ a graph reasoning approach that leverages previous answers to improve contextual understanding. Our results show promising improvements, although they do not yet surpass state-of-the-art performance. In parallel, we explore applying VLMs to driving theory tests in multiple languages, assessing comprehension of road rules and traffic signs. Here we adopt similar methodologies, including grounding techniques and graph reasoning, to align visual and textual data effectively. Our integrated approach, combining pretrained models, fine-tuning, grounding, and graph reasoning, shows promise in improving performance on both VQA and theory tests, contributing to the development of more advanced AI systems for autonomous driving scenarios.
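The abstract does not spell out the prompting pipeline. As a purely illustrative aid, the sketch below shows one way few-shot VQA with answer-chaining (the "graph reasoning" the abstract describes, where later questions are conditioned on earlier answers) could be wired up with Mantis-8B-Idefics2 through the HuggingFace transformers API. The prompt wording, the `previous_qa` structure, and the file name are assumptions for illustration, not the thesis's actual implementation.

```python
# Hypothetical sketch: VQA with Mantis-8B-Idefics2 where answers to earlier
# questions about a scene are fed back as textual context ("graph reasoning").
# Prompt format and previous_qa structure are illustrative assumptions.
from PIL import Image
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

MODEL_ID = "TIGER-Lab/Mantis-8B-Idefics2"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

def answer(image: Image.Image, question: str,
           previous_qa: list[tuple[str, str]]) -> str:
    """Answer one VQA question, conditioning on earlier Q/A pairs."""
    # Fold earlier answers into the textual context (assumed format).
    context = "".join(f"Q: {q}\nA: {a}\n" for q, a in previous_qa)
    messages = [{
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": f"{context}Q: {question}\nA:"},
        ],
    }]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=prompt, images=[image],
                       return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=64)
    # Decode only the newly generated tokens, not the echoed prompt.
    new_tokens = out[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0].strip()

# Usage: later questions see the answers to earlier ones.
frame = Image.open("nuscenes_front_cam.jpg")  # hypothetical NuScenes frame
qa_chain: list[tuple[str, str]] = []
for q in ["What objects are ahead of the ego vehicle?",
          "Given those objects, what should the ego vehicle do next?"]:
    a = answer(frame, q, qa_chain)
    qa_chain.append((q, a))
    print(f"Q: {q}\nA: {a}")
```

Chaining answers this way mirrors the abstract's description: the second question is answered with the first question's answer already in context, rather than in isolation.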
| File | Description | Access | Size | Format |
|---|---|---|---|---|
| 2024_10_Pettenon_Tesi.pdf | Thesis | Openly accessible | 19.34 MB | Adobe PDF |
| 2024_10_Pettenon_Executive_Summary.pdf | Executive Summary | Openly accessible | 397.83 kB | Adobe PDF |
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/227702