Video understanding with multi-modal LLMs: a neuro-symbolic approach for video question answering using generated scene graphs
Lusha, Fabio
2024/2025
Abstract
Large Language Models (LLMs) have revolutionized the field of natural language processing, and the emergence of multi-modal LLMs has similarly transformed visual understanding. These advances, coupled with growing computational capabilities, have renewed interest in video understanding, the task of interpreting complex video content. This field has become increasingly relevant given the vast amounts of video data generated daily across domains such as social media, healthcare, surveillance, and autonomous driving, where automated analysis promises significant societal impact. Despite recent progress, video understanding remains challenging as it requires advanced spatio-temporal and causal reasoning beyond frame-level perception. Video Question Answering (VideoQA) serves as a central benchmark for measuring model capabilities in this domain, and the recent development of Video Large Language Models (Video-LLMs) has advanced the state of the art. However, Video-LLMs still exhibit critical limitations in understanding videos with complex spatio-temporal and causal relationships. Moreover, a notable performance gap persists between proprietary frontier models and open-source alternatives. This thesis explores the use of structured Spatio-Temporal Scene Graphs as an intermediate representation bridging video content and natural language queries. The work investigates how to leverage Video-LLMs to automatically generate Scene Graphs without manual annotation and evaluates their effectiveness as symbolic representations for VideoQA tasks. This approach aims to enhance both the accuracy and interpretability of Video-LLMs in complex video understanding by providing a structured representation of video content that captures key spatio-temporal relationships.
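As a minimal sketch of the central idea (not the thesis's actual pipeline; every class, function, and triplet below is an illustrative assumption), a spatio-temporal scene graph can be thought of as a set of time-stamped subject-predicate-object triplets emitted by a Video-LLM, which are then serialized into plain text and passed to a language model together with the question:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triplet:
    """One time-stamped relation: subject --predicate--> object."""
    frame: int        # frame (or clip) index where the relation holds
    subject: str
    predicate: str
    obj: str

# Hypothetical triplets a Video-LLM might emit when asked to describe
# each frame as subject-predicate-object relations (no manual annotation).
scene_graph = [
    Triplet(0,  "person", "holds",       "cup"),
    Triplet(0,  "cup",    "on",          "table"),
    Triplet(12, "person", "drinks_from", "cup"),
    Triplet(30, "person", "puts_down",   "cup"),
]

def serialize(graph: list[Triplet]) -> str:
    """Flatten the scene graph into plain text usable as LLM context."""
    return "\n".join(
        f"frame {t.frame}: {t.subject} {t.predicate} {t.obj}" for t in graph
    )

def build_prompt(graph: list[Triplet], question: str) -> str:
    """Combine the symbolic scene graph with a VideoQA question."""
    return (
        "Scene graph of the video:\n"
        f"{serialize(graph)}\n\n"
        f"Question: {question}\nAnswer:"
    )

print(build_prompt(scene_graph, "What does the person do after drinking?"))
```

Because this intermediate representation is explicit text rather than opaque embeddings, it can be inspected directly, which is where the interpretability benefit mentioned in the abstract comes from.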

| File | Size | Format | |
|---|---|---|---|
| 2025_10_Lusha_Executive_Summary.pdf (summary; accessible online to everyone from 30/09/2026) | 591.38 kB | Adobe PDF | View/Open |
| 2025_10_Lusha_Thesis.pdf (thesis; accessible online to everyone from 30/09/2026) | 4.88 MB | Adobe PDF | View/Open |
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/243824