The impact of spatio-visual awareness in context-based Scene Graph Generation
Scozzari, Salvatore
2023/2024
Abstract
To date, the combination of Deep Learning and Computer Vision offers a valid, albeit still experimental, route to automating the understanding of the visual content of images. The development of Object Detectors such as Faster R-CNN and YOLO represents a notable advance in perceiving and localizing the salient elements of visual scenes. Artificial Intelligence (AI) can draw bounding boxes, as tight as possible, around cars, people, trees or anything else, in any image. Yet this is not enough to exhaust the concept of "visual perception"; what is missing? Identifying entities in an image answers the questions "Where" and "What"; what remains open is the "Why". Human beings understand that a flower and a vase appear together in a scene because one serves as a support for the other; more generally, we get closer to the raison d'être of objects if we conceive of them not as isolated, but as interconnected. It is on this consideration that the research area of Scene Graph Generation (SGG) is founded: its objective is to provide a structured representation of the mutual connections between the salient elements of a visual scene. In detail, researchers develop neural architectures that output a directed graph, called a "scene graph", whose nodes are the objects present in the image and whose edges are labeled with a specific predicate class. The building blocks of a scene graph are therefore triplets (subject, predicate, object), such as (person, sits on, chair).

Restricting attention to the Predicate Classification (PredCls) sub-task, whose goal is to classify the predicate linking a (subject, object) pair, the state of the art treats Scene Graph Generation as a continuation of Object Detection. The prevailing paradigm manipulates the multimodal features returned by an Object Detector such as Faster R-CNN, which are then fed to the predicate-classification pipeline. The modalities considered are visual, spatial and semantic in nature, and modern Scene Graph Generation methods attempt to develop the synergy between them through solutions that go beyond their simple, and largely ineffective, concatenation. Among the methods that seek to best build cooperation between visual, spatial and semantic content are those that exploit context-based components such as BiLSTMs or Transformers. The underlying belief is that understanding the relationships between objects passes through an imperceptible medium called "context", and that context can further serve as a refinement mechanism for the modalities listed above.

Our work questions this belief and demonstrates that existing context-based Scene Graph Generators do not fully exploit the contextual properties of the features reworked by BiLSTMs or Transformers. With respect to the modalities considered, we observe that context does not effectively improve the visual-spatial synergy. To reach these conclusions, we rely on a carefully chosen context-based baseline and analyze its performance on a dataset well known in the literature. Through qualitative and quantitative analyses, we expose spatial deficits in the predicted relationships. We exploit this finding to direct the subsequent ablation study, which proceeds by removing or altering architectural components. Our interventions aim to build a context that represents, as faithfully as possible, the cooperation between the visual and spatial domains; the results demonstrate that entrusting this burden to the context is superfluous, or even counterproductive.
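To make the structure described above concrete, the following sketch (illustrative only, not taken from the thesis) shows one way to represent a scene graph in code: nodes are detected objects with their bounding boxes, and directed, labeled edges encode the (subject, predicate, object) triplets.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SceneGraph:
    """Minimal scene-graph container: nodes are object labels with boxes,
    edges are (subject_index, predicate, object_index) triplets."""
    objects: List[str] = field(default_factory=list)  # e.g. "person", "chair"
    boxes: List[Tuple[float, float, float, float]] = field(default_factory=list)  # (x1, y1, x2, y2)
    triplets: List[Tuple[int, str, int]] = field(default_factory=list)

    def add_object(self, label: str, box: Tuple[float, float, float, float]) -> int:
        self.objects.append(label)
        self.boxes.append(box)
        return len(self.objects) - 1

    def add_relation(self, subj: int, predicate: str, obj: int) -> None:
        self.triplets.append((subj, predicate, obj))

    def readable_triplets(self) -> List[Tuple[str, str, str]]:
        return [(self.objects[s], p, self.objects[o]) for s, p, o in self.triplets]


# Example: the (person, sits on, chair) triplet mentioned in the abstract.
g = SceneGraph()
person = g.add_object("person", (40.0, 30.0, 180.0, 320.0))
chair = g.add_object("chair", (60.0, 200.0, 220.0, 380.0))
g.add_relation(person, "sits on", chair)
print(g.readable_triplets())  # [('person', 'sits on', 'chair')]
```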
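The abstract contrasts plain concatenation of visual, spatial and semantic features with context-based refinement via BiLSTMs or Transformers. The PyTorch sketch below is a minimal, hypothetical illustration of a PredCls head with and without such a context module; the module names, feature dimensions and the choice of an `nn.TransformerEncoder` are assumptions made for illustration and do not reproduce the baseline architecture studied in the thesis.

```python
import torch
import torch.nn as nn


class PredClsHead(nn.Module):
    """Illustrative predicate classifier over per-pair multimodal features.

    visual:   ROI features for each (subject, object) pair (e.g. from Faster R-CNN)
    spatial:  encoded box geometry of subject and object
    semantic: embeddings of the subject/object class labels
    """

    def __init__(self, d_vis=512, d_spa=64, d_sem=200, n_predicates=51, use_context=True):
        super().__init__()
        self.proj = nn.Linear(d_vis + d_spa + d_sem, 512)
        self.use_context = use_context
        if use_context:
            # Context module: each (subject, object) pair attends to the others in
            # the same image; a BiLSTM over the pair sequence is a common alternative.
            layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
            self.context = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(512, n_predicates)

    def forward(self, visual, spatial, semantic):
        # (batch, n_pairs, d_*): simple concatenation of the three modalities.
        x = torch.cat([visual, spatial, semantic], dim=-1)
        x = self.proj(x)
        if self.use_context:
            x = self.context(x)  # contextual refinement across pairs
        return self.classifier(x)  # predicate logits per pair


# Toy usage: one image with six candidate (subject, object) pairs.
head = PredClsHead(use_context=True)
logits = head(torch.randn(1, 6, 512), torch.randn(1, 6, 64), torch.randn(1, 6, 200))
print(logits.shape)  # torch.Size([1, 6, 51])
```

Ablating the context module amounts to setting `use_context=False`, i.e. classifying the predicate directly from the concatenated modalities; comparisons of this kind mirror, at a toy scale, the removal-or-alteration strategy described in the abstract.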
| File | Description | Size | Format |
|---|---|---|---|
| Salvatore Scozzari, The impact of spatio-visual awareness in context-based Scene Graph Generation.pdf (accessible online only to authorized users) | Final thesis | 8.78 MB | Adobe PDF |
| Salvatore Scozzari Executive Summary, The impact of spatio-visual awareness in context-based Scene Graph Generation.pdf (accessible online only to authorized users) | Final thesis | 389.93 kB | Adobe PDF |
https://hdl.handle.net/10589/223679