On iterative and conditional computation for visual representation learning

Learning effective representations is crucial for scaling the performance of machine learning methods. Deep Neural Networks are flexible models that can learn powerful hierarchical representations by stacking several layers of computation. However, once a representation has been learned, adapting it to new data or behaviours is nontrivial. In this thesis, we take a step towards learning adaptive representations for visual data, addressing the problem from both a practical and a theoretical perspective. First, we study Residual Networks from a dynamical systems perspective and augment them with a mechanism that automatically adapts the number of processing steps to the characteristics of the data. Then, we focus on the problem of learning effective asynchronous representations for event-based data. We propose a recurrent mechanism that automatically learns how to incrementally build a two-dimensional representation from events. This representation can be used as input to convolutional frame-based architectures, improving their performance on optical flow prediction and image recognition tasks relative to hand-designed features. Finally, we focus on the challenging problem of One-Shot Video Object Segmentation, where the model is asked to segment specific objects in unseen videos after observing a single annotated frame. We tackle the problem from a Meta-Learning perspective, showing that it is possible to adapt a generic meta-representation to task-specific representations by modulating the activations of a segmentation network conditioned on the given instance.

Learning representations directly from data is fundamental to improving the performance of machine learning methods. Neural networks are flexible models that can learn powerful hierarchical representations. However, once a representation has been learned, adapting it to new data or behaviours is nontrivial. In this thesis, we take a step towards learning adaptive representations for visual data, addressing the problem from both a practical and a theoretical perspective. First, we study Residual Networks from a dynamical systems perspective, augmenting them with a mechanism that automatically adapts the number of computation steps to the characteristics of the data. Next, we focus on the problem of learning asynchronous representations for event-based data. We propose a recurrent mechanism that automatically learns how to incrementally build a two-dimensional representation directly from events, which can be used as input to convolutional architectures to improve their performance on optical flow prediction and image recognition tasks relative to heuristic, expert-designed features. Finally, we focus on the complex problem of One-Shot Video Object Segmentation, in which the model is asked to segment specific objects in a video after observing a single annotated frame. We approach the problem from a Meta-Learning perspective, showing that it is possible to adapt a generic meta-representation to task-specific representations by modulating the activations of a segmentation network conditioned on the given instance.