Testing, validation and adaptation for reliable deep learning systems
Hu, Davide Yi Xian
2025/2026
Abstract
The increasing adoption of Deep Learning (DL) models in safety-critical systems, such as autonomous driving, has exposed a fundamental challenge: their frequent failure to generalize to new and unforeseen operational domains. This generalization gap, which arises from the models’ inability to handle out-of-distribution data, represents a critical barrier to ensuring their reliability and safety. Existing approaches to testing and adaptation often address this problem in isolation, resulting in fragmented solutions for test generation, validation, and runtime adaptation. This thesis advances the software engineering of DL systems by contributing a suite of distinct techniques that address key limitations across the generalization testing lifecycle. First, it introduces novel methods for the automated generation of diverse and realistic test scenarios, leveraging multi-modal generative AI to augment both static datasets and dynamic, physics-based simulators without the need for extensive per-domain training. Second, it presents a robust framework for the multi-modal validation of these generated test cases, assessing their semantic, structural, and geometric integrity, and introduces a novel mechanism to automatically repair flawed test inputs. Finally, it delivers a self-healing framework that enables DL models to autonomously detect and adapt to new operational domains at runtime, using an unsupervised approach to mitigate catastrophic forgetting without requiring new labeled data. The effectiveness of these contributions is demonstrated through extensive evaluations across multiple domains, including autonomous driving, medical imaging, and industrial automation. The results show that the proposed techniques not only uncover critical model weaknesses but also significantly improve the reliability of DL systems, providing a more comprehensive approach to ensuring their dependability in the real world.
| File | Size | Format | |
|---|---|---|---|
| phd_thesis - 2026-03-11T233727.021.pdf (openly accessible online) | 9.27 MB | Adobe PDF | View/Open |
Documents in POLITesi are protected by copyright, and all rights are reserved unless otherwise indicated.
https://hdl.handle.net/10589/254857