Automated detection of probabilistic behavior boundaries of ML-enabled autonomous systems

Advances in Machine Learning (ML) have strongly influenced the autonomous systems domain. Autonomous systems with ML components have become crucial to tackling many complex tasks, as they can navigate dynamic and uncertain environments, make real-time decisions, and interact autonomously with humans and other entities. However, ML components introduce new challenges and risks that are hard to manage with traditional software engineering practices, because their nonlinear behavior is heavily influenced by training data. In this thesis, we introduce the concept of the probabilistic frontier of behaviors: a set of inputs that cause system failures with a given probability. These inputs are test cases that exemplify system misbehavior in a region of the input space, and domain experts can generate multiple frontiers to evaluate the quality of the system thoroughly. We propose DeepSpectrum, a search tool that explores the input space of a system to find diverse test cases on the probabilistic frontier of behaviors. DeepSpectrum leverages a model of the input space to generate realistic inputs that are human-interpretable and meaningful for evaluating the system. Experimental results show that DeepSpectrum's higher complexity and longer execution times, compared with existing methods, are justified by its effectiveness in identifying a wider and more varied test set. Its ability to explore multiple probabilistic frontiers of behaviors and to steer the search towards diverse regions of the input space allows it to uncover a more comprehensive range of system failures and edge cases. The resulting test cases give domain experts valuable insights for understanding and addressing system vulnerabilities effectively.
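The definition of the probabilistic frontier can be made concrete with a small sketch. It is only one possible reading of the definition, not DeepSpectrum's implementation: the abstract does not say how the failure probability of an input is obtained, so the code assumes it is estimated by repeated stochastic executions. All names here (`estimate_failure_probability`, `on_frontier`, `toy_system_fails`, the `tolerance` parameter) are hypothetical, and the toy system stands in for a real ML-enabled system.

```python
import random
from typing import Callable, Optional

# Hypothetical signature: one stochastic execution of the system on input x,
# returning True when the system misbehaves.
FailureOracle = Callable[[float, random.Random], bool]


def estimate_failure_probability(
    system_fails: FailureOracle,
    x: float,
    runs: int = 200,
    rng: Optional[random.Random] = None,
) -> float:
    """Monte Carlo estimate: fraction of runs in which input x causes a failure."""
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible
    failures = sum(1 for _ in range(runs) if system_fails(x, rng))
    return failures / runs


def on_frontier(
    system_fails: FailureOracle,
    x: float,
    target: float,
    tolerance: float = 0.05,
    runs: int = 200,
    rng: Optional[random.Random] = None,
) -> bool:
    """Assumed membership test: x is on the frontier for `target` when its
    estimated failure probability is within `tolerance` of `target`."""
    p_hat = estimate_failure_probability(system_fails, x, runs, rng)
    return abs(p_hat - target) <= tolerance


def toy_system_fails(x: float, rng: random.Random) -> bool:
    """Toy stand-in for a real system: fails with probability x clipped to [0, 1]."""
    return rng.random() < min(max(x, 0.0), 1.0)
```

Under these assumptions, a domain expert could call `on_frontier` with several target values (for example 0.1, 0.5, and 0.9) to check candidate inputs against different frontiers. A search tool would additionally have to generate the candidates and promote diversity among them, which this sketch does not attempt.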
