A novel approach for error modeling in a cross-layer reliability analysis of convolutional neural networks

In the future, more and more systems will adopt AI-based computation in safety-critical applications. Convolutional Neural Networks (CNNs) are one of the pillars of this AI revolution, since they can perform tasks on images that traditional computer vision algorithms cannot, such as image classification, object detection and segmentation. These tasks have a well-known application in Automatic Driving Systems (ADS), which employ, for example, real-time object detection networks to recognize street signs and obstacles in a video stream. The criticality of these systems has prompted the scientific community to study their reliability in depth. Cosmic rays entering the atmosphere are an insidious physical phenomenon that can undermine the robustness of these systems by silently causing temporary bit-flips in the memory elements of the hardware, altering the course of the computation. Graphics Processing Units (GPUs), the preferred hardware platforms for running CNNs, are very vulnerable to cosmic radiation because of their high transistor density and the lack of protection mechanisms in their chip area. This makes it necessary to perform a reliability analysis at an early stage of the design of this class of systems. In the literature, various frameworks perform reliability analysis of CNNs; one of these is CLASSES, a cross-layer methodological framework that combines platform-level fault injection with an application-level error simulator. The interface between the two abstraction levels in CLASSES consists of the error models, which are extracted from the results of the injections and used to generate errors in the simulation. Error models are a crucial part of the CLASSES methodology and need to represent the errors coming from fault injection with acceptable fidelity.
In the first version of CLASSES, the structure of the error models is quite rudimentary and can limit the fidelity of the generated errors, especially when dealing with error patterns not present in the original work. Additionally, a large part of the error modeling process is manual, which makes defining those models a time-consuming task. The goal of this work is to improve the CLASSES framework by removing these weaknesses. The first contribution of this work is the definition of a new structure for the error models, which introduces a refined representation of the spatial and domain distributions of the corrupted values emerging from the outputs of the fault injection. These changes aim to increase the fidelity of the simulation while making the models easily reusable across multiple application analyses. Another contribution is a new systematic approach to the process of defining the error models. We rationalized the extraction of the operators from the CNN under test and the planning of the fault injection experiments. We designed and implemented a software tool that visualizes the error patterns emerging from the results of the fault injection. Starting from a first empirical analysis of the visualized error patterns, the user classifies the corrupted outputs into spatial classes, which they define in code. The tool then uses the code definition of the classes to generate the error models. While still semi-automatic, the proposed workflow is a step toward the complete automation of the error model generation process. The methodological framework we propose as a whole retains the flexibility needed to operate with different CNNs and with various underlying hardware architectures.
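As an illustration of what defining spatial classes in code could look like, the following minimal Python sketch compares a fault-free ("golden") output tensor with a corrupted one and assigns the corrupted output to a spatial class. It is not the tool described in this work: the function names, the class names, the `(channels, height, width)` tensor layout and the comparison tolerance are all assumptions made for this example.

```python
from collections import Counter

import numpy as np


def corrupted_coords(golden, faulty, atol=1e-6):
    """Return the (channel, row, col) indices where faulty differs from golden.

    Assumes both tensors have the hypothetical layout (channels, height, width).
    """
    differs = ~np.isclose(golden, faulty, rtol=0.0, atol=atol, equal_nan=True)
    return np.argwhere(differs)


def classify_spatial(coords):
    """Map a set of corrupted coordinates to a hypothetical spatial class name."""
    if len(coords) == 0:
        return "masked"  # the fault did not change the output
    if len(coords) == 1:
        return "single_point"
    channels, rows, cols = (np.unique(coords[:, axis]) for axis in range(3))
    if len(channels) > 1:
        if len(rows) == 1 and len(cols) == 1:
            return "same_position_across_channels"
        return "multi_channel"
    if len(rows) == 1:
        return "same_row"
    if len(cols) == 1:
        return "same_column"
    return "single_channel_scattered"


def class_frequencies(golden, faulty_outputs):
    """Relative frequency of each spatial class over a set of injection outputs."""
    counts = Counter(
        classify_spatial(corrupted_coords(golden, faulty))
        for faulty in faulty_outputs
    )
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}
```

In this sketch, the relative frequencies returned by `class_frequencies` stand in for the spatial part of an error model; the representation actually used by the framework may differ.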
Finally, to validate the proposed methodology and to show the flexibility of the revised framework in producing various insights into the reliability of the network under test, we applied the proposed improvements by designing an injection campaign and a subsequent application analysis on YOLOv3, a state-of-the-art object detection network. In the analysis, we evaluate the vulnerability of the different operators and layers that constitute the network, and we study how different domain and spatial distributions of the errors in the feature maps affect the reliability of the network.

In the future, more and more systems will adopt computation based on artificial intelligence (AI) in safety-critical applications. Convolutional Neural Networks (CNNs) are one of the pillars of this AI-based revolution. CNNs can perform tasks, such as classifying and detecting objects in images, that traditional computer vision algorithms cannot carry out. These tasks have a well-known application in Automatic Driving Systems (ADS), which employ, for example, neural networks for real-time object detection, with the aim of identifying signs and obstacles in the video stream coming from cameras. The criticality of these systems has prompted the scientific community to study their reliability in depth. For example, cosmic rays entering the atmosphere are an insidious physical phenomenon that can threaten the robustness of these systems by silently causing bit-flips in the memory elements inside the hardware, altering the course of the computation. Graphics Processing Units (GPUs) are the most suitable hardware for running CNNs; however, they are very vulnerable to cosmic radiation because of their high transistor density and the lack of protection mechanisms within the chip area. These problems make it necessary to study the reliability of these systems in the early stages of design. The literature contains several frameworks whose goal is to perform an early reliability analysis of CNNs. One of these is CLASSES, a cross-layer methodological framework that combines platform-level fault injection with application-level error simulation. The interface between the two abstraction levels in CLASSES consists of the error models, which are extracted from the results of the injections and used to generate errors in the simulation.
Error models are a crucial part of the CLASSES methodology and therefore need to represent the errors coming from fault injection with an adequate level of fidelity. In the first version of CLASSES, the structure of the error models is rudimentary and therefore limits the fidelity of the generated errors, especially when new patterns emerge from the fault injections. Moreover, a large part of the error modeling process is manual, which makes defining the error models a time-consuming task. The goal of this work is to improve CLASSES by solving the problems just listed. The first contribution of this work is the definition of a new structure for the error models, which introduces a refined representation of the spatial and domain distributions of the corrupted values coming from the fault injections. These changes aim to increase the fidelity of the simulation with respect to the injection results while allowing the models to be reused in the analysis of other applications. Another contribution of this work is a new, systematic approach to the process of defining the error models. We rationalized the extraction of the operators from the CNN under analysis and changed the way injection campaigns are planned. We also designed and implemented a software tool that visualizes the error patterns emerging from the injection results. Starting from a first empirical analysis of the spatial distributions of the errors, the user classifies the outputs coming from the fault injections by defining the classes in code, which is then used to generate the error models.
Although this process is still not fully automatic, the proposed procedures are a step toward the complete automation of the error model generation process. The methodological framework we propose, taken as a whole, maintains a good degree of flexibility, which is needed for the framework to operate with different CNNs and with different hardware architectures. In conclusion, to validate the proposed methodology and to show the flexibility of the framework in producing various insights into the reliability of the CNN under test, we applied the proposed improvements to the design of a fault injection campaign and a subsequent application analysis on YOLOv3, a state-of-the-art object detection network. In the analysis, we studied the vulnerability of the different operators to faults, and we studied how different spatial and domain distributions of the errors can affect the reliability of the network.