A hybrid fault injection framework for image processing applications in FPGA

There is growing interest in using image processing applications in safety- and mission-critical systems, including to support control decisions. Relevant examples range from the widespread use of such applications as payloads in satellites and other spacecraft, where images are processed on board before being transmitted to the ground station, to the recent adoption of image processing and machine learning algorithms for obstacle and pedestrian detection in autonomous driving systems. These classes of applications are inherently resilient to a certain degree of noise or error because i) they process data acquired from sensors, ii) their outputs are probabilistic estimates, or iii) limited deviations from the exact output may be acceptable. This peculiarity opens new challenges and opportunities when considering the stringent reliability constraints that typically characterize mission- and safety-critical application scenarios. More precisely, the impact of a fault on the quality and usability of the final result depends heavily on where and when the upset occurs during the execution of the payload application. For example, one fault may damage the final image so severely that it is unusable, whereas after another fault the image can still be used, either by a human user or by downstream classification applications. On the other hand, assessing and enhancing the reliability of such applications becomes much more complex and requires a more accurate analysis of the effects of the faults that occur. Indeed, classical fault injection campaigns applied to the final or prototype version of the system do not give fast feedback to the design activities; instead, it becomes important to perform an early reliability evaluation in the first phases of the design flow.
For this reason, the purpose of this thesis is to define a hybrid cross-layer fault injection framework for complex image processing applications based on a pipeline of filters. The framework is capable of performing reliability analysis at various levels of abstraction: in particular, at a lower level on the system implementation, as classically done, and at a higher level on the algorithmic description of the application, implemented in software. The final goal of the framework is to help the designer understand whether an image processing application is sufficiently resilient against faults or needs to be hardened and, in the latter case, whether it suffices to harden only some parts or steps of the application pipeline, keeping overheads (power, time, area) to a minimum, or whether the entire system must be hardened. The working scenario considered in designing the hybrid fault injection framework focuses on image processing applications implemented on reconfigurable devices, in particular Field Programmable Gate Arrays (FPGAs), and on Single Event Upsets (SEUs), which are the most frequent type of fault in such devices and are caused by radiation altering the state of the circuits. The first contribution of the thesis is the design of the hybrid fault injection framework for the considered scenario, i.e. image processing applications implemented on FPGAs. The framework integrates both an architecture-level FPGA fault injector, which emulates SEUs in the programmable configuration, and an application-level error simulator, based on saboteurs that corrupt the output of a single step of the application pipeline. To focus the reliability analysis on one application step at a time, the framework allows the entire application to be run in software at application level, thus abstracting from the underlying architectural details, while the accurate fault injection is concentrated on the step under test.
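The saboteur idea described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the steps `invert` and `threshold`, the saboteur `stuck_row_saboteur`, and the function `run_pipeline` are hypothetical names chosen for illustration, and images are assumed to be small grayscale grids stored as lists of rows.

```python
from typing import Callable, List, Optional

Image = List[List[int]]          # grayscale image: rows of 0-255 pixel values
Step = Callable[[Image], Image]  # a pipeline step maps an image to an image


def invert(img: Image) -> Image:
    """Hypothetical pipeline step: invert every pixel."""
    return [[255 - p for p in row] for row in img]


def threshold(img: Image) -> Image:
    """Hypothetical pipeline step: binarise at mid-scale."""
    return [[255 if p >= 128 else 0 for p in row] for row in img]


def stuck_row_saboteur(img: Image) -> Image:
    """Hypothetical saboteur: force the first row of the image to zero."""
    return [[0] * len(img[0])] + [row[:] for row in img[1:]]


def run_pipeline(img: Image, steps: List[Step],
                 target: Optional[int] = None,
                 saboteur: Optional[Step] = None) -> Image:
    """Run every step in software; corrupt only the output of step `target`."""
    for index, step in enumerate(steps):
        img = step(img)
        if saboteur is not None and index == target:
            img = saboteur(img)  # the error then propagates to later steps
    return img


image = [[10, 200], [130, 90]]
steps = [invert, threshold]
golden = run_pipeline(image, steps)                        # fault-free run
faulty = run_pipeline(image, steps, target=0,
                      saboteur=stuck_row_saboteur)         # step 0 under test
```

Comparing `golden` with `faulty` shows how an error introduced at the output of one step propagates through the remaining steps, which is the kind of observation the application-level analysis relies on.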
The key feature is the application-level error simulation, which makes it possible to replace the system implementation and thus save a considerable amount of time in both the design and the experiment execution stages. Switching from architecture-level fault injection to the higher application-level error simulation requires an accurate definition of how faults injected in the programmable logic of the FPGA device manifest at the outputs of the individual stages of the application pipeline. Therefore, the second contribution of the thesis is a method to define error models; such models are implemented as patterns describing the corruption caused by a fault in the 2D grid of pixels of the processed image (e.g. colour changes or black bands spanning the entire image horizontally). This abstraction is integrated into the higher-level simulation environment, making it possible to replace the classical fault injection campaign with the smarter error simulation approach, which abstracts away all architectural dependencies. We employed the proposed framework in a case study on a common class of image processing algorithms widely used as steps in application pipelines, namely convolution filters. We performed an extensive architecture-level fault injection campaign on hardware implementations of various types of convolution filters, aimed at analysing the effects of faults injected in the FPGA on the filters' outputs. Based on an in-depth analysis of the results, we defined a set of error models, most of which are general to the whole class of convolution filters, while others are specific to a given kernel. These error models were then validated with a second fault injection campaign, in which the same types of error, with the same occurrence probabilities, were found. This modelling activity needs to be performed only once for each of the application's building blocks.
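As an illustration of what such pattern-based error models might look like in software, the following Python sketch implements the two example patterns mentioned above, a horizontal black band and a colour change. The function names `black_band` and `colour_shift`, their parameters, and the RGB-tuple image representation are assumptions made for this sketch, not the models actually derived in the thesis.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B), each in 0-255
Image = List[List[Pixel]]      # 2D grid of pixels, stored row by row


def black_band(img: Image, top: int, height: int) -> Image:
    """Blank `height` full-width rows starting at row `top`."""
    width = len(img[0])
    return [
        [(0, 0, 0)] * width if top <= y < top + height else list(row)
        for y, row in enumerate(img)
    ]


def colour_shift(img: Image, channel: int, delta: int) -> Image:
    """Add `delta` to one channel of every pixel, clamped to 0-255."""
    def shift(pixel: Pixel) -> Pixel:
        values = list(pixel)
        values[channel] = max(0, min(255, values[channel] + delta))
        return (values[0], values[1], values[2])
    return [[shift(p) for p in row] for row in img]


# A uniform 3x4 test image (3 columns, 4 rows), corrupted in two ways.
image = [[(100, 150, 200)] * 3 for _ in range(4)]
banded = black_band(image, top=1, height=2)
tinted = colour_shift(image, channel=0, delta=200)
```

Each function returns a new image and leaves its input untouched, so the same fault-free output can be reused to apply several error patterns in turn.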
Once the error models are integrated into the error simulator, the designer can perform reliability analysis of any application that integrates such filters, without needing to implement the hardware modules or use the FPGA fault injector. Furthermore, by shifting to error simulation, the highly time-consuming fault injection campaign can be replaced by software emulation without losing any functionality.
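A software error simulation campaign of this kind could be organised roughly as in the Python sketch below. Everything specific in it is an assumption made for illustration: the two error models, their occurrence probabilities, the mean-absolute-error metric, and the usability threshold are invented, and for brevity the error is applied directly to a fault-free output image rather than propagated through further pipeline steps.

```python
import random
from typing import Dict, List

Image = List[List[int]]  # grayscale image: rows of 0-255 pixel values


def mean_abs_error(a: Image, b: Image) -> float:
    """Average absolute per-pixel difference between two images."""
    total = sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return total / (len(a) * len(a[0]))


def black_band(img: Image, rng: random.Random) -> Image:
    """Zero one randomly chosen full-width row."""
    y = rng.randrange(len(img))
    return [[0] * len(row) if i == y else row[:] for i, row in enumerate(img)]


def single_pixel(img: Image, rng: random.Random) -> Image:
    """Invert one randomly chosen pixel."""
    out = [row[:] for row in img]
    y, x = rng.randrange(len(img)), rng.randrange(len(img[0]))
    out[y][x] = 255 - out[y][x]
    return out


# Hypothetical error models with made-up occurrence probabilities.
ERROR_MODELS = [(black_band, 0.3), (single_pixel, 0.7)]


def campaign(golden: Image, runs: int, max_error: float,
             seed: int = 0) -> Dict[str, int]:
    """Sample one error model per run and classify the corrupted output."""
    rng = random.Random(seed)  # seeded so a campaign can be repeated
    models = [model for model, _ in ERROR_MODELS]
    weights = [weight for _, weight in ERROR_MODELS]
    counts = {"usable": 0, "unusable": 0}
    for _ in range(runs):
        model = rng.choices(models, weights=weights)[0]
        corrupted = model(golden, rng)
        usable = mean_abs_error(golden, corrupted) <= max_error
        counts["usable" if usable else "unusable"] += 1
    return counts


golden = [[200] * 8 for _ in range(8)]
result = campaign(golden, runs=100, max_error=5.0)
```

Because each run is a handful of list operations rather than a reconfiguration of the device, a campaign like this can be repeated with different thresholds or probability estimates at little cost.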
