Ionizing particles in the atmosphere can strike registers and memory cells, causing Single Event Upsets (SEUs), which can temporarily compromise the correctness of the output. SEUs are potentially dangerous for critical systems, in which a fault can have unacceptable consequences. Critical systems are traditionally custom-designed and equipped with hardware redundancy to guarantee fault resilience. In addition, they are often real-time, that is, they must meet stringent timing constraints. Compared to Commercial Off-the-Shelf (COTS) solutions, the drawbacks of these ad hoc computing platforms are typically weight, power, energy, space, and cost. In this thesis, we explored the use of COTS in critical real-time systems by designing a CPU-FPGA system with an ARM CPU, which runs a real-time operating system, and an FPGA, on which the fault detector and the scheduler are synthesized. Both are implemented in a redundant configuration to increase their own fault tolerance. Moreover, moving the scheduler to the FPGA allowed us to reduce the overhead of its periodic execution on the CPU. Likewise, synthesizing the fault detector on the FPGA allowed us to run fault detection in parallel with the CPU, thereby reducing its impact on useful CPU time. Fault tolerance in applications is achieved through detection and, where needed, recovery. The fault detector on the FPGA learns the behaviour of tasks (offline and possibly online) and analyses it. Recovery is instead handled by the scheduler, through a scheduling algorithm that orchestrates re-executions while respecting the tasks' timing constraints. Fault detection showed notable results while incurring lower overhead than the state of the art.
This integrated CPU-FPGA version, which implements fault tolerance and real-time scheduling, is innovative and paves the way for future work towards the use of low-cost, fast COTS components in critical real-time applications.
HeterogeneousRTOS: A CPU-FPGA fault-tolerant real-time operating system
RATTI, FRANCESCO
2021/2022
Abstract
Ionizing particles in the atmosphere may strike registers and memory cells, causing Single Event Upsets (SEUs) that temporarily compromise output correctness. SEUs can be dangerous for critical systems, in which a fault could lead to unacceptable consequences. Critical systems are traditionally custom-designed and typically rely on hardware redundancy to guarantee fault resilience. They are also often real-time, meaning they must meet strict timing constraints. Compared to Commercial Off-the-Shelf (COTS) solutions, such custom systems typically have downsides in weight, power, energy, space, and cost. In this thesis, we explored the use of COTS components in critical real-time environments by designing a heterogeneous CPU-FPGA system. It features an ARM CPU running a modified version of a real-time operating system, and an FPGA on which the fault detector and the scheduler are synthesized in a redundant configuration to increase their fault resilience. Moving the scheduler to the FPGA increases its fault resilience and removes the overhead of periodic scheduler execution; similarly, synthesizing the fault detector on the FPGA allows fault detection to run in a fault-tolerant way without consuming CPU time. Resilience to transient faults in application tasks is achieved via fault detection followed by recovery through re-execution. The fault detector on the FPGA uses a machine learning technique to learn the behaviour of tasks (offline and possibly online) and analyses that behaviour during their execution. For fault recovery, the scheduler on the FPGA features a novel mixed-criticality scheduling algorithm that manages re-executions while ensuring that tasks' timing constraints are met. The fault detector showed notable results while incurring lower overhead than general-purpose software techniques for improving fault resilience.
To the best of our knowledge, the integrated CPU-FPGA version of the system, featuring fault tolerance and real-time scheduling, is a novel contribution that may enable the use of low-cost and fast COTS components in critical real-time environments.

File | Description | Access | Size | Format
---|---|---|---|---
2023_05_Ratti_Tesi_01.pdf | Thesis | Open Access from 18/04/2024 | 6.56 MB | Adobe PDF
2023_05_Ratti_Executive summary_02.pdf | Executive Summary | Open Access from 18/04/2024 | 469.84 kB | Adobe PDF
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/212872