Resilience against transient hardware faults is a fundamental requirement for real-time systems deployed in safety-critical domains. These faults, caused by environmental factors such as cosmic radiation or electromagnetic interference, can compromise system functionality. Software-Implemented Hardware Fault Tolerance (SIHFT) techniques are a cost-effective alternative to traditional hardware-based fault-tolerance solutions, providing fault mitigation entirely through software mechanisms. Among these techniques, checkpoint-restart enables system recovery by periodically saving the execution state and restarting from the saved state upon fault detection. While checkpoint-restart has been extensively studied in High-Performance Computing (HPC), its application as a SIHFT technique in resource-constrained embedded systems remains largely underexplored. This work presents a novel algorithm for optimal checkpoint selection that minimizes Worst-Case Execution Time (WCET) under the assumption of a limited number of faults. The proposed approach is implemented as a checkpoint-restart library integrated into the FreeRTOS real-time operating system and deployed on an STM32 microcontroller. Experimental evaluation demonstrates significant performance improvements, with WCET reductions of up to 40% on synthetic benchmarks and average-case reductions of up to 12% on selected real-world workloads, such as neural network training and image processing algorithms.
Resilience to transient hardware faults is a fundamental requirement for real-time systems used in safety-critical settings. These faults, caused by environmental factors such as cosmic radiation or electromagnetic interference, can compromise the correct operation of the system. Techniques known as SIHFT are an inexpensive alternative to traditional hardware-only solutions, mitigating faults entirely through software mechanisms. Among these techniques, checkpoint-restart recovers the system state by periodically saving the execution state and restoring it when a fault is detected. Although checkpoint-restart has been widely studied in high-performance computing (HPC), its use in resource-constrained embedded systems remains largely unexplored. This thesis presents a new algorithm for optimal checkpoint selection that minimizes the WCET under the assumption of a limited number of faults. The proposed approach has been implemented as a checkpoint-restart library integrated with the FreeRTOS real-time operating system and installed on an STM32 microcontroller. The experimental evaluation shows WCET reductions of up to 40% on synthetic benchmarks and of 12% in the average case on real benchmarks, such as neural network training and edge detection algorithms.
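The trade-off behind checkpoint placement can be illustrated with a deliberately simplified model. This is not the algorithm proposed in the thesis. The sketch below assumes a task of fixed total work split into `n` equal segments, a checkpoint of constant cost saved at the end of every segment, a constant restart cost, and at most `max_faults` faults, each striking at the worst moment (just before a checkpoint completes). All names and numeric values are hypothetical and chosen only for illustration.

```python
def wcet(work, n, ckpt_cost, restart_cost, max_faults):
    """Worst-case execution time under the simplified model described above.

    Assumptions (illustrative, not taken from the thesis):
    - the task is split into n equal segments of length work / n;
    - one checkpoint of cost ckpt_cost is saved after every segment;
    - each fault wastes one full segment plus its checkpoint, then costs
      restart_cost to roll back to the previous checkpoint.
    """
    segment = work / n
    fault_free = work + n * ckpt_cost
    per_fault = segment + ckpt_cost + restart_cost
    return fault_free + max_faults * per_fault


def best_segment_count(work, ckpt_cost, restart_cost, max_faults, n_max=1000):
    """Brute-force search for the segment count that minimizes the model's WCET."""
    return min(
        range(1, n_max + 1),
        key=lambda n: wcet(work, n, ckpt_cost, restart_cost, max_faults),
    )


if __name__ == "__main__":
    # Hypothetical parameters in arbitrary time units.
    n = best_segment_count(work=1000, ckpt_cost=10, restart_cost=5, max_faults=2)
    print(n, wcet(1000, n, 10, 5, 2))
```

With these illustrative values the search selects 14 segments: fewer checkpoints make each fault more expensive to recover from, while more checkpoints add overhead to every fault-free run. A realistic approach, such as the one the thesis describes, would also have to account for non-uniform segments and state sizes, which this sketch ignores.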
Minimizing worst-case checkpoint/restart overhead in real-time systems
Bevacqua, Matteo
2024/2025
| File | Description | Size | Format | Access |
|---|---|---|---|---|
| 2025_07_22_Executive_summary_Bevacqua.pdf | Executive summary | 776.11 kB | Adobe PDF | Authorized users only, from 02/07/2028 |
| 2025_07_22_Bevacqua.pdf | Thesis | 8.56 MB | Adobe PDF | Authorized users only, from 02/07/2028 |
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/240533