A journey towards transparent fault tolerance in embarrassingly parallel MPI applications

High-Performance Computing (HPC) is driving innovation in many fields and reaching exascale performance, but the lack of fault tolerance remains an obstacle to its growth. MPI, the most widely used communication interface in HPC, does not yet provide the guarantees needed to continue execution after a fault. Recent efforts from the MPI Forum aim to address this need for the next generation of applications, but they still require in-depth knowledge of MPI to make an application resilient to faults. Fault tolerance can be implemented with different methods, such as checkpoint/restart (C/R), but the existing frameworks still lack transparency and are not easy to integrate. This work investigates the issues that prevent MPI from offering a transparent interface for fault recovery. The journey starts from the ambitious goal of providing a completely transparent fault recovery mechanism through C/R in generic applications, hiding faults from the application and re-creating failed nodes. Due to the complexity of low-level memory management in distributed systems and the lack of MPI support in state-of-the-art transparent checkpointing frameworks, the initial transparency goal was relaxed. We propose a transparent fault recovery framework that enables MPI to automatically recover from failures of critical processes and to continue execution after non-critical failures. We build our work on top of the User-Level Failure Mitigation (ULFM) library and Legio, a resiliency library. We distinguish between critical and non-critical processes and restart only those crucial to the completion of the application, which lowers the overhead of handling failures. We tested the framework on a supercomputer, showing that the overhead is negligible compared to the cost of losing a critical rank. Finally, we discuss further evolutions of the work, which could leverage the upcoming MPI 5.0 Standard and better dynamic process management runtimes for MPI.

High-performance computing is driving innovation in several fields and reaching exascale performance, but fault tolerance remains a limiting obstacle to its growth. The most common communication method, MPI, does not yet provide reliable guarantees for continuing execution after an error. Recent efforts by the MPI Forum aim to meet this need for the next generation of applications, but they currently still require in-depth knowledge of MPI to tolerate failures in existing applications. Fault tolerance can be implemented with several methods, such as checkpoint/restart, but existing frameworks are not yet transparent to use or easy to integrate. This work explores the problems that prevent MPI from offering a transparent interface for error recovery, starting from the ambitious goal of providing a transparent recovery mechanism through C/R in parallel applications, hiding faults and re-creating failed nodes. Because of the complexity of low-level memory management in distributed systems and the lack of MPI support in the most advanced transparent checkpoint frameworks, the initial transparency goal was relaxed. We propose a recovery framework that allows MPI to handle failures of critical processes and to continue execution through non-critical faults. We build the framework on top of the User-Level Failure Mitigation (ULFM) library and Legio, a fault tolerance library. The result automatically distinguishes critical from non-critical nodes, incurring overhead only for the processes crucial to the completion of the application. We measured the overhead on a supercomputer, showing that it is negligible compared to the loss of a critical process. We then discuss evolutions of this research that could exploit the MPI 5.0 standard and improvements in dynamic process management for MPI.
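
The recovery policy summarized in the abstract (restart a failed critical process, continue without a failed non-critical one) can be illustrated with a small toy model. This is only a sketch under stated assumptions: it does not use MPI, ULFM, or Legio, and the names `Job` and `handle_failure`, as well as the choice of rank 0 as the critical rank, are hypothetical and introduced purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    """Toy model of a parallel job (hypothetical, not the Legio API)."""
    alive: set                     # ranks currently taking part in the job
    critical: set                  # ranks the application cannot complete without
    restarts: int = 0              # number of critical ranks re-created so far
    dropped: list = field(default_factory=list)  # non-critical ranks given up


def handle_failure(job: Job, rank: int) -> str:
    """Apply the policy: restart critical ranks, drop non-critical ones."""
    if rank not in job.alive:
        # The rank was already discarded earlier: nothing to do.
        return "ignored"
    if rank in job.critical:
        # A critical rank is re-created, so it stays in the live set.
        job.restarts += 1
        return "restarted"
    # A non-critical rank is removed and the job continues with the survivors.
    job.alive.discard(rank)
    job.dropped.append(rank)
    return "dropped"


if __name__ == "__main__":
    # Assumption: rank 0 collects the results and is therefore critical.
    job = Job(alive=set(range(4)), critical={0})
    for failed in (2, 0):
        print(failed, handle_failure(job, failed), sorted(job.alive))
```

In this model only critical failures pay the cost of a restart, while non-critical failures simply shrink the set of participating ranks, which mirrors the motivation given above for distinguishing the two classes.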