Porting an HPC multiphysics finite element method code to GPU through OpenACC

Modern HPC systems increasingly rely on heterogeneous CPU/GPU nodes, making performance portability a central requirement for production-grade simulation codes. This thesis addresses the GPU acceleration of the implicit Advection–Diffusion–Reaction (ADR) workflow in Alya, a large-scale multiphysics finite-element code developed at the Barcelona Supercomputing Center. The work targets the assembly stage, which is a major cost driver in many implicit runs because of its gather/compute/scatter structure on unstructured data and its sensitivity to memory bandwidth and write conflicts. A single-source porting strategy based on OpenACC is presented, designed to preserve Alya’s modular architecture and its object-oriented, pointer-rich data structures. Assembly is reorganized around packs of homogeneous elements: a coarse-grain loop over packs exposes scalable parallel work, while a fine-grain lane dimension provides the portability bridge, mapping to SIMD on CPUs and to SIMT threads on GPUs. To enable correct device execution with nested derived types, the thesis details a practical deep-copy workflow based on explicit data regions and systematic pointer attachment. Performance is evaluated on MareNostrum 5 accelerated nodes with NVIDIA H100 GPUs. Results show that the OpenACC pack parameter (VECTOR_SIZE) acts as a granularity knob: in 2D, larger packs reduce launch/runtime overheads and yield strong assembly speedups, while in 3D assembly quickly becomes bandwidth-bound, so further increases bring diminishing returns and may trigger resource-pressure effects.

Modern HPC systems increasingly adopt heterogeneous CPU/GPU nodes, making performance portability a key requirement for production simulation codes. This thesis addresses the GPU acceleration of the implicit Advection–Diffusion–Reaction (ADR) workflow in Alya, a multiphysics finite-element code developed at the Barcelona Supercomputing Center. The focus is on the assembly phase, which often dominates implicit runs because of its gather/compute/scatter structure on irregular data and its strong dependence on memory bandwidth and write conflicts. A single-source strategy based on OpenACC is proposed, designed to preserve Alya’s modular architecture and its pointer-rich, object-oriented data structures. Assembly is reorganized into packs of homogeneous elements: an outer level over packs exposes coarse-grain parallelism, while an inner lane dimension acts as a portability bridge, mapping to SIMD on CPUs and to SIMT threads on GPUs. To guarantee correctness with nested derived types, the thesis describes a deep-copy workflow based on explicit data regions and a systematic pointer-attach procedure. Performance is evaluated on MareNostrum 5 accelerated nodes with NVIDIA H100 GPUs. The results show that the OpenACC pack parameter (VECTOR_SIZE) controls granularity: in 2D, larger packs reduce launch/runtime overheads and significantly increase assembly speedup; in 3D, assembly quickly enters a bandwidth-limited regime and further increases offer marginal benefits, sometimes offset by resource-pressure effects.
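
The pack/lane organization and the deep-copy step summarized above can be illustrated with a minimal sketch. It is written in C with OpenACC directives purely for brevity and is not taken from Alya: the type mesh_t, the functions mesh_to_device, assemble_packs and mesh_from_device, the 1D two-node element, and the value chosen for VECTOR_SIZE are all hypothetical. The sketch assumes an OpenACC 2.6 or later compiler for the attach behaviour; with a compiler that ignores the directives, the same source runs serially on the CPU.

```c
#include <assert.h>
#include <stddef.h>

#define VECTOR_SIZE 4      /* lanes per pack; illustrative value, tuned in practice */
#define NODES_PER_ELEM 2   /* 1D linear elements, chosen only to keep the sketch short */

/* Hypothetical pointer-rich mesh type standing in for nested derived types. */
typedef struct {
    int n_nodes;
    int n_elems;
    int *conn;       /* element-to-node connectivity, n_elems * NODES_PER_ELEM */
    double *coords;  /* node coordinates, n_nodes */
} mesh_t;

/* Manual deep copy: copy the parent struct first, then each member array.
   Copying a member whose parent is already present also attaches the
   device pointer inside the device copy of the struct. */
void mesh_to_device(mesh_t *mesh)
{
    #pragma acc enter data copyin(mesh[0:1])
    #pragma acc enter data copyin(mesh->conn[0:mesh->n_elems * NODES_PER_ELEM])
    #pragma acc enter data copyin(mesh->coords[0:mesh->n_nodes])
    (void)mesh;  /* keeps the function non-empty when directives are ignored */
}

/* Release in reverse order: members first, then the parent struct. */
void mesh_from_device(mesh_t *mesh)
{
    #pragma acc exit data delete(mesh->conn[0:mesh->n_elems * NODES_PER_ELEM])
    #pragma acc exit data delete(mesh->coords[0:mesh->n_nodes])
    #pragma acc exit data delete(mesh[0:1])
    (void)mesh;
}

/* Accumulate a lumped contribution h/2 from every element into its two nodes.
   Outer loop: packs (coarse grain, gangs). Inner loop: lanes (fine grain, vector). */
void assemble_packs(const mesh_t *mesh, double *rhs)
{
    assert(mesh != NULL && rhs != NULL);
    const int n_elems = mesh->n_elems;
    const int n_nodes = mesh->n_nodes;
    const int n_packs = (n_elems + VECTOR_SIZE - 1) / VECTOR_SIZE;
    (void)n_nodes;  /* only used in the data clause below */

    #pragma acc parallel loop gang vector_length(VECTOR_SIZE) \
                present(mesh[0:1]) copy(rhs[0:n_nodes])
    for (int p = 0; p < n_packs; ++p) {
        #pragma acc loop vector
        for (int lane = 0; lane < VECTOR_SIZE; ++lane) {
            const int e = p * VECTOR_SIZE + lane;
            if (e < n_elems) {  /* the last pack may be only partially filled */
                /* gather */
                const int n0 = mesh->conn[e * NODES_PER_ELEM];
                const int n1 = mesh->conn[e * NODES_PER_ELEM + 1];
                /* compute */
                const double contrib = 0.5 * (mesh->coords[n1] - mesh->coords[n0]);
                /* scatter: neighbouring elements share nodes, so updates are atomic */
                #pragma acc atomic update
                rhs[n0] += contrib;
                #pragma acc atomic update
                rhs[n1] += contrib;
            }
        }
    }
}
```

A caller would invoke mesh_to_device once, call assemble_packs as often as needed, and finish with mesh_from_device. Atomic updates are only one way to resolve the write conflicts mentioned above; the sketch does not claim this is the approach used in the thesis.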