On how to accelerate iterative stencil loops: a scalable streaming-based approach

In scientific computing, and in High Performance Computing (HPC) more generally, stencil computations play a crucial role because they appear in a wide variety of application fields, ranging from the solution of Partial Differential Equations (PDEs), to computer simulation of particle interactions, to image processing and Computer Vision (CV). The computationally intensive nature of these algorithms has created the need for efficient implementations that save both execution time and energy. This, in combination with their regular structure, has motivated extensive study and many different proposed approaches, which have explored virtually every kind of computing device currently available. The work proposed in this thesis addresses Iterative Stencil Loops (ISLs), employing the Polyhedral Model (PM) as enabling technology, with the aim of accelerating them using a Field Programmable Gate Array (FPGA) as the target device. In particular, this research proposes a streaming-based microarchitecture called Streaming Stencil Time-step (SST), which, thanks to an optimal Full Buffering (FB), achieves very low usage of the available resources as well as efficient data reuse. It also proposes a technique, named SSTs queuing, that increases throughput by a pseudo-linear factor: it exploits the characteristics of the proposed microarchitecture by placing replicas of it in cascade, which, thanks to the streaming nature of the SSTs, enables pipelined execution within the queue. The methodology has been tested with some significant benchmarks on a Virtex-7 using the Xilinx Vivado suite.
Results show that the efficient usage of on-chip memory resources realized by an SST makes it possible to handle problem sizes that could not otherwise be implemented by synthesizing the original code directly via High Level Synthesis (HLS), and that the scalability provided by SSTs queuing ensures a pseudo-linear increase in throughput while keeping the bandwidth constant.
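
The two ideas summarized above, a streaming time-step with a small reuse buffer and a cascade of such time-steps, can be illustrated with a short software sketch. It is not the hardware design of the thesis: the 5-point Jacobi stencil, the 0.2 coefficient, the fixed-boundary handling, and the names `sst`, `sst_queue`, and `reference_isl` are all assumptions chosen for illustration. Each `sst` stage consumes the grid as a row-major stream and keeps only the last `2 * width + 1` values, which is the span between the farthest-apart neighbours of a cell; chaining the stages lets a later time-step start before the earlier one has finished.

```python
from collections import deque


def sst(stream, height, width):
    """One streaming time-step of a 5-point stencil (illustrative sketch).

    Consumes cells in row-major order and yields updated cells in the same
    order, buffering only the last 2*width + 1 inputs.
    """
    window = deque(maxlen=2 * width + 1)
    consumed = 0

    def cell(k):
        # The window holds inputs with indices consumed-len(window) .. consumed-1.
        return window[k - (consumed - len(window))]

    def update(k):
        i, j = divmod(k, width)
        if i == 0 or i == height - 1 or j == 0 or j == width - 1:
            return cell(k)  # boundary cells are passed through unchanged
        return 0.2 * (cell(k) + cell(k - width) + cell(k + width)
                      + cell(k - 1) + cell(k + 1))

    for value in stream:
        window.append(value)
        consumed += 1
        k = consumed - 1 - width  # cell whose south neighbour just arrived
        if k >= 0:
            yield update(k)
    # Flush the last row: these are boundary cells with no south neighbour.
    for k in range(max(consumed - width, 0), consumed):
        yield update(k)


def sst_queue(grid, steps):
    """Cascade `steps` streaming time-steps; each stage feeds the next."""
    height, width = len(grid), len(grid[0])
    stream = (value for row in grid for value in row)
    for _ in range(steps):
        stream = sst(stream, height, width)
    flat = list(stream)
    return [flat[r * width:(r + 1) * width] for r in range(height)]


def reference_isl(grid, steps):
    """Plain iterative stencil loop over full grids, used for comparison."""
    height, width = len(grid), len(grid[0])
    cur = [row[:] for row in grid]
    for _ in range(steps):
        nxt = [row[:] for row in cur]
        for i in range(1, height - 1):
            for j in range(1, width - 1):
                nxt[i][j] = 0.2 * (cur[i][j] + cur[i - 1][j] + cur[i + 1][j]
                                   + cur[i][j - 1] + cur[i][j + 1])
        cur = nxt
    return cur


if __name__ == "__main__":
    demo = [[float((3 * i + 7 * j) % 11) for j in range(6)] for i in range(5)]
    print(sst_queue(demo, 4) == reference_isl(demo, 4))
```

In this sketch the original grid is read once and the final grid is written once regardless of `steps`, which mirrors, in software terms only, the constant-bandwidth property claimed for the queue; the intermediate time-steps exist only inside the per-stage windows.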

In the broad landscape of computational science and of High Performance Computing (HPC) in general, stencil computations play a fundamental role, as they appear systematically in a plethora of application fields, ranging from the solution of partial differential equations, to the simulation of particle interactions, to image processing and Computer Vision (CV), and more. Given their computationally heavy nature, the need for efficient implementations has emerged over time, with the goal of reducing both execution time and energy consumption. This, together with their regular structure, has motivated extensive study and a variety of proposed approaches, in which practically every currently available computing device has been explored. The work proposed in this thesis focuses on the implementation of stencil codes, referred to as Iterative Stencil Loops (ISLs), using the Polyhedral Model (PM) as enabling technology, with the goal of accelerating them on an FPGA. In particular, this research proposes a streaming microarchitecture called Streaming Stencil Time-step (SST), which, by realizing an optimal Full Buffering (FB), achieves low usage of the available resources together with effective data reuse; and a technique, called SSTs queuing, that increases throughput by a pseudo-linear factor and consists in suitably exploiting the characteristics of the proposed microarchitecture by connecting replicas of it in cascade, enabling, thanks to the streaming nature of the SSTs, pipelined execution within the queue. The methodology has been tested with some significant benchmarks on a Virtex-7 using the Xilinx Vivado suite.
The results show how the efficient use of on-chip memory resources achieved by an SST makes it possible to handle problems whose size would not allow their implementation when the original code is synthesized directly via High Level Synthesis (HLS), as well as how the scalability provided by SSTs queuing guarantees a pseudo-linear increase in throughput while keeping the bandwidth constant.