The exaFPGA base system. A streaming multi-FPGA system for high performance computing

The energy demand of the overall ICT industry amounts to about 5% of the world’s total energy consumption, and it is still growing considerably. High Performance Computing (HPC) systems are responsible for a large portion of the energy demanded by ICT. Indeed, today’s heterogeneous HPC architectures provide great computing power at the cost of enormous energy consumption, which is due both to the Hardware (HW) accelerators, i.e. Graphics Processing Units (GPUs) and co-processors, and to other components that do not perform heavy computations, such as the cooling infrastructure, the memory subsystem and the interconnections. Next-generation supercomputing platforms will be required to deliver exascale performance, thus increasing both the number of computations performed and the overall power consumption. However, the 20 MW power budget set by the U.S. Department of Energy for future HPC platforms, in order to limit the total cost of the systems, forces a reduction of at least an order of magnitude in the power consumption of currently available computing technologies. The work proposed in this thesis project addresses these issues by employing Field Programmable Gate Arrays (FPGAs) as the computing elements of next-generation HPC systems, thus exploiting their intrinsic power efficiency. Moreover, it provides the specifications of a Cluster Node (CN) prototype for the exaFPGA Infrastructure by implementing a high-throughput multi-FPGA pipeline. The performance of the proposed architecture scales linearly with the length of the pipeline and with the number of computing elements implemented on each FPGA. In addition, this work enables the use of the Peripheral Component Interconnect Express (PCIe) interface and of the Aurora serial link on the Xilinx 7-series FPGAs, thus simplifying the development of a working multi-FPGA HPC system.
A first experimental version of the proposed architecture has been implemented with two Xilinx VC707 boards, and five benchmarks have been run by employing various types of Streaming Stencil Time-steps (SSTs). The results show that the high scalability of the architecture is confirmed by the pseudo-linear increase of both throughput and power efficiency. Moreover, the low resource utilization makes it possible to reserve a considerable amount of area for the computing logic and thus to implement long SST queues, which ensure a pseudo-linear increase in throughput at constant bandwidth.
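
The idea of a queue of SSTs that raises throughput at constant bandwidth can be illustrated in software. The following is a minimal sketch and not the architecture implemented in this work: it assumes a one-dimensional 3-point averaging stencil with fixed boundary cells, and the names `sst_stage`, `sst_chain` and `reference` are hypothetical. Each stage consumes a stream and emits a stream of the same length while buffering only a small window, so chaining `depth` stages applies `depth` time-steps in a single pass over the data.

```python
def sst_stage(stream):
    """One hypothetical streaming stencil time-step (1D, 3-point average).

    Boundary cells pass through unchanged. Only the two most recent
    input values are buffered, so the stage never stores the whole grid.
    """
    it = iter(stream)
    try:
        left = next(it)
    except StopIteration:
        return
    yield left  # left boundary cell, unchanged
    try:
        center = next(it)
    except StopIteration:
        return
    for right in it:
        # Update the cell at 'center' from its old neighbours.
        yield (left + center + right) / 3.0
        left, center = center, right
    yield center  # right boundary cell, unchanged


def sst_chain(stream, depth):
    """Connect 'depth' stages back to back: one pass, 'depth' time-steps."""
    for _ in range(depth):
        stream = sst_stage(stream)
    return stream


def reference(grid, steps):
    """Non-streaming reference: sweep the whole grid once per time-step."""
    grid = list(grid)
    for _ in range(steps):
        new = grid[:]
        for i in range(1, len(grid) - 1):
            new[i] = (grid[i - 1] + grid[i] + grid[i + 1]) / 3.0
        grid = new
    return grid
```

For instance, `list(sst_chain([0, 0, 9, 0, 0], 2))` evaluates to `[0, 2.0, 3.0, 2.0, 0]`, the same result as `reference([0, 0, 9, 0, 0], 2)`. The stream entering and leaving the chain has the same length for any `depth`, which is the software analogue of doing more computation per transferred element without increasing the required bandwidth.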

The energy demand of the information technology sector amounts overall to 5% of all the energy consumed in the world, and it is continuously growing. High Performance Computing (HPC) systems are responsible for consuming a good part of the energy resources used by the sector. Indeed, today’s supercomputing architectures provide great computing power at the cost of an enormous expenditure of energy, due both to the Hardware (HW) accelerators included in modern HPC systems, namely Graphics Processing Units (GPUs) and co-processors, and to other devices that do not perform intensive computations, such as the cooling plant, the memory system and the interconnections. The next generation of supercomputing platforms will be called upon to deliver performance on the order of a billion billion floating-point operations per second. There will therefore be an increase in the number of computations performed and in the power used. However, the 20 MW limit imposed by the U.S. Department of Energy on the power consumed by the supercomputing systems of the near future requires a reduction of at least one order of magnitude in the power consumed by current technologies. The work proposed in this thesis project tackles these problems by employing Field Programmable Gate Arrays (FPGAs) as the computing elements of future supercomputing systems, taking advantage of their intrinsic energy efficiency. It also provides the specification of a Cluster Node (CN) for the infrastructure of the exaFPGA project, by realizing a high-performance multi-FPGA pipeline. The performance of the proposed architecture scales linearly with the length of the pipeline and with the number of computing elements implemented on each FPGA.
Furthermore, this work enables the use of the Peripheral Component Interconnect Express (PCIe) interface and of the Aurora serial connection on the 7-series FPGAs produced by Xilinx, considerably simplifying the development of a multi-FPGA HPC system. A first experimental version of the proposed architecture has been realized with two Xilinx VC707 boards, and five test sessions have been run using different types of Streaming Stencil Time-step (SST). The results obtained show that the notable scalability of the architecture is confirmed by the pseudo-linear increase of the throughput and of the energy efficiency. Moreover, the reduced resource utilization makes it possible to devote a considerable amount of area to the computing elements, and hence to build long chains of SSTs, which increase the throughput while keeping the bandwidth constant.