Evaluating OpenCL programming practices for FPGAs: a case study on symmetric block ciphers

The modern computing landscape is showing clear signs of a shift from the traditional CPU-based model towards heterogeneous architectures that are both heavily parallel and energy efficient; the recent introduction of Field-Programmable Gate Array (FPGA) devices capable of running OpenCL applications is an attempt to fill this position. However, the platform-independent nature of OpenCL makes it necessary to study and develop special optimization techniques, which vary widely depending on the target device. Compared with older OpenCL-capable architectures, such as GPUs, the academic effort to explore useful OpenCL programming practices for FPGAs is still in its early stages. In this work, we study the design and implementation of a series of symmetric block ciphers, showing how different design choices in the various components of the OpenCL application influence the performance of the corresponding implementation. The practices studied encompass modifications that reduce computation time, improve memory usage and make memory transfers more efficient. Performance is gauged using metrics such as throughput, area usage, the time and memory consumed during compilation, and various indicators obtained via static analysis. In the course of our experiments, we verified that loop unrolling (both automatic and manual) positively affects the measured performance; optimizations that make use of special OpenCL memory classes, such as local and private memory, also proved very useful. We observed that using these techniques incurs a cost in terms of FPGA resource consumption. We verified that, in the particular case of symmetric block ciphers, there is no need to adopt special patterns in order to access global memory objects efficiently. Finally, we confronted the limitations imposed by the transmission channel between the main system and the OpenCL accelerator device, and we measured how a double-buffering scheme makes it possible to work within those limitations, achieving efficient channel usage at no extra cost in terms of FPGA area. Using the data gathered in the final benchmark of our block cipher implementations, we extrapolated a series of speculative throughput upper bounds, thus getting a glimpse of the potential performance on a system not affected by the channel limitations described above.
