Automated acceleration of dataflow-oriented C applications on FPGA-based systems

The end of Dennard scaling over the last two decades means that computing systems can no longer achieve exponential performance improvements through higher clock frequencies and transistor density, because of the power wall. Heterogeneous computing systems address this issue by incorporating specialized hardware to achieve better energy efficiency and performance. In this context, Field Programmable Gate Arrays (FPGAs) have steadily grown in popularity as hardware accelerators, although the greatest obstacle to their mainstream adoption remains the high engineering cost of developing FPGA-based applications. Despite the remarkable improvements in the effectiveness of third-generation High-Level Synthesis (HLS) tools, they still require domain-specific knowledge and expertise to be used effectively. This thesis proposes a methodology and a tool that further increase the accessibility of HLS technology by providing a high-level language frontend for the design of dataflow applications on FPGAs. This framework allows software developers to write C code without focusing on FPGA-specific optimizations or concepts related to the dataflow model. The tool leverages the LLVM compiler framework to apply dataflow-specific code transformations and FPGA-related optimizations, and outputs optimized code ready to be synthesized by state-of-the-art FPGA synthesis tools. A performance model tailored to dataflow computations provides accurate performance estimates before synthesis for different combinations of the available optimizations. An ILP formulation of the optimization problem is then used to select the set of optimizations that maximizes throughput while respecting the FPGA's resource constraints. To validate this approach, we tested the tool on several unoptimized algorithms written in C, targeting MaxCompiler as the backend dataflow synthesis tool.
We compared the performance of these automatically optimized designs with that of their hand-optimized counterparts and obtained speedups ranging from 0.5x (a slowdown) to 1.34x, depending on the benchmark. In terms of productivity, our automated optimization methodology obtains these results in about a day of work by a software developer, as opposed to the several weeks of optimization by expert FPGA developers required to produce the hand-optimized designs. These results show that our methodology makes it possible to optimize the original code and transform it into dataflow code optimized for FPGA synthesis with significantly reduced development effort.

The end of Dennard scaling over the last twenty years has meant that modern microprocessors can no longer obtain exponential performance gains through higher clock frequencies and greater transistor density. Heterogeneous computing systems address this problem by incorporating specialized hardware to improve performance and energy efficiency. In this context, Field Programmable Gate Arrays (FPGAs) are increasingly used as hardware accelerators, although the main obstacle to wider adoption of this technology remains its prohibitive development cost. Despite the notable improvements in third-generation High Level Synthesis (HLS) tools, these still require experience and domain-specific knowledge to be used effectively. The goal of this thesis is to propose a methodology and a tool that improve the accessibility of HLS technology by providing a high-level language frontend for the design of dataflow applications on FPGAs. This framework allows software developers to write C code without having to deal with FPGA-specific optimizations or concepts related to the dataflow model. The tool leverages the LLVM compiler framework to apply transformations specific to dataflow computations and optimizations related to the target architecture, and outputs optimized code ready to be synthesized on FPGA by dedicated commercial tools. A performance model specific to dataflow computations makes it possible to obtain accurate resource estimates before synthesis for different combinations of optimizations. An ILP formulation is used to solve the resulting optimization problem, maximizing throughput while respecting the FPGA's hardware resource limits.
To validate our approach, we tested the tool on several unoptimized codes written in C and chose MaxCompiler as the backend tool for synthesizing the dataflow design. We compared the performance of the designs generated by our tool with manually optimized designs from the state of the art, obtaining speedups ranging from 0.5x to 1.34x depending on the benchmark considered. In terms of productivity, the proposed automated optimization methodology requires about one day of work by a software developer to produce the reported results, compared with the weeks of optimization work by expert FPGA developers required to produce the manually optimized designs. These results show that the proposed methodology makes it possible to optimize the input code and transform it into dataflow code optimized for FPGA synthesis, considerably reducing the development effort.