Modern high-level synthesis: improving productivity with a multi-level approach

High-Level Synthesis (HLS) tools simplify the design of hardware accelerators by automatically generating Verilog/VHDL code starting from a general-purpose software programming language, usually C/C++. They include a wide range of optimization techniques in the process, most of them performed on a low-level intermediate representation (IR) of the code. Because of the mismatch between the requirements of hardware descriptions and the characteristics of input languages, HLS tools often rely on users to add specific directives (pragmas) that augment the input specification to guide the generation of optimized hardware. A good result thus still requires hardware design knowledge and non-trivial design space exploration, which might be an obstacle for domain scientists seeking to accelerate applications written, for example, in Python-based programming frameworks. This thesis proposes a modern approach based on multi-level compiler technologies to bridge the gap between HLS and high-level frameworks, and to use domain-specific abstractions to solve domain-specific problems. The key enabling technology is the Multi-Level Intermediate Representation (MLIR), a framework that supports building reusable compiler infrastructure inspired by (and part of) the LLVM project. The proposed approach uses MLIR to introduce new optimizations at appropriate levels of abstraction outside the HLS tool while still relying on years of HLS research in the low-level hardware generation steps; users and developers of HLS tools can thus increase their productivity, obtain accelerators with higher performance, and not be limited by the features of a specific (possibly closed-source) backend. The presented tools and techniques were designed, implemented, and tested to synthesize machine learning algorithms, but they are broadly applicable to any input specification written in a language that has a translation to MLIR.
Generated accelerators can be deployed on Field Programmable Gate Arrays or Application-Specific Integrated Circuits, and they can reach ~10-100 GFLOPS/W efficiency without any manual optimization of the code.

High-Level Synthesis (HLS) tools simplify the design of hardware accelerators by automatically generating Verilog/VHDL code from code written in software programming languages, usually C/C++. During this process they apply a wide range of optimizations, most of which operate on an intermediate representation at a low level of abstraction. Because of the gap between the requirements of a hardware description and the characteristics of the input languages, HLS tools often rely on directives (pragmas) that users insert in the code to guide the generation of optimized accelerators. As a result, a good outcome depends on hardware design knowledge and non-trivial exploration of a large optimization space, which can be an obstacle for domain experts who want to accelerate, for example, applications written in Python frameworks. This thesis proposes a modern approach based on multi-level compiler technologies to bridge the gap between HLS and high-level frameworks, and to solve domain-specific problems with specialized abstractions. The key enabling technology is MLIR, a framework that supports building compiler infrastructure within the LLVM project. The proposed approach uses MLIR to introduce new optimizations at appropriate levels of abstraction, outside the HLS tool, and builds on years of HLS research in low-level hardware generation. In this way, users and developers of HLS tools can be more productive, obtain accelerators with better performance, and avoid being limited by the features of a single (typically closed-source) backend.
The techniques and tools used were designed, implemented, and tested to synthesize machine learning algorithms, but they are broadly applicable to any application written in a language that can be translated to MLIR. The generated accelerators can be implemented on FPGAs or ASICs, and they reach an efficiency of ~10-100 GFLOPS/W without any manual optimization of the code.