An automated design framework for FPGA-based hardware accelerators of convolutional neural networks

Convolutional Neural Networks (CNNs) are a particular type of Artificial Neural Network (ANN) inspired by biological processes in the primary visual cortex of animals, and they represent the state of the art in image recognition and classification. CNNs and other Deep Learning algorithms have been extensively adopted in contexts such as big data analysis and smart embedded systems, providing customized technologies through cloud-based services and personalized devices. In these applications, the huge amount of data to be processed and the power constraints call for techniques that yield fast and energy-efficient solutions. In particular, the dataflow pattern of CNN algorithms makes them highly suitable for hardware acceleration. Indeed, many hardware accelerators have been proposed based on Graphics Processing Units (GPUs), Field-Programmable Gate Arrays (FPGAs), and Application-Specific Integrated Circuits (ASICs). Among them, FPGAs offer a good tradeoff between flexibility, performance, and power consumption. However, designing and implementing a CNN accelerator on such devices can be complex and time consuming, especially for developers who are not experienced in hardware design. For these reasons, this thesis proposes a framework that automatically generates a hardware implementation of CNNs on FPGAs through High Level Synthesis (HLS) tools. The workflow of the framework starts from a high-level description of the network, integrating TensorFlow for training and an internally developed C++ library for the final implementation.

Convolutional Neural Networks (CNNs) are a particular type of Artificial Neural Network whose operation is inspired by cells found in the visual cortex of animals, and they are currently the best available solution for image recognition and classification. Today, CNNs and other algorithms from the field of Deep Learning are widely used in contexts such as big data analysis and smart embedded systems, providing personalized technologies through cloud-based services and devices such as smartphones and smart watches. In this kind of application, the enormous volume of data to be processed and the constraints on energy consumption make it crucial to find solutions that are both fast and energy efficient. In particular, the specific computation flow of a CNN makes this type of algorithm extremely well suited to acceleration on dedicated hardware devices. Indeed, many hardware accelerators based on Graphics Processing Units (GPUs), Field-Programmable Gate Arrays (FPGAs), and Application-Specific Integrated Circuits (ASICs) have been proposed for this purpose. Among these, FPGAs provide a good compromise between flexibility, performance, and energy consumption. However, designing and implementing a CNN accelerator on this type of device can be both complex and costly in terms of development time, especially for developers with little hardware design experience. For these reasons, this thesis proposes a framework that automatically generates a hardware implementation of a CNN on an FPGA through High Level Synthesis (HLS) tools.
The workflow of the framework starts from a high-level description of the network, integrating with the TensorFlow Machine Learning framework for training and with an internally developed C++ library for the final implementation.