A distributed FPGA-based embedded system for face detection via deep neural network

Within the broad field of artificial intelligence, machine learning techniques are increasingly used in areas such as speech recognition, data mining, and computer vision. Among these techniques, Deep Neural Networks (DNNs), which are biologically inspired, are among the most effective and can reach near-human accuracy. Their classification and recognition accuracy, together with the robustness they have demonstrated across heterogeneous classes of problems, makes DNNs a state-of-the-art methodology. Face detection is one of the most studied topics because it is the first stage of many modern human-computer and human-robot interaction systems, in applications such as biomedical analysis, system security, and autonomous driving. Such systems usually rely on embedded processors because of their low power consumption. However, even a relatively small DNN with fewer than ten layers requires a large computational effort and, consequently, high power and memory budgets. These requirements become a major constraint in embedded scenarios. Among the available silicon alternatives, Field Programmable Gate Arrays (FPGAs) appear to be an optimal solution for DNNs: they consume less power than Graphics Processing Units (GPUs) and, unlike Application Specific Integrated Circuits (ASICs), they can be reconfigured to accommodate any change in the model, however small. To shrink a DNN model, Model Quantization, also known as Model Discretization, is a widely used technique in the literature that enables faster inference through reduced precision. Moving from a floating-point data type to unsigned 8-bit integers yields a 4X reduction in memory storage and memory accesses.
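The 4X figure can be illustrated with a minimal sketch, assuming the floating-point baseline is 32-bit and using a simple per-tensor affine (scale and zero-point) scheme. The scheme, the function names `quantize_uint8` and `dequantize`, and the sample weights are illustrative assumptions, not the quantization method used in this work.

```python
from array import array


def quantize_uint8(values):
    """Affine-quantize floats to unsigned 8-bit codes (illustrative scheme).

    Returns (codes, scale, zero_point) such that
    value ~= scale * (code - zero_point).
    """
    lo, hi = min(values), max(values)
    lo, hi = min(lo, 0.0), max(hi, 0.0)   # make sure 0.0 lies inside the range
    scale = (hi - lo) / 255.0 or 1.0      # fall back to 1.0 for all-zero input
    zero_point = round(-lo / scale)
    codes = array(
        "B",
        (min(255, max(0, round(v / scale) + zero_point)) for v in values),
    )
    return codes, scale, zero_point


def dequantize(codes, scale, zero_point):
    """Map 8-bit codes back to approximate floating-point values."""
    return [scale * (c - zero_point) for c in codes]


# Hypothetical weights stored as 32-bit floats (4 bytes each).
weights = array("f", [-1.0, -0.25, 0.0, 0.5, 1.0])
codes, scale, zero_point = quantize_uint8(list(weights))
restored = dequantize(codes, scale, zero_point)

# Storage ratio between the float32 and the uint8 representation.
ratio = (len(weights) * weights.itemsize) / (len(codes) * codes.itemsize)
# Worst-case reconstruction error introduced by the reduced precision.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

With these sample values, `ratio` is 4.0 and the reconstruction error stays within one quantization step (`scale`), which is the kind of small perturbation a DNN is expected to tolerate.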
The complexity of a given model can be reduced by lowering its precision requirements, with almost no accuracy loss thanks to the robustness of DNNs. In general, DNNs cope very well with high levels of noise in their inputs, and low-precision computation simply acts as another source of noise. Since a single embedded system cannot reasonably host a DNN while maintaining adequate performance, we argue that a pipelined, distributed FPGA-based system offers the right balance between power consumption and performance for this class of problems. To support this claim, we built a distributed system organized as a pipeline, in which different portions of a DNN are mapped onto the individual stages. The convolution itself is performed by a highly customizable kernel which, by applying different levels of parallelism, fits more computation onto the target FPGA. Thanks to these levels of parallelism and to the pipeline structure, our distributed system achieves a speed-up of around 60X over a fully optimized software implementation while also reducing power consumption.
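The idea of mapping portions of a DNN onto pipeline stages can be sketched as a partitioning problem: split the layers into contiguous groups, one per FPGA, so that the slowest stage is as fast as possible, since that stage bounds the steady-state throughput. The sketch below is only an illustration under stated assumptions. The abstract does not say how layers are assigned to stages, the function name `partition_layers` is hypothetical, and the per-layer costs are made-up integers in arbitrary units.

```python
def partition_layers(costs, stages):
    """Split per-layer costs into at most `stages` contiguous groups,
    minimizing the cost of the slowest group (the pipeline bottleneck)."""

    def groups_needed(limit):
        # Greedily count how many groups are needed if no group may exceed `limit`.
        count, load = 1, 0
        for c in costs:
            if load + c > limit:
                count, load = count + 1, 0
            load += c
        return count

    # Binary search on the bottleneck cost.
    lo, hi = max(costs), sum(costs)
    while lo < hi:
        mid = (lo + hi) // 2
        if groups_needed(mid) <= stages:
            hi = mid
        else:
            lo = mid + 1

    # Rebuild the layer-to-stage assignment for the optimal bottleneck `lo`.
    plan, current, load = [], [], 0
    for index, c in enumerate(costs):
        if current and load + c > lo:
            plan.append(current)
            current, load = [], 0
        current.append(index)
        load += c
    plan.append(current)
    return plan, lo


# Hypothetical per-layer costs (arbitrary units) and a 3-node pipeline.
layer_costs = [4, 2, 3, 1, 5, 3]
plan, bottleneck = partition_layers(layer_costs, stages=3)

# Ideal steady-state gain of the pipeline over running all layers on one node,
# ignoring communication between nodes.
ideal_gain = sum(layer_costs) / bottleneck
```

For these illustrative costs the layers are grouped as `[[0, 1], [2, 3], [4, 5]]` with a bottleneck of 8 units, giving an ideal gain of 2.25 from pipelining alone. Any further speed-up would have to come from parallelism inside each stage, which this sketch does not model.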

In the field of Artificial Intelligence (AI), a growing number of application domains, from industrial automation to the biomedical sector, require Machine Learning (ML) techniques. Among these techniques, Deep Neural Networks (DNNs) are among the most widely used, since they reach high levels of reliability and accuracy across diverse and heterogeneous classification and recognition problems. For these reasons, DNNs are today's state of the art. One of the main research topics, in academia and industry alike, is the use of these models for face detection, which is the first step of many modern human-machine interaction systems, such as system security, biomedical analysis, and autonomous driving. Since low power consumption is one of the main constraints of such systems, embedded processors become necessary; they offer limited computational resources but a low energy profile. However, even the simplest DNN-based models require a high computational effort, and therefore high power consumption as well as a large amount of memory. Consequently, among the available architectures, Field Programmable Gate Arrays (FPGAs) appear to offer the optimal balance between performance and energy efficiency. These devices provide flexibility compared with an Application Specific Integrated Circuit (ASIC) solution, because they are reconfigurable, and at the same time competitive performance in terms of energy efficiency compared with solutions based on Graphics Processing Units (GPUs). The size of these models must be reduced before they can be ported to embedded systems.
To this end, the state of the art often relies on a process known as quantization, or discretization, which uses reduced-precision computation to cut both the required memory and the number of memory accesses by a factor of 4X, as well as the execution time, while preserving a comparable level of accuracy and reliability. The main characteristic of these models is, in fact, their robustness even under high noise; quantization is therefore treated as an additional source of noise that the network is able to filter out. In our view, even with quantization, a system based on a single embedded FPGA would not guarantee an adequate level of performance. To validate this intuition, we built a distributed system consisting of a cluster of FPGAs arranged in a pipeline, in which each node implements a portion of the DNN. Thanks to this structure and to a set of IP cores that exploit different levels of parallelism, we obtained a 60X speed-up over a highly optimized software implementation, together with better energy efficiency.