A HW/SW co-design framework for TinyML acceleration

Deep Neural Networks (DNNs) are achieving impressive results on Artificial Intelligence (AI) problems, including image and speech recognition, self-driving cars, task management, optimization problems, and more. These results come at the cost of high computational complexity, which systems pay for in memory allocation and power consumption. For this reason, dedicated hardware accelerators have been designed and developed to execute Neural Network (NN) models efficiently. At the same time, research is also addressing the challenge of bringing data analysis closer to the sources that produce the data; in IoT scenarios, for example, analysis is usually offloaded to the cloud even though the data are collected by devices at the edge. TinyML is the field that addresses this challenge. Many TinyML frameworks have emerged to make the deployment and execution of NN models feasible on resource-constrained devices such as Micro-Controller Units (MCUs). However, such frameworks implement only basic NN operations, which are not even optimized for the target architectures. This thesis project takes shape in this context, with the ambition of combining the workflow of a TinyML framework with that of a hardware accelerator. First, we compared two open-source TinyML frameworks, MicroTVM and TensorFlow Lite for Microcontrollers (TFLM), and adopted TFLM as the baseline framework. We then developed what we call a Hybrid framework. It is used to deploy a reduced version of VGG-16 on an embedded processor, while the execution of two of the most computationally demanding convolutional layers (in terms of MACs) is offloaded to the STMicroelectronics proprietary Neural Processing Unit (NPU). Moreover, we created an exploration SW module that automatically compares the latencies of 162 different single-layer convolutional models executed with our proposed hybrid approach and with the standard one. This module can also be used to compare a larger number of layers or specific convolutional layers. Finally, we analyzed the results obtained. For the application, we measured an average latency decrease of 34.9%, considering our hardware architectures and the layers we accelerated. The exploration module, instead, showed that in 80% of the considered cases the latency decrease is greater than 90%.
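
The two quantities the abstract relies on, the MAC count used to rank convolutional layers and the percentage decrease in latency, can be illustrated with a minimal Python sketch. It is not the thesis code: the `ConvLayer` container, the function names, and all shapes and latencies below are hypothetical, chosen only to show the arithmetic. The MAC formula is the standard one for a dense 2D convolution (one multiply-accumulate per kernel tap, per input channel, per output element).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConvLayer:
    """Shape of a dense 2D convolution (hypothetical container, not a TFLM type)."""
    name: str
    out_h: int   # output feature-map height
    out_w: int   # output feature-map width
    in_ch: int   # input channels
    out_ch: int  # output channels (number of filters)
    k_h: int     # kernel height
    k_w: int     # kernel width


def conv_macs(layer: ConvLayer) -> int:
    """Multiply-accumulate operations needed for one inference of the layer."""
    return (layer.out_h * layer.out_w * layer.out_ch
            * layer.k_h * layer.k_w * layer.in_ch)


def top_layers_by_macs(layers, n=2):
    """Return the n layers with the highest MAC count (offload candidates)."""
    return sorted(layers, key=conv_macs, reverse=True)[:n]


def percentage_decrease(baseline_latency: float, hybrid_latency: float) -> float:
    """Latency reduction of the hybrid run relative to the baseline, in percent."""
    if baseline_latency <= 0:
        raise ValueError("baseline latency must be positive")
    return 100.0 * (baseline_latency - hybrid_latency) / baseline_latency


if __name__ == "__main__":
    # Illustrative shapes only; they are not the layers of the reduced VGG-16.
    layers = [
        ConvLayer("conv_a", 32, 32, 3, 16, 3, 3),
        ConvLayer("conv_b", 32, 32, 16, 16, 3, 3),
        ConvLayer("conv_c", 16, 16, 16, 32, 3, 3),
    ]
    for layer in top_layers_by_macs(layers):
        print(layer.name, conv_macs(layer))

    # Illustrative latencies in milliseconds, not measured values.
    print(percentage_decrease(100.0, 65.1))
```

Ranking by MACs is only a proxy for latency; memory traffic and data movement to and from an accelerator can change which layers benefit most from offloading.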
