On how to facilitate hardware acceleration of machine learning for non-experts in hardware design over the Edge, the Fog, and the Cloud

Data scientists cannot ignore the irruption of Machine Learning (ML) in their research field. Indeed, models that learn from data examples have been gaining increasing attention, fueled by the high Volume, Veracity, Velocity, and Variability of Big Data, because "More Data Beats a Cleverer Algorithm" in ML. Nevertheless, ML models have a poor reputation for being neither explainable nor efficient. Fortunately, balancing these benefits and drawbacks in ML modeling has recently become possible, thanks to Information Technology's (IT) renaissance of specialization, with Domain Specific Architectures (DSAs) increasingly replacing general-purpose software solutions based on Central Processing Units (CPUs) or general-purpose Graphics Processing Units (gpGPUs). Unfortunately, designing, developing, programming, and deploying such DSAs on Application-Specific Integrated Circuit (ASIC) accelerators or, even worse, on the Programmable Logic (PL) of Field Programmable Gate Arrays (FPGAs) requires hardware design skills that rarely intersect with the area of expertise of Data Scientists and ML specialists. The overarching research question addressed by the Ph.D. research project presented in this dissertation tackles exactly this trade-off and investigates "how to facilitate ML experts in the exploitation of the benefits of DSAs for hardware acceleration, without requiring them to master hardware design and development." Three main research themes arise from this question: methods and tools for the automatic translation of ML models into hardware accelerators; methods and tools for easing access to the technological platforms that enable hardware acceleration; and programming models that let ML experts remain focused on their area of expertise, i.e., the Data, while exploiting hardware acceleration over distributed infrastructures.
The first theme is tackled by Entree, the first toolchain in the State of the Art for deploying the inference of large Decision Tree (DT) ensembles, an explainable class of ML models, on embedded devices mounting an FPGA. It automatically converts a model trained with scikit-learn into a hardware accelerator. Then, it supports the developer in fitting such accelerated models on embedded FPGAs even if they would statically exceed the resources available on the onboard PL. This optimization is achieved by employing a novel DSA based on partial dynamic reconfiguration, made available by recent advancements in heterogeneous Systems-on-a-Chip (SoCs). Moreover, apart from the increased usability for non-experts in hardware design, Entree attains latency jitter up to hundreds of times lower than that obtained on embedded CPUs. The second theme is addressed by BlastFunction and Plaster, two frameworks for distributing hardware-accelerated algorithms over the Cloud and the Fog, respectively. The former exploits a registry-based OpenCL extension for FPGA time-sharing over large Cloud infrastructures. The latter proposes a set of event-driven Application Programming Interfaces (APIs) for splitting data-intensive tasks (such as those related to ML) over multiple Fog nodes powered by FPGAs. Once Edge, Fog, and Cloud devices have been enabled by the contributions listed so far, the third theme is targeted by the Virtual Sensor Domain-Specific Language (DSL). This extension of the C++ programming language allows Data Scientists to create abstract sensors that measure high-level concepts instead of raw figures. This is obtained by focusing, at the language level, only on how the Virtual Sensor should source, aggregate, and process the necessary data, without any knowledge of the actual Edge-Fog-Cloud interface fulfilling it.
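To give a flavor of the source, aggregate, and process separation just described, the following is a minimal sketch written in plain Python rather than in the actual C++-based DSL. Every name in it (`VirtualSensor`, `room_temperatures`, `comfort`, the 26.0 degree threshold) is a hypothetical illustration and not part of the Virtual Sensor DSL or of the toolchain built on it.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Iterable, List


@dataclass
class VirtualSensor:
    """Hypothetical abstract sensor: it declares WHAT to measure, not WHERE it runs."""
    source: Callable[[], Iterable[float]]      # where the raw readings come from
    aggregate: Callable[[List[float]], float]  # how the readings are combined
    process: Callable[[float], str]            # how the aggregate becomes a concept

    def read(self) -> str:
        # A real runtime could place each stage on an Edge, Fog, or Cloud node;
        # here everything runs locally, purely for illustration.
        readings = list(self.source())
        return self.process(self.aggregate(readings))


def room_temperatures() -> Iterable[float]:
    # Stand-in for raw sensor figures (degrees Celsius); assumed sample data.
    return [21.5, 22.0, 27.5, 29.0]


# The high-level concept measured is "comfort", not a raw temperature.
comfort = VirtualSensor(
    source=room_temperatures,
    aggregate=mean,
    process=lambda t: "comfortable" if t < 26.0 else "too warm",
)

print(comfort.read())
```

The point of the sketch is only the division of concerns: the author of the sensor states the three stages, and nothing in the definition names the devices that will execute them.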
Thanks to the Virtual Sensor DSL, a commercial toolchain for automatic code generation and workload distribution has been developed and released, completing the effort to ease access to hardware acceleration for non-experts in hardware design. Taken together, these contributions give a robust and positive answer to the overarching research question that guided the project throughout, paving the way towards extending the approach to ML models other than DT ensembles and to application fields other than the Internet of Things (IoT) targeted so far. Furthermore, the toolchains and frameworks presented in this dissertation are all designed with extensibility and contribution in mind, in an effort to build a solid foundation for bridging the gap between the two fields of expertise of ML and hardware acceleration. Finally, a long-term goal of the project reaches into the Artificial Intelligence (AI) world, bringing the same benefits of hardware acceleration obtained for ML to other pillar fields of this rising discipline, such as reasoning and knowledge representation.

Data Science experts can no longer ignore the irruption of Machine Learning (ML) in their research field. Models that can learn from examples are receiving ever-growing attention, driven by the Volume, Veracity, Velocity, and Variability of Big Data, because "More Data Beats Cleverer Algorithms" in ML. Nevertheless, ML models have earned the poor reputation of being neither explainable nor efficient. Fortunately, balancing the benefits and shortcomings of ML modeling has recently become possible thanks to the renaissance of specialization that Information Technology (IT) is experiencing, with Domain Specific Architectures (DSAs) progressively replacing software solutions on Central Processing Units (CPUs) and general-purpose Graphics Processing Units (gpGPUs). Unfortunately, designing, developing, programming, and deploying these DSAs on Application-Specific Integrated Circuits (ASICs) or, even worse, on the programmable logic of Field Programmable Gate Arrays (FPGAs) requires hardware design skills that rarely intersect with the area of expertise of Data Science and ML specialists. The research question underlying the entire Ph.D. project presented in this dissertation bears exactly on this trade-off and investigates "how ML experts can be helped to exploit the benefits of DSAs for hardware acceleration without having to master hardware design."
Three main research themes arise from this question: methods and tools for the automatic translation of ML models into hardware accelerators; methods and tools for easing access to the technological platforms for hardware acceleration; and programming models that support ML experts in remaining focused on their area of expertise, the Data, while exploiting hardware accelerators over distributed infrastructures. The first theme is addressed by Entree, the first toolchain in the State of the Art for computing the inference of large Decision Tree (DT) ensembles on embedded devices mounting an FPGA. Entree automatically converts models trained with scikit-learn into their hardware accelerator. It also supports the developer in adapting the accelerator so that it can be programmed on embedded FPGAs, even if it would statically require more resources than those available on the programmable logic, thanks to partial dynamic reconfiguration made available by recent improvements in heterogeneous Systems-on-a-Chip (SoCs). Moreover, Entree attains latency jitter more than a hundred times lower than what is possible on embedded CPUs. The second theme is addressed by BlastFunction and Plaster, two frameworks for distributing hardware-accelerated algorithms over the Cloud and the Fog, respectively. The former exploits an OpenCL extension for sharing FPGAs over Cloud infrastructures. The latter proposes a set of event-driven Application Programming Interfaces (APIs) for splitting data-transfer-intensive tasks (as typically happens in ML) over multiple Fog nodes equipped with FPGAs. Given the contributions described so far, which have extended the capabilities of Cloud, Fog, and Edge devices, the third theme is addressed by the Virtual Sensor Domain-Specific Language (DSL).
This extension of the C++ programming language allows Data Science experts to create abstract sensors capable of measuring concepts that are themselves abstract, focusing only on how the Virtual Sensor should obtain, aggregate, and process the necessary data, without any knowledge of the Edge-Fog-Cloud infrastructure that will eventually realize it. Thanks to the Virtual Sensor DSL, a toolchain for automatic code generation and automatic distribution of computational workloads has been developed and commercially released, completing the effort to ease access to hardware acceleration for non-experts in hardware design. Together, these contributions give a robust and positive answer to the research question that guided the entire project, paving the way for extending this approach to ML models other than DT ensembles and to fields other than the Internet of Things (IoT). Furthermore, the toolchains and frameworks presented in this dissertation are all designed with close attention to extensibility and independent contribution, in an effort to lay solid foundations for bringing the two fields of ML and hardware acceleration closer together. In conclusion, the long-term goals of the project point to the world of artificial intelligence, bringing the same benefits of hardware acceleration obtained for ML to other pillars of this discipline, such as reasoning and knowledge representation.