Noir: design, implementation and evaluation of a streaming and batch processing framework

Datasets have grown so large that they can no longer be analyzed using the resources of a single computer. To process them in a timely manner, computations need to be distributed across clusters of multiple machines. Unfortunately, programming distributed software systems is very difficult. Since the advent of distributed computing, researchers and practitioners have been striving to devise abstractions that are flexible and easy to use, yet efficient and scalable. At one extreme, the option is to implement ad-hoc solutions for each processing task, exploiting low-level facilities to handle communication and coordination between the different machines. This approach is very flexible and can extract as much performance as possible from the available resources. However, developing these custom solutions is usually time-consuming, and the resulting code can be very complex and hard to maintain. To address these drawbacks, many data processing frameworks, such as Apache Spark and Apache Flink, have been developed in recent years. These systems automatically handle the parallelization of the computations and provide the user with a rich set of features for easily implementing processing pipelines. These frameworks, however, cannot provide performance on par with that of ad-hoc solutions. This thesis presents Noir, a novel stream-processing framework implemented in Rust. Its objective is to fill the gap between ad-hoc solutions and distributed processing frameworks, providing better performance than the latter while maintaining their simplicity and ease of use. While offering expressiveness similar to that of Apache Flink, Noir achieves up to 30× Flink's throughput in our evaluation, and it rivals custom MPI solutions in some workloads.
