Design, implementation and performance analysis of RStream, a library for distributed stream and batch processing.
FINO, ALESSIO
2019/2020
Abstract
In recent years, the ever-increasing amount of data being gathered has created the need for distributed data processing libraries and frameworks capable of handling high volumes of data. Use cases such as data analytics, machine learning algorithms, and ETL processes require flexible and scalable systems with low implementation complexity, enabling rapid prototyping by delegating parallelization, process coordination, and network communication to the processing framework. Over the last decade, several distributed stream and batch processing frameworks, such as MapReduce, Apache Spark, and Apache Flink, were developed to satisfy these requirements. These systems are written in Java or Scala and rely on the Java Virtual Machine for code execution, offering high-level APIs that focus on usability and flexibility but compromise on efficiency and performance. The result of this design is distributed processing frameworks that are suitable only at very large scales, where horizontal scalability can offer better performance than non-distributed systems relying on the vertical scalability of a single machine. The focus of this thesis is the design and implementation of RStream, a library for distributed stream and batch processing implemented in Rust, which aims to offer a parallel computing system that is convenient to use at smaller scales and faster and more efficient at large scales, without compromising the usability and expressiveness of its APIs. We present the design and implementation choices made to achieve high performance and scalability, together with a suite of benchmarks comparing the performance of RStream with the state of the art in stream and batch processing, represented by Apache Flink, and with ad hoc solutions implemented in C++ using the MPI and OpenMP libraries for parallelization.
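To give a flavor of the high-level dataflow style the abstract describes, the sketch below implements a word count as a map/reduce-by-key pipeline using only the Rust standard library's iterators. This is purely illustrative: the function name and structure are assumptions of this sketch, not RStream's actual API, which is documented in the thesis itself.

```rust
// Hypothetical sketch of a dataflow-style pipeline in plain Rust.
// Frameworks such as the one described here expose similar operators
// (flat map, map, reduce by key) while handling parallelization and
// distribution behind the scenes; std iterators run sequentially.
use std::collections::HashMap;

fn word_count(lines: &[&str]) -> HashMap<String, usize> {
    lines
        .iter()
        .flat_map(|line| line.split_whitespace()) // "flat map": lines -> words
        .map(|w| w.to_lowercase())                // "map": normalize each word
        .fold(HashMap::new(), |mut acc, w| {      // "reduce by key": count per word
            *acc.entry(w).or_insert(0) += 1;
            acc
        })
}

fn main() {
    let lines = ["the quick brown fox", "the lazy dog"];
    let counts = word_count(&lines);
    println!("{:?}", counts.get("the")); // Some(2)
}
```

A distributed framework would execute each operator of such a pipeline in parallel across workers and machines, which is exactly the coordination work the abstract says is delegated to the processing framework.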
File: RStream.pdf (Adobe PDF, 3.44 MB), accessible online only to authorized users.
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/169285