Design and implementation of an efficient distributed stream processing system with MPI

Big Data technologies are becoming increasingly relevant because traditional data processing technologies cannot meet many of the new requirements. One of the most significant cases is the processing of data that arrives in continuous streams, which must be performed in real time while new data is constantly being received. The most prevalent approach to building systems and frameworks that support this kind of operation is called “cluster computing”, and one of its most prominent features is the use of programming models that allow concurrent distributed computation to be specified at a higher level of abstraction. However, these systems can suffer from performance issues because they do not use the resources available on the machines in the most efficient way. Other approaches, such as “high performance computing”, address this by remaining at a low level, but the resulting solutions are very problem specific and difficult to implement. This thesis aims to lay the foundations for combining the best of these approaches by developing the first prototype of an efficient data stream processing framework that provides very high performance by using low-level technologies while remaining general and relatively easy to use. To achieve this, Apache Flink, one of the best data stream processing frameworks, is analyzed so that its architecture and design choices, together with the Actor Model for concurrent computation, can inform the design of the proposed solution. The system is written in C, and the Message Passing Interface (MPI) is used for communication between the distributed computing nodes. The architecture and implementation of the proposed system are described, followed by the results of extensive testing performed to confirm its efficiency.

Big Data technologies are becoming increasingly relevant in the modern world because traditional data processing technologies cannot meet all the new requirements. One of the most significant cases is the processing of data in continuous streams, since results must be produced in real time while new data is constantly being received. The most widespread approach to building systems and frameworks that support this type of operation is called "cluster computing", and one of its most important features is the use of programming models that enable concurrent distributed computation, presented at a higher level of abstraction. However, these systems can have performance problems because they are unable to use all the resources available on the machines in the most efficient way. Other approaches, such as "high performance computing", handle this problem by remaining at a lower level, but this means that the solutions are very specific and difficult to implement. This thesis aims to take the best of these approaches by developing the first prototype of an efficient data stream processing framework that can deliver very high performance using low-level technologies while remaining flexible and relatively easy to use. To achieve this goal, Apache Flink, one of the best data stream processing frameworks, will be analyzed so that its architecture and design choices, together with the Actor Model for concurrent computation, can be used to design the proposed solution. The system will be written in C and, to facilitate communication between the distributed computing nodes, the Message Passing Interface will be used. The architecture and a description of the implementation of the proposed system will be shown, and the results of extensive tests carried out to confirm its efficiency will be presented.