Data replication in distributed systems and its impact on the performance of an on-disk key-value database

My thesis focuses on data replication in distributed databases, a broad topic that is currently discussed from many angles. Data replication is the process of storing the same data across several sites to improve availability, consistency, accessibility, and fault tolerance. One common use is disaster recovery, since replication reproduces data across multiple distinct zones. The challenge is to balance cost against additional storage and processing capacity, bandwidth against consistency, and response time against throughput. Starting from the theoretical definition of essential concepts such as Big Data and Cloud computing, the thesis concentrates on the multi-cloud environment: it identifies the leading cloud providers, examines the characteristics of each, and looks at their geographical distribution. It then compares NoSQL and SQL databases in depth, surveying the market-leading databases and introducing the CherryTable case study. After the differences and similarities of the examined databases have been observed, a database with characteristics similar to CherryTable is selected in order to test latency and throughput. The final goal is to observe, through a test, how CherryTable behaves when it is installed on a cluster of servers under different replication factors.
