Extending NoSQL to handle relations in a scalable way. Models and evaluation framework

With Web 2.0, the amount of data accessible on the web has grown considerably. We live in the era of Big Data, Big Users, and Cloud Computing. In this situation, traditional storage solutions such as RDBMSs have shown their limits when scaling over multiple nodes. Storage systems known as NoSQL databases are becoming increasingly important because they are designed to scale well. However, they are not the solution to every data management problem. In fact, the lack of standardization among the currently available NoSQL implementations forces developers to handle low-level data management issues, which makes programming NoSQL systems more complex than programming RDBMS solutions. A challenging research objective is therefore to improve the programmability and manageability of NoSQL systems while keeping their remarkable scalability and their capability of handling large volumes of data. This thesis aims to contribute toward this objective. In particular, we focus on how to render relations between entities in a NoSQL system without introducing join operators, which would impair scalability. We study two different approaches for doing so, called MinR and MaxR. MinR minimizes data replicas, while MaxR maximizes them. We also present a rigorous methodology to compare the two strategies. We describe and motivate a series of tests designed to investigate interesting aspects of the two techniques, and we report the results of running these tests on two different types of NoSQL database: MongoDB (document-based) and Cassandra (column-oriented). The main conclusion is that MaxR performs better than MinR with sparse relations. MaxR is also the right choice when small amounts of data are read frequently, whereas MinR is better suited when consistency must be maintained often. The measurements also show, for example, that MinR usually performs better than MaxR on Cassandra, whereas MaxR is the right choice on MongoDB.
Finally, we present the evaluation framework developed to execute the tests. The framework, developed in Microsoft .NET, can easily be extended with additional tests and adapted to other databases.

Web 2.0 has led to considerable growth in the amount of data available on the web. We live in the era of Big Data, Big Users, and Cloud Computing. In this situation, traditional storage solutions such as RDBMSs show their limits when distributed systems are required. Storage systems known as NoSQL are becoming increasingly important. However, they are not the solution to every distributed data management problem. Indeed, the lack of standard mechanisms in existing solutions confronts the developer with low-level data management problems that complicate programming. An interesting research objective is to improve these aspects, in which NoSQL systems are lacking, while preserving their remarkable aptitude for scalability. This thesis aims to offer a contribution in this direction. In particular, we focus on introducing the concept of relations between entities in NoSQL systems without using join operations, which are a weak point for scalability. We propose two alternative techniques for this, MinR and MaxR. MinR minimizes data replicas, while MaxR maximizes them. We also present a rigorous methodology for comparing the two models. We describe and motivate a series of tests designed to evaluate the most interesting aspects of the two techniques, and we report the results obtained by running these tests on two different NoSQL databases: MongoDB (document-based) and Cassandra (column-oriented). The main result is that MaxR is better when used with sparse relations and when frequent reads of small amounts of data are needed. MinR is better when consistency must be maintained. From the test results one can also conclude that the MinR model is more suitable for Cassandra and that, conversely, Mongo behaves better with MaxR. Finally, a test platform (Evaluation Framework) for running the tests is presented.
The framework is developed in the Microsoft .NET environment and can easily be extended with new tests and adapted for use with different databases.
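
The trade-off between minimizing and maximizing replicas can be illustrated with a small sketch. The abstract does not specify how MinR and MaxR are actually defined, so the following is only one plausible reading, assumed for illustration: the minimal-replica layout stores each entity once and keeps the relation as a list of keys, while the maximal-replica layout embeds a full copy of every related entity. The entities (authors and books), the in-memory dictionaries standing in for a key-value store, and all function names are hypothetical and are not taken from the thesis.

```python
from copy import deepcopy

# Hypothetical example data: authors, books, and a many-to-many "wrote" relation.
AUTHORS = {"a1": {"name": "Ada"}, "a2": {"name": "Bob"}}
BOOKS = {"b1": {"title": "Graphs"}, "b2": {"title": "Stores"}}
WROTE = [("a1", "b1"), ("a1", "b2"), ("a2", "b2")]


def build_min_replica(authors, books, wrote):
    """Assumed minimal-replica layout: each entity is stored once and the
    relation is kept as a list of book keys inside the author record."""
    store = {key: {**value, "book_ids": []} for key, value in authors.items()}
    store.update({key: dict(value) for key, value in books.items()})
    for author_id, book_id in wrote:
        store[author_id]["book_ids"].append(book_id)
    return store


def build_max_replica(authors, books, wrote):
    """Assumed maximal-replica layout: every author record embeds a full
    copy of each related book, so no second lookup is needed on reads."""
    store = {key: {**value, "books": {}} for key, value in authors.items()}
    for author_id, book_id in wrote:
        store[author_id]["books"][book_id] = deepcopy(books[book_id])
    return store


def read_titles_min(store, author_id):
    """Return (titles, lookups): one lookup for the author plus one per book."""
    author = store[author_id]
    titles = [store[book_id]["title"] for book_id in author["book_ids"]]
    return titles, 1 + len(author["book_ids"])


def read_titles_max(store, author_id):
    """Return (titles, lookups): a single lookup fetches the embedded copies."""
    author = store[author_id]
    return [book["title"] for book in author["books"].values()], 1


def rename_book_min(store, book_id, title):
    """Update the single stored copy; returns the number of writes."""
    store[book_id]["title"] = title
    return 1


def rename_book_max(store, book_id, title):
    """Update every embedded copy to stay consistent; returns the number of writes."""
    writes = 0
    for author in store.values():
        if book_id in author["books"]:
            author["books"][book_id]["title"] = title
            writes += 1
    return writes
```

Under these assumptions, reading an author's books costs one lookup per related entity in the minimal-replica layout but a single lookup in the maximal-replica layout, while an update touches one record in the former and every embedded copy in the latter. This mirrors, in a very simplified way, the read-versus-consistency trade-off summarized in the abstract; it says nothing about the measured behavior on MongoDB or Cassandra.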