The expression "Big Data'' has become very popular in the last few years though it does not concern exclusively large volumes of data. In fact, it is more connected to the way the data needs to be treated and the consequential challenges, called the "Three V'', that stand for Volume (treatment of large datasets) , Velocity (quickness of analysis), and Variety (handle of unstructured data, such as texts, images, and videos). This paper discusses the use of clustering in this context, though not considering the challenge of variety. Clustering consists in the segmentation of a set of objects through the identification of features, so to group them accordingly as much as possible. By all means, this approach reduces the size of the problems involved, as a wide dataset can be handled as though as it were a small set of clusters. Furthermore, the paper describes efficient algorithms, which can be used to create the appropriate tools to allow quick data processes, thereby dealing effectively with the velocity challenge. For this purpose, the choice of software framework went for Hadoop, as it allows a cheap processing of large volumes of data and the handling of unstructured data. The logic upon which it is based is the parallelization of processes using a cluster of computers. For this purpose, the clustering algorithms have been developed through a specific programming model, i.e. MapReduce, since it allows the parallelization of tasks. Therefore, some of the current clustering algorithms have been converted to the MapReduce structure, while others have been developed straight away in that manor. Once the tools were designed, the testing was conducted on some simulated datasets. The first stage regarded the effectiveness, i.e. the capability of identifying correctly some unusually shaped clusters. Therefore, the used dataset were small-sized. Consequently, the efficiency testing aimed to cluster the big dataset more rapidly. For this stage, the used tool was an Amazon cluster of 5 computers. Although the tested volume was still pretty small, it is possible to estimate the performance changes as the dataset grow. As a matter of fact, one of the MapReduce peculiarities is its scalability, i.e. the capability to increase linearly the computational power as the resources grow. Hence, if the size of the cluster is proportional to the data volume, the performances are approximately constant. In conclusion, the design and the development of these new clustering algorithms in MapReduce combines the logics of two current classes of clustering algorithms. By all means, this approach has the advantages of both and gives, therefore, a new range of efficient analytical methodologies and consequential results.
Big Data improvements in cluster analysis
USUELLI, MICHELE
2011/2012
Abstract
The expression "Big Data'' has become very popular in the last few years though it does not concern exclusively large volumes of data. In fact, it is more connected to the way the data needs to be treated and the consequential challenges, called the "Three V'', that stand for Volume (treatment of large datasets) , Velocity (quickness of analysis), and Variety (handle of unstructured data, such as texts, images, and videos). This paper discusses the use of clustering in this context, though not considering the challenge of variety. Clustering consists in the segmentation of a set of objects through the identification of features, so to group them accordingly as much as possible. By all means, this approach reduces the size of the problems involved, as a wide dataset can be handled as though as it were a small set of clusters. Furthermore, the paper describes efficient algorithms, which can be used to create the appropriate tools to allow quick data processes, thereby dealing effectively with the velocity challenge. For this purpose, the choice of software framework went for Hadoop, as it allows a cheap processing of large volumes of data and the handling of unstructured data. The logic upon which it is based is the parallelization of processes using a cluster of computers. For this purpose, the clustering algorithms have been developed through a specific programming model, i.e. MapReduce, since it allows the parallelization of tasks. Therefore, some of the current clustering algorithms have been converted to the MapReduce structure, while others have been developed straight away in that manor. Once the tools were designed, the testing was conducted on some simulated datasets. The first stage regarded the effectiveness, i.e. the capability of identifying correctly some unusually shaped clusters. Therefore, the used dataset were small-sized. Consequently, the efficiency testing aimed to cluster the big dataset more rapidly. For this stage, the used tool was an Amazon cluster of 5 computers. Although the tested volume was still pretty small, it is possible to estimate the performance changes as the dataset grow. As a matter of fact, one of the MapReduce peculiarities is its scalability, i.e. the capability to increase linearly the computational power as the resources grow. Hence, if the size of the cluster is proportional to the data volume, the performances are approximately constant. In conclusion, the design and the development of these new clustering algorithms in MapReduce combines the logics of two current classes of clustering algorithms. By all means, this approach has the advantages of both and gives, therefore, a new range of efficient analytical methodologies and consequential results.File | Dimensione | Formato | |
File | Description | Size | Format
---|---|---|---
tesi.pdf (openly accessible online) | thesis paper | 1.92 MB | Adobe PDF
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/77983