Model-based clustering techniques for performance parameter estimation

Capacity planning and performance models rely on understanding the past history of a computer system in order to predict its utilization under future workloads. According to the utilization law, throughput and utilization are linearly related, and measurements of both can be used to estimate the service time, a critical parameter of queuing network models. However, when configuration changes take place during an observation period, workload and utilization measurements tend to group into multiple linear structures. To estimate the service time of the underlying performance models, the different configurations have to be identified. In this thesis we present two algorithms for performance parameter estimation when the system's configuration is subject to change. Both methods are designed to tolerate outliers, using techniques borrowed from robust statistics. The first algorithm works solely on the observed data and simultaneously estimates the number of clusters, the cluster membership and the regression lines. This method obtains better results than previous work on clusterwise regression and consists of three phases: density-based clustering, splitting into linear clusters, and merging according to the distribution of the residuals. It has further been extended with a visual mining procedure that brings out the relationship between cluster membership and timestamps. The second approach exploits timestamps directly in the clustering process to detect change points and recurring patterns in anomalous observations, which might be due, for example, to scheduled maintenance activities. This method rests on stricter assumptions than the first one, but achieves fast execution times and accurate results. Moreover, it directly provides insight into the behavior of the system, with no need for visual inspection of the results.
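As an illustration of the utilization law mentioned above (U = X * S, with U the utilization, X the throughput and S the service time), the following minimal sketch estimates S for a single configuration. It is not either of the algorithms presented in the thesis. The function names and the measurements are hypothetical, and the median of per-sample ratios is only a simple stand-in for a robust estimator, chosen to show how one outlier can bias an ordinary least-squares fit.

```python
from statistics import median


def service_time_ls(throughput, utilization):
    """Least-squares slope of U = S * X, with the line forced through the origin."""
    num = sum(x * u for x, u in zip(throughput, utilization))
    den = sum(x * x for x in throughput)
    return num / den


def service_time_robust(throughput, utilization):
    """Median of the per-sample ratios U / X (illustrative robust stand-in)."""
    return median(u / x for x, u in zip(throughput, utilization) if x > 0)


# Synthetic measurements (assumed, not from the thesis): one configuration
# with S = 0.02 s per request, plus a single outlying observation at the end.
X = [10.0, 20.0, 30.0, 40.0, 25.0]   # throughput, requests per second
U = [0.20, 0.40, 0.60, 0.80, 0.95]   # utilization, fraction of busy time

s_ls = service_time_ls(X, U)          # pulled upward by the outlier
s_robust = service_time_robust(X, U)  # stays close to 0.02

print(f"least squares: {s_ls:.4f} s, median of ratios: {s_robust:.4f} s")
```

With several configurations in the same data set, a single fit of this kind would mix different slopes, which is why the observations have to be separated into linear clusters before the service time of each configuration can be estimated.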
