Data reproduction in smart buildings utilising optimized sliding window based machine learning models

This thesis proposes and implements a machine-learning-based procedure for re-creating missing data in smart buildings. A dataset collected at various measurement points (in indoor spaces and in the HVAC system) of a net zero energy building (NZEB) serves as the case study, and two scenarios are considered. In the first scenario, missing data (gaps of up to 1 hour) are re-created using the previously recorded (ex ante) data from the same sensor together with the status flag of the HVAC system and the ambient conditions. In the second scenario, data from sensors located in two nearby indoor spaces are added as further inputs. For each scenario, several machine-learning pipelines with different prediction horizons are developed and optimized. For each pipeline, the Random Forests algorithm is first employed as the benchmark, and a feature selection procedure is then used to determine the most promising set of features. Afterwards, an algorithm and hyper-parameter optimization procedure (based on a genetic algorithm) is performed. Next, using the resulting optimal pipeline, a sliding-window-based training scheme is implemented and the size of the training window is optimized. Finally, various intervals with missing data are considered, and the estimates of the optimized pipelines are used to re-create the missing values. For scenario I, the optimized pipelines with prediction horizons of 5, 10, 15, 30, 45, and 60 minutes achieve a mean absolute relative deviation (MARD) of 0.08%, 0.10%, 0.13%, 0.22%, 0.33%, and 0.40%, respectively, on the test set. For scenario II, as expected, a notably higher forecasting performance is achieved, with a MARD of 0.05%, 0.06%, 0.07%, 0.10%, 0.17%, and 0.23% for the same prediction horizons.
Moreover, performing the feature selection and algorithm optimization procedures, together with the sliding-window-based training scheme, notably improves the forecasting performance. Accordingly, the proposed methodology can serve as an efficient data reproduction tool for intervals in which sensor failures occur in smart buildings.
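
To make the sliding-window training scheme and the MARD metric concrete, the following is a minimal sketch under stated assumptions, not the pipeline developed in the thesis. A one-feature least-squares line stands in for the optimized pipeline, MARD is assumed to follow the usual definition (the mean of |actual - predicted| / |actual|, expressed in percent), and the names `mard`, `fit_line`, and `sliding_window_forecast` are hypothetical. The window size and the horizon are expressed in samples; the sampling interval is not specified here and would need to be set to match the data.

```python
import math
from statistics import fmean


def mard(actual, predicted):
    """Mean absolute relative deviation in percent (assumed definition)."""
    return 100.0 * fmean(abs(a - p) / abs(a) for a, p in zip(actual, predicted))


def fit_line(xs, ys):
    """Least-squares fit y = a + b*x; a stand-in for the optimized pipeline."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return my - b * mx, b


def sliding_window_forecast(series, window, horizon):
    """Predict series[t] from series[t - horizon], refitting at every step.

    At each step the model is trained only on the `window` most recent
    (value, value `horizon` samples later) pairs whose targets were already
    observed at time t - horizon, so no future information leaks in.
    """
    actual, predicted = [], []
    for t in range(window + 2 * horizon - 1, len(series)):
        last = t - 2 * horizon  # newest training input with a known target
        xs = series[last - window + 1 : last + 1]
        ys = series[last - window + 1 + horizon : last + 1 + horizon]
        a, b = fit_line(xs, ys)
        predicted.append(a + b * series[t - horizon])
        actual.append(series[t])
    return actual, predicted


if __name__ == "__main__":
    # Synthetic indoor-temperature-like signal (illustrative data only).
    temps = [21.0 + 1.5 * math.sin(i / 20.0) for i in range(400)]
    for w in (12, 24, 48):  # candidate training-window sizes, in samples
        y_true, y_pred = sliding_window_forecast(temps, window=w, horizon=3)
        print(f"window={w:3d}  MARD={mard(y_true, y_pred):.3f}%")
```

Comparing the MARD obtained for several candidate window sizes, as in the loop at the end, mirrors the idea of optimizing the training window: a short window adapts quickly to recent conditions, while a long one averages over more history.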
