As geospatial data continues to grow in size and complexity, applying Machine Learning and Data Mining techniques to geospatial analysis is becoming increasingly essential for solving real-world problems. Although research in this field has produced innovative methodologies over the last two decades, they are usually applied to specific situations and not automated for general use. Generalizing these methods and integrating them with Geographic Information Systems (GIS) is therefore necessary to support researchers and organizations in data exploration, pattern recognition, and prediction across the various applications of geospatial data. The lack of machine learning tools in GIS is especially evident for unsupervised learning and clustering. In this work we present a plugin, ready to be published, that we developed for the open-source software QGIS and that covers the entire cluster analysis process: (i) pre-processing, (ii) feature selection and clustering, and (iii) cluster evaluation. Our tool offers several improvements over the solutions currently available in QGIS and in other widespread GIS. Its expanded features let users tackle some of the most challenging problems of geospatial data, such as high dimensionality, poor data quality, and large data size. Another important objective of the research is accessibility and ease of use, since typical GIS users often lack a machine learning and computer science background. To assess the strengths and weaknesses of the program, we report numerous experiments on real-world data from the city of Milan. The datasets differ in nature (climatic, urban, and socio-demographic) and in size, ranging from fewer than 100 data points to almost 70000, with up to 109 numerical attributes.
Overall, the experimental phase shows that the plugin is adequately flexible and outlines possibilities for future development, to which the QGIS community can also contribute given the open-source nature of the project.
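The three stages named in the abstract can be illustrated with a minimal, standard-library-only Python sketch. This is not the plugin's code: the abstract does not say which algorithms the plugin implements, so z-score scaling, constant-column removal, k-means, and the silhouette coefficient are assumed here as common stand-ins, and every function name and data value is hypothetical.

```python
import math
import random

def standardize(rows):
    """(i) Pre-processing: rescale each column to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # A constant column has zero spread; divide by 1.0 instead to avoid 0/0.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]

def drop_constant(rows, tol=1e-12):
    """(ii) Feature selection, simplest stand-in: drop columns with no spread."""
    cols = list(zip(*rows))
    keep = [j for j, c in enumerate(cols) if max(c) - min(c) > tol]
    return [[r[j] for j in keep] for r in rows], keep

def kmeans(rows, k, iters=100, seed=0):
    """(ii) Clustering: plain Lloyd iterations from randomly sampled centers."""
    rng = random.Random(seed)
    centers = [list(r) for r in rng.sample(rows, k)]
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: math.dist(r, centers[c]))
                      for r in rows]
        for c in range(k):
            members = [r for r, l in zip(rows, new_labels) if l == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break
        labels = new_labels
    return labels

def silhouette(rows, labels):
    """(iii) Evaluation: mean silhouette coefficient over all points."""
    scores = []
    for i, (r, li) in enumerate(zip(rows, labels)):
        dists = {}
        for j, (q, lj) in enumerate(zip(rows, labels)):
            if i != j:
                dists.setdefault(lj, []).append(math.dist(r, q))
        others = [sum(d) / len(d) for l, d in dists.items() if l != li]
        if li not in dists or not others:
            scores.append(0.0)  # singleton cluster or single-cluster solution
            continue
        a = sum(dists[li]) / len(dists[li])  # mean distance within own cluster
        b = min(others)                      # mean distance to nearest other cluster
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical toy data: two well-separated groups plus one uninformative column.
points = [[1.0, 2.0, 7.0], [1.2, 1.9, 7.0], [0.8, 2.1, 7.0],
          [8.0, 9.0, 7.0], [8.2, 9.1, 7.0], [7.9, 8.8, 7.0]]
scaled = standardize(points)
selected, kept = drop_constant(scaled)
labels = kmeans(selected, k=2)
score = silhouette(selected, labels)
print(kept, labels, round(score, 3))
```

In a GIS setting, the rows would typically be the numerical attribute values of a vector layer's features, and the resulting labels would be written back as a new attribute for mapping.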
Integrating machine learning techniques into GIS software: development of a comprehensive and versatile QGIS plugin for cluster analysis on geospatial data
FOLINI, ANDREA
2020/2021
File | Description | Access | Size | Format
---|---|---|---|---
2021_12_Folini.pdf | Executive Summary and Thesis | Openly accessible on the internet | 6.1 MB | Adobe PDF
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/183530