Machine learning techniques for the estimation of the soil moisture from satellite data

This study evaluates and compares machine learning algorithms for estimating surface soil moisture from satellite data. It aims to identify the most effective architecture for this purpose and to build a globally applicable "tool" that can be used in areas without in-situ stations. The chosen study area is the TxSON network in Texas, USA, characterized by arid conditions, uniform features, and sparse vegetation. The research covers a four-year period from January 1, 2018, to December 31, 2021. For the analysis, dual-polarized radar backscatter images (VH and VV polarizations) were extracted from Sentinel-1, while the red and near-infrared bands from Sentinel-2 were used to calculate the Normalized Difference Vegetation Index (NDVI). In total, 115 satellite observations were collected. In addition to satellite data, the study incorporates in-situ data from the ISMN database to retrieve hourly soil moisture time series, which are later used to train and refine the machine learning models. The collected data were then assembled into a database and partitioned into training and inference sets. The work evaluates various ML algorithms, including linear regression, Random Forest (RF), Support Vector Machine (SVM), Gaussian Process Regression (GPR), Multi-Layer Perceptron (MLP), and others, tuning the hyperparameters of each model to achieve the lowest possible Root Mean Square Error (RMSE), which serves as the measure of prediction accuracy. However, after completing the entire process and analyzing the outcomes, the research acknowledges that the results did not meet the intended objectives.
In fact, the most noteworthy finding is that this workflow, which trains algorithms to predict soil moisture values, is effective when applied in a "localized" approach. Training an ML model on a specific site and using it to accurately predict values in a different area, as the initial goal of a global tool requires, appears to be infeasible.
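
As a minimal sketch of the two quantities named above, the NDVI can be computed from the red and near-infrared reflectances as (NIR - Red) / (NIR + Red), and the RMSE as the square root of the mean squared difference between observed and predicted soil moisture. The function names and sample values below are illustrative assumptions and are not taken from the study.

```python
import numpy as np

def ndvi(red, nir):
    """NDVI = (NIR - Red) / (NIR + Red), computed element-wise."""
    red = np.asarray(red, dtype=float)
    nir = np.asarray(nir, dtype=float)
    denom = nir + red
    # Pixels where both bands are zero are left as NaN instead of dividing by zero.
    out = np.full(denom.shape, np.nan)
    np.divide(nir - red, denom, out=out, where=denom != 0)
    return out

def rmse(y_true, y_pred):
    """Root Mean Square Error between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical reflectances and soil moisture values, for illustration only.
print(ndvi([0.1, 0.0], [0.5, 0.0]))
print(rmse([0.1, 0.2], [0.1, 0.4]))
```

A lower RMSE indicates predictions closer to the in-situ measurements, which is why it can serve as the target when tuning hyperparameters.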

The objective of this study is to evaluate and compare different types of machine learning algorithms for accurately estimating surface soil moisture using satellite data. The work aims to identify the most effective architecture for this purpose, seeking to create a globally applicable "tool" that can be used in areas where no in-situ stations are present. The chosen study area is the TxSON network in Texas, USA, characterized by arid conditions, uniform features, and sparse vegetation. The research covers a four-year period, from January 1, 2018, to December 31, 2021. For this analysis, dual-polarized radar backscatter images (VH and VV polarizations) were extracted from Sentinel-1, while the red and near-infrared bands from Sentinel-2 were used to calculate the Normalized Difference Vegetation Index (NDVI). In total, 115 satellite observations were collected. In addition to satellite data, the study also incorporates the hourly soil moisture time series available in the ISMN database. These data are subsequently used to train the machine learning models. The collected data were then split into training and inference subsets. The work evaluates several machine learning algorithms, including linear regression, Random Forest (RF), Support Vector Machine (SVM), Gaussian Process Regression (GPR), Multi-Layer Perceptron (MLP), and others, with the aim of optimizing their hyperparameters to obtain the lowest possible Root Mean Square Error (RMSE), which serves as a measure of the accuracy of the models' predictions. However, after carrying out the entire process and analyzing the results, it was found that the results are not in line with the stated objectives.
In fact, the most significant finding is that this workflow proves effective when applied in a "localized" approach. Training a machine learning model for a specific site and using it to accurately predict values in a different area, in order to reach the initial goal of creating a global tool, appears to be infeasible.
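
The contrast between a "localized" evaluation and a cross-site one can be sketched as follows. This is not the study's pipeline. It uses entirely synthetic data for two hypothetical sites whose relationship between the features (VV, VH, NDVI) and soil moisture differs by assumption, and a plain least-squares linear model stands in for the algorithms the work compares. All coefficients, sample sizes, and function names are invented for illustration.

```python
import numpy as np

def make_site(rng, n, coefs, intercept, noise=0.01):
    """Synthetic site: columns are VV (dB), VH (dB), NDVI; target is soil moisture."""
    X = np.column_stack([
        rng.uniform(-15, -5, n),    # hypothetical VV backscatter range
        rng.uniform(-25, -15, n),   # hypothetical VH backscatter range
        rng.uniform(0.1, 0.5, n),   # hypothetical NDVI range
    ])
    y = intercept + X @ np.asarray(coefs) + rng.normal(0, noise, n)
    return X, y

def fit_linear(X, y):
    """Ordinary least squares with an intercept term."""
    A = np.column_stack([np.ones(len(X)), X])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

def predict(w, X):
    return w[0] + X @ w[1:]

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

rng = np.random.default_rng(0)
# Assumption: the two sites follow different feature-to-moisture relationships.
X_a, y_a = make_site(rng, 200, [0.010, 0.005, 0.10], 0.30)
X_b, y_b = make_site(rng, 200, [0.020, 0.002, 0.05], 0.45)

# Train on the first 150 samples of site A only.
w = fit_linear(X_a[:150], y_a[:150])
local_rmse = rmse(y_a[150:], predict(w, X_a[150:]))  # held-out data, same site
cross_rmse = rmse(y_b, predict(w, X_b))              # different site

print(f"local RMSE: {local_rmse:.3f}, cross-site RMSE: {cross_rmse:.3f}")
```

In this toy setting the error on the unseen site is larger than the held-out error at the training site, because the model has only learned the first site's relationship. That is the same qualitative pattern the abstract reports, although the numbers here carry no empirical meaning.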