3D Scene Understanding in Agricultural Environment
Braccini, Alessio
2023/2024
Abstract
Over the past decade, deep learning models have achieved state-of-the-art results in 2D computer vision tasks such as object detection, localization, classification, and scene understanding. Given their success in 2D, these models have more recently been applied to 3D scene understanding. We aim to adapt this technique to the agricultural field, especially vineyards, by developing a 3D scene understanding model that leverages multiple input sources, including text, RGB images, and LiDAR data. The agricultural domain presents distinct challenges, such as the limited availability of datasets and the unpredictable nature of vegetation. With this work, we want to create a tool that addresses the significant problem of the shrinking agricultural workforce by simplifying working conditions for farmers. By equipping robots with our 3D scene understanding technology, we aim to make crop monitoring more efficient and easier. Additionally, we can build a field history to track vine changes, helping reduce pesticide use by identifying plant disease patterns and productivity variations, and enabling timely farmer intervention. We start by comparing two models from the literature: CLIP2Scene, which performs 3D scene understanding by leveraging the knowledge of CLIP, and the Multimodal Interlaced Model (MIT), which performs 3D scene understanding using weakly supervised learning techniques, allowing it to train with less annotated data. Results show that both models work well when the data quality is high. Problems arose when we used one of the few datasets available in the agricultural domain, the BLT dataset: its data was acquired with a low-resolution LiDAR, which makes these models struggle and lowers their performance. To address this, we designed an Ensemble Model able to handle low-quality data. It combines CLIP2Scene, a state-of-the-art model capable of handling 3D data, with Zero-Shot-Net, an ad-hoc network designed to manage 2D and text data. Finally, we present a method capable of generating a complete map, of a vineyard in our case, using only GPS coordinates and 3D point clouds, giving a full view of the entire field from few, easily collectible data.
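The abstract does not detail how the Ensemble Model fuses its two branches. As a rough illustration only, here is a minimal sketch, assuming a simple late-fusion scheme in which per-point class logits from the 3D branch (CLIP2Scene-style) are averaged with logits lifted from the 2D/text branch (Zero-Shot-Net-style) wherever a point projects into the camera image; all function and parameter names are hypothetical, not the thesis's actual API.

```python
import numpy as np

def fuse_point_logits(logits_3d, logits_2d, valid_2d_mask, w_3d=0.5):
    """Late-fuse per-point class logits from a 3D branch with logits
    lifted from a 2D/text branch projected onto the same points.

    logits_3d:     (N, C) scores from the point-cloud model
    logits_2d:     (N, C) scores projected from the image/text model
    valid_2d_mask: (N,) bool, True where a point lands inside the image
    w_3d:          weight given to the 3D branch in the average
    """
    fused = logits_3d.copy()
    # Weighted average only where the 2D branch made a valid prediction;
    # points outside the camera frustum fall back to the 3D branch alone.
    fused[valid_2d_mask] = (
        w_3d * logits_3d[valid_2d_mask]
        + (1.0 - w_3d) * logits_2d[valid_2d_mask]
    )
    return fused.argmax(axis=1)  # (N,) per-point class labels
```

Averaging logits rather than hard labels lets a confident branch dominate where the other is uncertain, which is one plausible way an ensemble could compensate for low-resolution LiDAR.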
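Likewise, the map-generation method is only named here. The sketch below shows one plausible reading under simplifying assumptions: each scan's GPS fix is projected to local metric offsets with a flat-earth approximation (reasonable at field scale), and each point cloud is translated into a shared map frame and concatenated. Heading and fine registration are ignored; names are illustrative, not the thesis's implementation.

```python
import numpy as np

EARTH_RADIUS = 6_378_137.0  # metres, WGS-84 equatorial radius

def gps_to_local_xy(lat, lon, lat0, lon0):
    """Project GPS degrees to local East/North metric offsets from a
    reference fix, using a flat-earth approximation (fine for a field)."""
    x = np.radians(lon - lon0) * EARTH_RADIUS * np.cos(np.radians(lat0))
    y = np.radians(lat - lat0) * EARTH_RADIUS
    return x, y

def build_map(scans, gps_fixes):
    """Merge per-scan point clouds into one global vineyard cloud.

    scans:     list of (N_i, 3) arrays in the sensor frame
    gps_fixes: list of (lat, lon) degrees, one fix per scan
    """
    lat0, lon0 = gps_fixes[0]  # first fix becomes the map origin
    merged = []
    for pts, (lat, lon) in zip(scans, gps_fixes):
        x, y = gps_to_local_xy(lat, lon, lat0, lon0)
        # Translate the scan into the map frame (orientation ignored here)
        merged.append(pts + np.array([x, y, 0.0]))
    return np.vstack(merged)  # (sum N_i, 3) global map
```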
File | Access | Size | Format
---|---|---|---
2024_07_Braccini_Executive_Summary.pdf | openly accessible online | 5.07 MB | Adobe PDF
2024_07_Braccini_Tesi.pdf | openly accessible online | 25.61 MB | Adobe PDF
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/223146