Waste identification in sorting processes via weakly supervised video segmentation
Marelli, Andrea
2023/2024
Abstract
Solid waste generation is predicted to grow from 2.3 billion tonnes in 2023 to 3.8 billion tonnes by 2050, underscoring the critical need for effective waste management strategies. Computer vision systems have shown promise in improving the efficiency of waste sorting, reducing human error and operational costs. Waste image segmentation is a fundamental task in this context. While fully supervised methods are known for their efficacy in these tasks, the pixel-level annotations they require must be obtained by manually segmenting a large number of images, making them extremely costly to produce in both economic and temporal terms. For this reason, we propose a novel deep-learning framework that frames the problem as a weakly supervised video segmentation task. Manual sorting is usually employed to remove objects of unwanted categories from a stream of materials running on a conveyor belt, while the rest of the stream is retained. Our goal is to extract the information intrinsic to this process and represent it in a format suitable for training a network to segment the objects that need to be removed. To this end, I collaborated in mounting a dual-camera setup on a conveyor belt in a recycling plant, capturing videos before and after the manual removal. By training a classifier to distinguish between "before" and "after" images, the system learns to recognize the features specific to the "before" images, which correspond to the presence of objects that need to be removed. The proposed method leverages both the temporal and spatial consistency intrinsic to the data to identify and segment objects that must be removed from the belt, using only video-level tags as supervision. An innovative technique based on class activation maps (CAMs) and optical flow is introduced, pushing the classifier to generate CAMs that are coherent across nearby frames directly during the training phase.
In this way, the method allows for efficient and consistent segmentation without requiring pixel-level annotations. Our results demonstrate that our method outperforms traditional CAM-based methods by leveraging temporal consistency in videos directly during the training phase.
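The CAM/optical-flow consistency idea can be sketched as follows: the CAM of frame t is warped into frame t+1 using the flow between the two frames, and the discrepancy with the CAM of frame t+1 becomes a penalty term. This is a minimal NumPy illustration under assumed conventions (backward flow, nearest-neighbour warping, an L1 penalty), not the thesis implementation:

```python
import numpy as np

def warp_cam(cam, flow):
    """Warp a CAM from frame t to frame t+1 using backward optical flow.

    cam:  (H, W) activation map for frame t.
    flow: (H, W, 2) backward flow: for each pixel of frame t+1,
          the (dx, dy) offset of its source pixel in frame t.
    Nearest-neighbour sampling keeps the sketch dependency-free.
    """
    H, W = cam.shape
    ys, xs = np.mgrid[0:H, 0:W]
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, W - 1)
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, H - 1)
    return cam[src_y, src_x]

def temporal_consistency_loss(cam_t, cam_t1, flow):
    """Mean absolute difference between the warped CAM and the next CAM."""
    return np.abs(warp_cam(cam_t, flow) - cam_t1).mean()
```

During training, a term like this would be added to the classification loss so that gradients push the network toward CAMs that follow the motion of objects on the belt; a differentiable warp (e.g. bilinear sampling) would replace the nearest-neighbour lookup in practice.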
File | Description | Size | Format | Access
---|---|---|---|---
2024_07_Marelli_Executive_Summary.pdf | Executive summary text | 5.88 MB | Adobe PDF | Publicly accessible online from 30/06/2025
2024_07_Marelli_Tesi.pdf | Thesis text | 30.84 MB | Adobe PDF | Publicly accessible online from 30/06/2025
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/223845