Evaluation of attention-guided training strategies for vision transformers in colonoscopy
DI STEFANO, LUCA
2024/2025
Abstract
Colonoscopy is a critical procedure for detecting and characterizing colorectal lesions; however, deep learning models applied in this domain often operate as “black boxes” and require extensive annotated data, a resource that is scarce in medical imaging. In this thesis, we introduce a novel framework that integrates bounding-box-based attention supervision with Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning approach, using a Vision Transformer backbone (DINOv2). By steering the model’s self-attention toward clinically relevant lesion regions and reducing the number of trainable parameters, we aim to enhance both interpretability and efficiency in data-constrained colonoscopy settings. We conduct extensive experiments on a colonoscopic image dataset derived from PIBAdb. Our findings indicate that bounding-box supervision generally produces more focused and interpretable attention maps and that, in certain configurations (specifically, when guidance is applied to one or two early Transformer layers), it can modestly improve classification accuracy over a baseline model trained with standard cross-entropy. However, imposing supervision on deeper or final layers proves suboptimal and can degrade performance considerably in some trials. Additionally, while LoRA effectively reduces the number of learnable parameters, our current configurations exhibit a noticeable drop in classification accuracy compared to fully fine-tuned models, suggesting that additional hyperparameter optimization is required. Overall, this work highlights important trade-offs between localized attention and global context in colonoscopic image classification. It further underscores the need for refined strategies, such as adaptive bounding-box constraints and targeted LoRA implementations, to develop deep learning systems that are both interpretable and robust, ultimately facilitating safer and more efficient adoption in clinical workflows.

File | Description | Access | Size | Format
---|---|---|---|---
2025_04_DiStefano_Thesi.pdf | thesis | openly accessible online | 9.15 MB | Adobe PDF
2025_04_DiStefano_ExecutiveSummary.pdf | executive summary | openly accessible online | 3.2 MB | Adobe PDF
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/236394