Exploiting instance segmentation for semantic anomaly detection

Anomaly detection is the problem of identifying data samples that do not belong to the normal data distribution. While the literature on the topic is well established, studies on the specific problem of semantic anomaly detection are lacking. We define semantic anomalies as violations of the logical or geometrical constraints that define the normal distribution. For example, a card from a standard 52-card deck is not supposed to show more than one suit, and it always has a symbol indicating its value and suit in the top-left and bottom-right corners. This work makes three main contributions: the first is the analysis of a state-of-the-art approach to unsupervised anomaly detection; the second is the development of a working implementation of this approach, which its authors published without releasing any open-source code; the third is the design of a novel algorithm that exploits instance segmentation to enhance the output of the previous work at inference time, while preserving its unsupervised nature. Our analysis highlights the limitations of the previous work, which stem mainly from the low expressive power of its architecture and from the nature of the model used for knowledge distillation, which is trained on a task that requires no semantic information. In our tests, our method outperforms the previous work when the assumptions it relies on hold, showing that instance segmentation does carry useful information for semantic anomaly detection. Future developments may use instance segmentation data at training time instead of inference time. Another promising direction is to perform knowledge distillation on a network that is specifically trained to extract semantic information from the input image. Finally, we believe that multimodal large language models might play a part in the future of semantic anomaly detection, given their ability to describe the content of an image in plain text.
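To make the definition of a semantic anomaly concrete, the playing-card example can be sketched as a pair of explicit checks over instance-level detections. This is only an illustration of the two kinds of constraint (logical and geometrical); it is not the algorithm proposed in this work, which remains unsupervised and does not rely on hand-written rules. The `Instance` type, the `semantic_violations` function, the normalized image coordinates (origin at the top-left, y growing downward), and the `corner` threshold are all assumptions introduced for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """Hypothetical segmented symbol: suit label and normalized center in [0, 1]."""
    suit: str
    cx: float  # horizontal center, 0 = left edge, 1 = right edge
    cy: float  # vertical center, 0 = top edge, 1 = bottom edge


def semantic_violations(instances, corner=0.25):
    """Return the constraint violations found on one card (empty list if none).

    `corner` is an assumed fraction of the card size that counts as a corner.
    """
    violations = []

    # Logical constraint: a card shows a single suit.
    suits = {inst.suit for inst in instances}
    if len(suits) > 1:
        violations.append("multiple suits: " + ", ".join(sorted(suits)))

    # Geometrical constraint: a symbol sits in the top-left and bottom-right corners.
    has_top_left = any(i.cx < corner and i.cy < corner for i in instances)
    has_bottom_right = any(
        i.cx > 1 - corner and i.cy > 1 - corner for i in instances
    )
    if not has_top_left:
        violations.append("missing top-left index")
    if not has_bottom_right:
        violations.append("missing bottom-right index")

    return violations


# A normal card: one suit, symbols in both corners.
normal_card = [
    Instance("hearts", 0.1, 0.1),
    Instance("hearts", 0.5, 0.5),
    Instance("hearts", 0.9, 0.9),
]

# An anomalous card: two suits and no symbol in the bottom-right corner.
anomalous_card = [
    Instance("hearts", 0.1, 0.1),
    Instance("spades", 0.5, 0.5),
]

print(semantic_violations(normal_card))
print(semantic_violations(anomalous_card))
```

Each symbol on the anomalous card looks perfectly normal in isolation; the anomaly only appears once the instances are considered together, which is why instance-level information is relevant to this problem.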
