Multiple Augmented CAM: a general framework for interpreting CNN outputs through image augmentation and multi-frame super-resolution

This thesis was developed during an internship at STMicroelectronics in Agrate Brianza (MB). The work stems from the company's need to study techniques for interpreting the results provided by the Convolutional Neural Networks (CNNs) it uses for quality control. The company is a leader in the semiconductor sector and in chip production. Chips are obtained from silicon wafers, whose production is a long process composed of several steps. Defects can arise in wafers during any production phase, but they are especially problematic when arranged in specific patterns, because such patterns may indicate production issues that should be promptly recognized and fixed to avoid a large waste of resources. For this reason, expert production engineers analyse wafer images generated by inspection machines along the production line to recognize potentially defective patterns. Recently the company decided to automate this control phase using a CNN that classifies defective patterns automatically. There are 13 classes, and the architecture used in this work is based on Submanifold Sparse Convolutions in order to handle wafer images at very high resolution (20,000x20,000 pixels). Since recognizing the class is crucial for taking the right corrective actions, relying only on the model prediction is not sufficient. Interpretability methods are therefore essential to verify the presence of the predicted pattern. State-of-the-art interpretability techniques severely upsample (e.g. with an upsampling ratio of 32) a low-resolution heat-map computed from the input, obtaining a high-resolution heat-map which, superimposed on the input image, shows the regions that most influence the model output. Instead of upsampling a single low-resolution heat-map, our intuition is to aggregate several low-resolution heat-maps computed from augmented images of the input (i.e. images generated through data augmentation).
Therefore we propose Multiple Augmented CAM (MACAM), a general framework that aggregates low-resolution heat-maps, independently of the algorithm used to compute them, into a single high-resolution map using Multi-Frame Super-Resolution. Through the use of weights, the framework gives greater importance to maps generated from images that are more representative of the predicted class. We also study the effectiveness of selecting the transformations used during data augmentation, so as to include meaningful images related to the predicted class. Furthermore, we discuss the metrics for a fair comparison between methods that compute high-resolution heat-maps. We show that our approach generates heat-maps that are of higher quality and more localized on the predicted objects than those of state-of-the-art methods, providing an interpretability tool that helps to better understand CNN outputs. We discuss the results for real-world images in the ILSVRC dataset and for industrial sparse images of wafer defects provided by STMicroelectronics.
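
The aggregation idea described above can be illustrated with a minimal sketch. It is not the Multi-Frame Super-Resolution procedure used in the thesis: it swaps in a much simpler weighted shift-and-add scheme, and it assumes that the only augmentations are integer pixel translations with circular wrap-around. The function names (`upsample_nearest`, `simulate_low_res`, `aggregate_cams`), the use of block averaging to mimic a low-resolution heat-map, and the choice of weights are all hypothetical and serve only the illustration.

```python
import numpy as np


def upsample_nearest(cam, ratio):
    """Nearest-neighbour upsampling: each low-res cell becomes a ratio x ratio block."""
    return np.kron(cam, np.ones((ratio, ratio)))


def simulate_low_res(high_res, shift, ratio):
    """Toy stand-in for 'augment the input, then compute a low-res heat-map'.

    The high-res map is translated by `shift` (dy, dx) with wrap-around and
    block-averaged by `ratio`. A real pipeline would run the CNN and a CAM
    algorithm on the augmented image instead.
    """
    shifted = np.roll(high_res, shift=shift, axis=(0, 1))
    h, w = shifted.shape
    return shifted.reshape(h // ratio, ratio, w // ratio, ratio).mean(axis=(1, 3))


def aggregate_cams(low_res_cams, shifts, weights, ratio):
    """Weighted shift-and-add of low-res heat-maps into one high-res map.

    Assumption: `weights` reflect how representative each augmented image is
    of the predicted class (for example, the class score); they are
    normalised to sum to one before use.
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    h, w = low_res_cams[0].shape
    high_res = np.zeros((h * ratio, w * ratio))
    for cam, (dy, dx), wgt in zip(low_res_cams, shifts, weights):
        up = upsample_nearest(cam, ratio)
        # Undo the translation so every map is aligned with the original input.
        aligned = np.roll(up, shift=(-dy, -dx), axis=(0, 1))
        high_res += wgt * aligned
    return high_res


if __name__ == "__main__":
    ratio = 4
    truth = np.zeros((8, 8))
    truth[3, 3] = 1.0  # a single "defect" pixel in the high-res grid
    shifts = [(0, 0), (1, 0), (0, 1), (1, 1)]
    cams = [simulate_low_res(truth, s, ratio) for s in shifts]
    fused = aggregate_cams(cams, shifts, weights=[1, 1, 1, 1], ratio=ratio)
    print(np.unravel_index(np.argmax(fused), fused.shape))
```

In this toy setup each individual upsampled map can only localise the hot pixel to within a 4x4 block, whereas the aligned, weighted sum has its maximum at the pixel itself, because that is the only location covered by all four shifted blocks. This is the intuition behind aggregating several low-resolution maps rather than upsampling one.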

The thesis work presented here was carried out during an internship at the STMicroelectronics plant in Agrate Brianza (MB). The work arises from the company's need to investigate techniques for interpreting the results of the convolutional neural networks it uses for quality control. The company is a leader in the semiconductor sector and in chip production. Chips are obtained from slices of silicon, called "wafers", whose production is long and made up of many phases. Defects can usually appear in any production phase, but they are particularly problematic when they form precise patterns on the wafers, because these could indicate production malfunctions which, if not promptly recognized, cause a waste of resources. For this reason, in order to recognize the specific patterns, expert production engineers analyse wafer images produced by inspection machines along the production line. Recently the company decided to automate the control phase by using a convolutional neural network to classify the different defect patterns automatically. There are 13 defect classes, and the network used consists of several sparse convolutional layers in order to handle the high-resolution defect images (20,000x20,000 pixels). Since recognizing a defect class is crucial for taking the right corrective actions, it is not sufficient to rely on the class predicted by the network. For this reason, network interpretation methods are of fundamental importance for verifying the presence of the predicted pattern. State-of-the-art interpretation methods enlarge a suitably computed heat-map by a large factor (for example 32), obtaining a high-resolution map which, superimposed on the original image, makes it possible to visualize the regions that most influenced the network's prediction.
Instead of enlarging a single low-resolution heat-map, our intuition is to aggregate several low-resolution maps obtained from augmented images (produced from the original image through data augmentation). We therefore propose "Multiple Augmented CAM" (MACAM), a general framework that uses "Multi-Frame Super-Resolution" to aggregate low-resolution heat-maps, independently of the algorithm used to compute them, into a high-resolution heat-map that gives more importance, through the use of weights, to maps generated from images that are more representative of the predicted class. We also study the effectiveness of selecting, ad hoc for each image, the transformations used during image augmentation, so as to emphasize the predicted class more strongly. After discussing the most significant metrics for a fair comparison between methods that generate high-resolution heat-maps, we show that our approach generates maps that are of higher quality and more localized on the predicted object in the images than those of state-of-the-art methods. We discuss the results obtained on natural images from the ILSVRC dataset and on industrial sparse images of wafer defects provided by STMicroelectronics.