An empirical study on synthetic image generation techniques for object detectors

Convolutional Neural Networks are a very powerful machine learning tool that has outperformed other techniques in image recognition tasks. The biggest drawback of this method is the massive amount of training data it requires, since producing training data for image recognition tasks is very labor intensive. To tackle this issue, different techniques have been proposed to generate synthetic training data automatically. These synthetic data generation techniques can be grouped into two categories: the first category generates synthetic images using computer graphics software and CAD models of the objects to recognize; the second category generates synthetic images by cutting the object from one image and pasting it onto another image. Since both techniques have their pros and cons, a more in-depth investigation of the two approaches is of interest to industry. A common use case in industrial scenarios is detecting and classifying objects inside an image. Different objects belonging to a class relevant in industrial scenarios are often indistinguishable from one another (for example, they are all the same component). For these reasons, this thesis work aims to answer the research question “Which technique is more suitable for generating synthetic images for training object detectors in industrial scenarios?”. In order to answer the research question, two synthetic image generation techniques, one from each of the two categories, are proposed. The proposed techniques are tailored for applications where all the objects belonging to the same class are indistinguishable, but they can also be extended to other applications. The two synthetic image generation techniques are compared by measuring the performance of an object detector trained on synthetic images and evaluated on a test dataset of real images. The performance of the two synthetic data generation techniques when used for data augmentation has also been measured.
The empirical results show that the CAD model generation technique works significantly better than the Cut-Paste generation technique when synthetic images are the only source of training data (61% better), whereas the two generation techniques perform equally well as data augmentation techniques. Moreover, the empirical results show that the models trained using only synthetic images perform almost as well as the model trained using real images (7.4% worse), and that augmenting the dataset of real images with synthetic images improves the performance of the model (9.5% better).
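To make the second category concrete, the following is a minimal sketch of a cut-paste step, not the implementation used in this thesis. It assumes grayscale images stored as nested Python lists and a boolean mask marking the object pixels; the names `cut_paste` and `random_sample` are hypothetical. It shows why the approach removes manual labeling: the bounding box of the pasted object is known at generation time.

```python
import random
from typing import List, Tuple

Image = List[List[int]]          # grayscale image as rows of pixels (assumption)
Mask = List[List[bool]]          # True where the patch pixel belongs to the object
Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), inclusive


def cut_paste(background: Image, patch: Image, mask: Mask,
              top: int, left: int) -> Tuple[Image, Box]:
    """Paste the masked pixels of `patch` onto a copy of `background`.

    Returns the composite image and the bounding box of the pasted pixels,
    which serves as the automatically generated detection label.
    """
    height, width = len(background), len(background[0])
    composite = [row[:] for row in background]  # leave the input untouched
    xs, ys = [], []
    for r, (patch_row, mask_row) in enumerate(zip(patch, mask)):
        for c, (value, keep) in enumerate(zip(patch_row, mask_row)):
            y, x = top + r, left + c
            # Copy only object pixels that fall inside the background.
            if keep and 0 <= y < height and 0 <= x < width:
                composite[y][x] = value
                xs.append(x)
                ys.append(y)
    if not xs:
        raise ValueError("object lies entirely outside the background")
    return composite, (min(xs), min(ys), max(xs), max(ys))


def random_sample(background: Image, patch: Image, mask: Mask,
                  rng: random.Random) -> Tuple[Image, Box]:
    """Paste the object at a random position that keeps it fully visible."""
    top = rng.randrange(len(background) - len(patch) + 1)
    left = rng.randrange(len(background[0]) - len(patch[0]) + 1)
    return cut_paste(background, patch, mask, top, left)


# Toy example: a 4x5 empty background and an L-shaped 2x2 object.
background = [[0] * 5 for _ in range(4)]
patch = [[9, 9], [9, 9]]
mask = [[True, False], [True, True]]
image, box = cut_paste(background, patch, mask, top=1, left=2)
```

A realistic pipeline would also vary scale, rotation, and blending at the object boundary; those steps are omitted here to keep the sketch short.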

Convolutional neural networks are a very powerful machine learning tool that has outperformed other techniques in image recognition. The biggest drawback of this method is the enormous amount of data required to train these models, since producing training data for image recognition requires a great deal of manual work. To address this problem, several techniques have been proposed for the automatic generation of synthetic training data. These synthetic data generation techniques can be grouped into two categories: the first category generates synthetic images using computer graphics software and CAD models of the objects to recognize; the second category generates synthetic images by cutting the object from one image and pasting it onto another image. Since both techniques have their pros and cons, a more in-depth investigation of the two approaches is of interest to industry. A common application of computer vision systems in industrial environments is the recognition of objects inside an image. Different objects typical of industrial environments are often indistinguishable (for example, they are all the same component). For these reasons, this thesis work sets out to answer the research question “Among CAD model generation techniques, cut-paste generation techniques, and a combination of the two, which technique is most suitable for generating images for training object detectors in industrial scenarios?”. To answer the research question, two synthetic image generation techniques, one from each of the two categories, are proposed. The proposed techniques have been designed specifically for applications in which all the objects belonging to the same class are indistinguishable, but they can also be extended to other applications.
The two synthetic image generation techniques are compared by measuring the performance of an object detector trained on synthetic images and tested on a set of real images. The performance of the two image generation techniques when used to augment a set of real images has also been measured. The empirical results show that the CAD model generation technique performs significantly better than the Cut-Paste generation technique when synthetic images are the only source of training data (61% better), whereas the two techniques perform similarly for data augmentation. Moreover, the empirical results show that the models trained using only synthetic images perform almost as well as the model trained using real images (7.4% worse), and that augmenting the dataset of real images with synthetic images improves the performance of the model (9.5% better).