Enhancing contrastive learning with synthetic data from text-to-image models
BRUSINI, LEONARDO
2024/2025
Abstract
Synthetic data from generative models offers a promising alternative to real data in contrastive learning, yet models trained on it often lag in performance and fail to capture and generalize to the real-data distribution. This thesis investigates strategies to bridge this performance gap through structured synthetic data generation. We introduce SynthMPHN (Synthetic Multi-Positives and Hard-Negatives), a novel end-to-end generation pipeline tailored to producing synthetic contrastive training datasets. The pipeline uses LLMs to programmatically generate a base set of captions and a corresponding set of hard-negative captions that modify only the subject or specific attributes. A key contribution is the Multi-Synthetic-Model (MSM) approach to image generation: instead of relying on a single generator, we employ a diverse pool of distinct diffusion models to create images for both base and hard-negative captions. Through extensive experiments, we demonstrate that the MSM strategy effectively navigates the critical trade-off between recognizability and diversity, forcing the model to learn more generalized representations. We then investigate advanced contrastive learning objectives, including Multi-Positive (MP-CLIP) training, the StableRep+ framework, and the hard-negative-aware TripletCLIP loss. Furthermore, we introduce a hard-negative scheduling strategy, a curriculum learning approach that gradually introduces hard-negative samples into the training batch. Our findings show that these contributions yield substantial improvements over baseline synthetic-data models, particularly on zero-shot classification and retrieval tasks. The MSM approach combined with multi-positive objectives yielded the best overall multi-task performance, demonstrating significant gains in vision-language understanding. While frameworks incorporating hard negatives did not ultimately surpass the multi-positive-only models, our hard-negative scheduling strategy proved highly effective, substantially boosting their results. This work provides a comprehensive blueprint for creating structured, purpose-built synthetic datasets, while also identifying a key trade-off between optimizing for vision-language and vision-only tasks.
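To make the multi-positive objective and the hard-negative curriculum described in the abstract more concrete, the sketch below shows one way such a loss and schedule could be wired up in PyTorch. The function names (`mp_contrastive_loss`, `hard_negative_ratio`), the linear ramp-up schedule, and the batch layout are illustrative assumptions, not the exact SynthMPHN / MP-CLIP implementation from the thesis.

```python
# Minimal PyTorch sketch of a multi-positive contrastive objective with a
# hard-negative curriculum. Names and the linear ramp-up schedule are
# illustrative assumptions, not the thesis's exact implementation.
import torch
import torch.nn.functional as F


def mp_contrastive_loss(image_emb, text_emb, positive_mask, temperature=0.07):
    """Multi-positive InfoNCE: each image may match several captions.

    image_emb:     (N, D) L2-normalized image embeddings
    text_emb:      (M, D) L2-normalized text embeddings (M >= N when a base
                   caption has extra synthetic positives or hard negatives)
    positive_mask: (N, M) boolean, True where text j is a positive for image i
    """
    logits = image_emb @ text_emb.t() / temperature          # (N, M)
    log_prob = F.log_softmax(logits, dim=1)                  # softmax over all texts
    pos_per_image = positive_mask.sum(dim=1).clamp(min=1)
    # Average the log-likelihood over all positives of each image.
    loss = -(log_prob * positive_mask).sum(dim=1) / pos_per_image
    return loss.mean()


def hard_negative_ratio(step, total_steps, max_ratio=1.0, warmup_frac=0.5):
    """Curriculum: fraction of hard negatives to include in the batch,
    ramped linearly from 0 to max_ratio over the first warmup_frac of training."""
    progress = min(step / (warmup_frac * total_steps), 1.0)
    return max_ratio * progress


# Toy usage: 4 images, each with 2 positive captions and 2 hard-negative captions.
if __name__ == "__main__":
    N, P, H, D = 4, 2, 2, 512
    img = F.normalize(torch.randn(N, D), dim=-1)
    txt = F.normalize(torch.randn(N * (P + H), D), dim=-1)
    mask = torch.zeros(N, N * (P + H), dtype=torch.bool)
    for i in range(N):
        mask[i, i * (P + H): i * (P + H) + P] = True          # positives only
    print("hard-negative ratio at step 100/1000:", hard_negative_ratio(100, 1000))
    print("loss:", mp_contrastive_loss(img, txt, mask).item())
```

In this toy setup each image is paired with a block of synthetic captions, and the hard-negative fraction grows linearly during the first half of training, mirroring the curriculum idea of introducing hard negatives only after the model has learned from the easier positives.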
| File | Description | Size | Format |
|---|---|---|---|
| Tesi_Leonardo_Brusini.pdf (openly accessible online) | Thesis | 72.54 MB | Adobe PDF |
| Executive_Summary_Leonardo_Brusini.pdf (openly accessible online) | Executive Summary | 13.01 MB | Adobe PDF |
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/246635