A holistic generative adversarial network-based methodology for synthetic banking dataset generation

Online fraud is raising increasing concern in the banking domain. This creates a need for sophisticated fraud detection systems, which require real data for training. Unfortunately, the availability of real transaction data is limited by financial institutions' strict privacy requirements. To cope with this issue, the research community focuses on generating synthetic data that resemble the characteristics of authentic data while avoiding the privacy implications of sharing them. In particular, generative adversarial networks (GANs) have proved to be an effective solution in several fields characterized by privacy constraints. Although existing works apply this concept to the banking domain, the literature lacks a holistic approach able to manage the peculiarities of banking data while complying with privacy standards. In this thesis, we propose a GAN-based methodology to address the scarcity of publicly available banking datasets. To achieve this, we perform an exploratory data analysis of a dataset of transactions obtained through a collaboration with a leading Italian bank. After gaining a deep understanding of the most relevant patterns, we create user profiles by aggregating features selected from the initial dataset. We pre-process these data through robust scaling and use them to train a GAN model extended with label smoothing. We then disaggregate the generated user profiles, reconstructing each feature of the transaction domain while preserving the original probability distribution of values and the relations between attributes. We validate our approach through visual inference and by evaluating the performance of a classifier trained on synthetic data and tested on real data, and vice versa. Our results show that the generated dataset closely reproduces the characteristics of the real one without reducing the number of frauds correctly classified by the algorithm.
In addition, we present a cost analysis aimed at helping financial entities minimize the costs of fraud detection tasks, together with an assessment of our approach with respect to the most widely adopted privacy-preserving standard, namely differential privacy.
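The two pre-processing and training ingredients named above, robust scaling and label smoothing, can be illustrated with a minimal sketch. It is not the implementation used in the thesis. The function names, the interquartile-range definition of robust scaling, the one-sided form of smoothing, and the smoothing value of 0.1 are all assumptions chosen for illustration.

```python
import math
from statistics import median, quantiles


def robust_scale(values):
    """Centre a feature on its median and divide by its interquartile range.

    Assumption: 'robust scaling' means median/IQR scaling, which is far less
    sensitive to outliers (e.g. very large transfers) than mean/std scaling.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = (q3 - q1) or 1.0  # guard against a constant feature
    med = median(values)
    return [(v - med) / iqr for v in values]


def smooth_labels(labels, smoothing=0.1):
    """One-sided label smoothing (assumed variant): 1.0 becomes 1.0 - smoothing.

    Labels equal to 0.0 (generated samples) are left unchanged.
    """
    return [y * (1.0 - smoothing) for y in labels]


def binary_cross_entropy(targets, predictions, eps=1e-12):
    """Mean binary cross-entropy, a common choice for the discriminator loss."""
    total = 0.0
    for t, p in zip(targets, predictions):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(targets)


# Hypothetical transaction amounts with one extreme value.
amounts = [10.0, 20.0, 30.0, 40.0, 5000.0]
scaled = robust_scale(amounts)

# Discriminator targets for two real profiles and one generated profile.
targets = smooth_labels([1.0, 1.0, 0.0])
```

With smoothed targets, a discriminator that outputs a probability very close to 1 for a real sample incurs a higher loss than one that outputs about 0.9. This is the usual motivation for label smoothing in GAN training: it discourages an overconfident discriminator.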

In recent years, online fraud has raised growing concern among financial institutions. As a consequence, the scientific community is developing sophisticated fraud detection systems whose training phase requires realistic data, the availability of which is limited by stringent privacy regulations. An innovative solution is the generation of synthetic data that mirror the characteristics of the original data while overcoming the limits that privacy imposes on sharing them. Generative adversarial networks (GANs) have proved to be an effective solution in various fields characterized by privacy constraints. Although some studies apply GANs to the banking world, the literature lacks a holistic approach that handles the peculiarities of banking data while monitoring privacy compliance. In this thesis, we propose a GAN-based methodology to address the limited availability of banking data. We first perform an exploratory analysis of a dataset of transactions obtained through a collaboration with a major Italian bank. After identifying the most relevant patterns, we aggregate the attributes of the dataset to build the profiles of the users involved. We then use these data to train a GAN with label smoothing. After generating a sample of synthetic users, we expand their characteristics to obtain a realistic transaction dataset, taking into account the real distribution of values and the existing relations between attributes. To validate our approach, we carry out a graphical analysis and evaluate the performance of a classifier trained on synthetic data and tested on real data, and vice versa. Our results show that the generated dataset incorporates the characteristics of the real one without worsening the number of frauds correctly classified by the algorithm.
In addition, we describe a cost analysis of fraud detection activities and demonstrate the robustness of our method with respect to differential privacy, a widely shared standard for privacy compliance.
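The validation protocol described in the abstract, training a classifier on synthetic data and testing it on real data, and vice versa, can be sketched as follows. This is only an illustration under stated assumptions. The abstract does not name the classifier, so a nearest-centroid classifier stands in for it. The function names, the two toy features, and every data value are hypothetical. The metric is the share of frauds correctly classified (recall on the fraud class).

```python
def fit_centroids(rows, labels):
    """Nearest-centroid classifier (stand-in model): one mean vector per class."""
    sums, counts = {}, {}
    for row, y in zip(rows, labels):
        acc = sums.setdefault(y, [0.0] * len(row))
        for i, v in enumerate(row):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}


def predict(centroids, row):
    """Return the label of the closest centroid (squared Euclidean distance)."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(row, centroid))
    return min(centroids, key=lambda y: dist(centroids[y]))


def fraud_recall(centroids, rows, labels, fraud=1):
    """Share of true frauds that the classifier flags as fraud."""
    frauds = [r for r, y in zip(rows, labels) if y == fraud]
    hits = sum(1 for r in frauds if predict(centroids, r) == fraud)
    return hits / len(frauds)


def cross_evaluate(real, synthetic):
    """Train on one dataset and test on the other, in both directions."""
    (real_x, real_y), (syn_x, syn_y) = real, synthetic
    tstr = fraud_recall(fit_centroids(syn_x, syn_y), real_x, real_y)
    trts = fraud_recall(fit_centroids(real_x, real_y), syn_x, syn_y)
    return {"train_synthetic_test_real": tstr, "train_real_test_synthetic": trts}


# Toy, made-up rows of two scaled features; label 1 marks a fraud.
real = ([[0.1, 0.2], [0.0, 0.1], [2.0, 1.9], [2.2, 2.1]], [0, 0, 1, 1])
synthetic = ([[0.2, 0.1], [0.1, 0.0], [1.9, 2.0], [2.1, 2.3]], [0, 0, 1, 1])
scores = cross_evaluate(real, synthetic)
```

If the synthetic data capture the patterns that separate frauds from legitimate transactions, the two recall values should be close to each other and to the recall obtained when training and testing on real data only.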