Oversampling techniques to improve fraud detection

We live in a world where almost everyone has access to the internet, and the number of users grows every year. More and more of them use online banking services every day because these services are easy and fast, enabling payments in a few seconds. In this context, many individuals try to steal money from people using a variety of techniques. Fraud detection has therefore become a primary need for banks and financial institutions, since it can prevent losses of up to billions of euros every year for both customers and providers. The aim of this thesis is to present a framework for fraud detection in the context of credit card transactions, with a specific focus on how oversampling can improve performance. The framework addresses the major problems that arise in fraud detection: class imbalance, verification latency, and concept drift. In particular, we focused on the problem of class imbalance, testing and comparing different oversampling techniques for balancing the dataset, including a new technique that we developed, based on genetic algorithms. The dataset used is available on the Kaggle repository [30]. It contains 284,807 transactions, of which 492 are frauds, spanning a period of 48 hours. Each transaction is described by 31 features, 28 of which were anonymized to preserve the privacy of the cardholders, while the remaining three are Time, Amount, and Class. The work was carried out during my internship at the company Technology Reply, where I learned to use new tools that allowed me to conduct this thesis. In our experiments we found that, in our settings, the proposed oversampling solutions helped fraud detection, leading to better performance than the baseline. The results are limited to the dataset we used, but they are promising and should be tested on other datasets.
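
To make the class imbalance described above concrete, the following minimal sketch computes the fraud rate from the two counts given in the abstract and balances a small toy dataset by plain random oversampling. It is only an illustration under stated assumptions: the function names, the toy data, and the choice of random duplication are hypothetical, and this is not the genetic-algorithm technique proposed in the thesis.

```python
import random
from collections import Counter

# Class counts reported in the abstract: 284,807 transactions, 492 frauds.
N_TOTAL = 284_807
N_FRAUD = 492


def fraud_rate(n_fraud, n_total):
    """Fraction of transactions labelled as fraud."""
    return n_fraud / n_total


def random_oversample(rows, labels, minority=1, seed=0):
    """Duplicate randomly chosen minority rows until both classes have equal size.

    Plain random oversampling, shown only as an illustration (assumption):
    it is not the genetic-algorithm technique developed in the thesis.
    """
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    if not minority_idx:
        raise ValueError("no minority samples to oversample")
    majority_count = len(labels) - len(minority_idx)
    # Number of duplicated minority rows needed to match the majority class.
    n_extra = max(0, majority_count - len(minority_idx))
    extra = [rng.choice(minority_idx) for _ in range(n_extra)]
    new_rows = list(rows) + [rows[i] for i in extra]
    new_labels = list(labels) + [labels[i] for i in extra]
    return new_rows, new_labels


# Hypothetical toy data: 995 genuine transactions (0) and 5 frauds (1).
toy_labels = [0] * 995 + [1] * 5
toy_rows = [[float(i)] for i in range(len(toy_labels))]
balanced_rows, balanced_labels = random_oversample(toy_rows, toy_labels)

print(f"fraud rate in the dataset: {fraud_rate(N_FRAUD, N_TOTAL):.4%}")
print("toy class counts after oversampling:", Counter(balanced_labels))
```

With the counts from the abstract, frauds account for roughly 0.17% of all transactions, which is why a classifier trained on the raw data sees very few positive examples. In practice, oversampling of this kind is usually applied to the training split only, so that duplicated frauds do not leak into the evaluation data.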

We live in a world where almost everyone has access to the internet, with the number of users growing every year; more and more of them actively use online banking services every day because of how easy and fast they are, allowing payments to be made in a few seconds. In this context, many individuals try to take advantage of this to steal money using various kinds of techniques. Fraud detection has become a priority for banks and financial institutions, which can use it to prevent losses of billions of euros every year, both for themselves and for their customers. The aim of the thesis is the development of a system for fraud detection in the context of credit card transactions, with particular attention to how oversampling techniques can improve performance. The system is able to address and solve the main problems related to fraud detection: class imbalance, verification latency, and non-stationarity of the data. In particular, we concentrated on the problem of class imbalance, testing and comparing different oversampling techniques for balancing the dataset, including a new technique developed by us and based on genetic algorithms. The dataset we used, available on the Kaggle website [27], contains 284,807 transactions that cover a period of 48 hours and are characterized by 31 attributes, 28 of which are anonymized to preserve the privacy of the users, while the remaining ones are Time, transaction Amount, and Class. The work was carried out during my internship at the company Technology Reply, where I became acquainted with and learned to use new tools that allowed me to carry out this thesis. During our experiments we found that, in our settings, oversampling improved fraud detection, leading to better performance than the baseline algorithm. The results are limited to the dataset in our possession, but they are very promising and should be tested on other datasets.