The growing popularity of online banking services in recent years has brought with it an increase in fraud driven by cyber attacks (e.g., malware, phishing, or trojans). These attacks aim to steal as much money as possible. To remain undetected, fraudsters continually look for new methods to perpetrate their crimes. For these reasons, research in fraud detection is constantly advancing. However, the sharing of real transactional data within the research community is severely limited by the privacy and security concerns of the banking context. Furthermore, few tools exist for generating synthetic data. In this thesis we present BankDataGen, a system for generating synthetic Internet banking transactions that uses data mining techniques to identify the most important features of an authentic dataset and reproduce them. Starting from a real dataset made available by an Italian banking group, we extract user profiles. Using these profiles, we perform a clustering based on the principal components, which allows us to group users by type of spending pattern and to extract their most relevant attributes. Finally, we apply distribution fitting techniques to the selected attributes. Notably, the synthetic dataset is not limited to the period covered by the real input data: forecasting methods make it possible to estimate the past and future trend of the transaction distribution. The final output of BankDataGen is a synthetic dataset that reflects the characteristics of the real one. In addition, fraudulent transactions can be inserted, generated on the basis of typical attacks against online banking users. We also implement a web application that provides a tool for generating synthetic datasets whose characteristics are borrowed from a real dataset.
We perform comparative tests between the reference data and the generated data to assess the quality of the results. The results are good, with a generally high degree of similarity between original and synthetic data.
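To make the distribution-fitting and comparison steps of the abstract concrete, the following is a minimal sketch and not BankDataGen's actual implementation. It assumes a log-normal model for transaction amounts and uses a two-sample Kolmogorov-Smirnov statistic as the similarity measure; the abstract names neither choice. The function names and the stand-in data are hypothetical.

```python
import math
import random
import statistics


def fit_lognormal(amounts):
    """Maximum-likelihood fit of a log-normal: mean and std of log(amount)."""
    logs = [math.log(a) for a in amounts]
    return statistics.fmean(logs), statistics.pstdev(logs)


def sample_lognormal(mu, sigma, n, rng):
    """Draw n synthetic amounts from the fitted log-normal."""
    return [rng.lognormvariate(mu, sigma) for _ in range(n)]


def ks_statistic(xs, ys):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    i = j = 0
    d = 0.0
    while i < len(xs) and j < len(ys):
        v = min(xs[i], ys[j])
        # Advance both samples past the current value, then compare the CDFs.
        while i < len(xs) and xs[i] <= v:
            i += 1
        while j < len(ys) and ys[j] <= v:
            j += 1
        d = max(d, abs(i / len(xs) - j / len(ys)))
    return d


rng = random.Random(0)
# Stand-in for real transaction amounts (hypothetical data, not from the thesis).
reference = [rng.lognormvariate(4.0, 0.8) for _ in range(2000)]
mu, sigma = fit_lognormal(reference)
synthetic = sample_lognormal(mu, sigma, 2000, rng)
print(f"mu={mu:.2f} sigma={sigma:.2f} KS={ks_statistic(reference, synthetic):.3f}")
```

A KS statistic close to 0 indicates that the synthetic amounts follow nearly the same empirical distribution as the reference amounts. In a fuller pipeline, a fit like this would presumably be run per user cluster and per selected attribute.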
Internet banking dataset generator for fraud detection benchmarking
MARIANI, EMANUELE
2014/2015
File | Description | Access | Size | Format |
---|---|---|---|---|
2016_04_Mariani.pdf | Thesis - final version | Accessible online only to authorized users | 4.2 MB | Adobe PDF |
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/119228