Anomaly detection framework and deep learning techniques for zero-day attack in container based environment

ROSSOTTI, ALESSANDRO
2022/2023

Abstract

This thesis investigates unsupervised anomaly detection in Docker container environments, focusing on the identification of zero-day attacks. The study analyzes Common Vulnerabilities and Exposures (CVEs) and uses multivariate time series data obtained from system calls and CPU usage as the basis for anomaly detection. The methodology combines a literature survey, a data management system with a data collection and preprocessing pipeline, realistic workload simulation, attack simulation that exploits the corresponding vulnerability, and data capture using the sysdig and cAdvisor tools. To simulate realistic scenarios, we introduce a workload agent that generates concurrent user workloads tailored to the specific web server. These workloads mimic different user roles and behaviors, making the collected data more realistic. In addition, an attack agent simulates attacks using known exploits associated with the CVEs. These attacks include Denial of Service (DoS), resource exhaustion, and malicious data insertion. Our model factory contains 17 models: 4 traditional machine learning models (PCA, LOF, Isolation Forest, OneClassSVM) and 13 custom deep learning models, ranging from basic neural networks and autoencoders (Autoencoder, CNN, VAE) through recurrent neural networks and their variations (Autoencoder-LSTM, Bidirectional Autoencoder Resnet LSTM) to generative models (GAN) and attention-based models (TransformerAutoencoder). Together, these models form a toolkit for detecting anomalies. The models are evaluated using several metrics, with the Area Under the Receiver Operating Characteristic Curve (AUROC) as the primary measure, complemented by precision, recall, and accuracy, among others.
The study examines how data type, window size, and noise level affect model performance, and analyzes which model is the most promising for detecting zero-day attacks. A feature importance analysis was also conducted to assess the role of individual system call and CPU features in anomaly detection. The TransformerAutoencoder performed well across different web applications and data configurations, despite its computational cost. Traditional machine learning models also produced promising results, which highlights the trade-offs between simplicity, performance, and computational resources. The results show that the bag-of-syscall representation of raw system call data is effective for detecting zero-day attacks. The study faced several challenges, including generalizability of results, data quality, hyperparameter tuning, computational complexity, and the representation of syscall data; it discusses the resulting limitations and gives recommendations for future research. Future work could apply word embedding techniques to raw syscall data and investigate the impact of window size on anomaly detection more thoroughly. It could also explore more advanced deep learning architectures, ensemble learning methods, additional data sources, real-time adaptive learning techniques, and evaluation of the proposed approaches in real-world scenarios. This thesis provides a foundation for further work on unsupervised anomaly detection in Docker container environments and for the development of more effective detection methods.
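As an illustration of the bag-of-syscall representation mentioned above, the following minimal sketch counts how often each system call occurs in each window of a trace. It is not the thesis implementation: the function name `bag_of_syscalls`, the count-based, non-overlapping windows, the fixed vocabulary, and the sample trace are all assumptions made for this example.

```python
from collections import Counter
from typing import Dict, List, Sequence


def bag_of_syscalls(
    trace: Sequence[str], window_size: int, vocabulary: Sequence[str]
) -> List[List[int]]:
    """Count syscall occurrences in consecutive, non-overlapping windows.

    Assumption: a window is a fixed number of syscalls, not a time span.
    A trailing partial window is dropped.
    """
    index: Dict[str, int] = {name: i for i, name in enumerate(vocabulary)}
    rows: List[List[int]] = []
    for start in range(0, len(trace) - window_size + 1, window_size):
        counts = Counter(trace[start:start + window_size])
        row = [0] * len(vocabulary)
        for name, n in counts.items():
            if name in index:  # syscalls outside the vocabulary are ignored
                row[index[name]] = n
        rows.append(row)
    return rows


# Hypothetical trace; real input would come from a capture tool such as sysdig.
trace = ["open", "read", "read", "close", "open", "write", "write", "write"]
vocab = ["open", "read", "write", "close"]
features = bag_of_syscalls(trace, window_size=4, vocabulary=vocab)
print(features)  # [[1, 2, 0, 1], [1, 0, 3, 0]]
```

Each row is one feature vector per window, so a sequence of rows forms a multivariate time series that an anomaly detector could consume. The order of calls inside a window is discarded, which is the main simplification of this representation.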
Kabir-Querrec, Maelle
Guo, Shuai
ING - Scuola di Ingegneria Industriale e dell'Informazione
18 Jul 2023
Attached files

Rossotti_Alessandro_Master_Thesis____Politecnico_di_Milano.pdf
Access: not accessible
Size: 11.45 MB
Format: Adobe PDF

Alessandro_Rossotti_Executive_Summary_Politecnico_di_Milano.pdf
Description: Executive Summary
Access: not accessible
Size: 2.17 MB
Format: Adobe PDF

Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10589/211908