The growing complexity of modern information systems, which offer services, infrastructures, and databases, makes it increasingly crucial to ensure their resilience. To this end, Chaos Engineering has emerged as a practical discipline for verifying system resilience under real-world operational conditions. This practice, which involves fault injection, was originally developed at Netflix and is supported by various tools from Netflix and other sources. Interest in the discipline is steadily growing in both industry and academia. We conducted a gray literature review to understand the key concepts of Chaos Engineering as perceived by practitioners who publish blogs and online guides. We followed the scientific method for conducting a literature review with gray sources: we first conceptualized the problem through the main academic contributions, then searched for and selected the gray sources, and finally extracted the key concept categories through a classification framework. Additionally, we carried out a proof-of-concept in an industrial environment, applying Chaos Engineering to evaluate the benefits obtained. Specifically, we focused on experiments designed to test a new data collection system, concentrating on potential failures that the industrial environment had already experienced. We analyzed the industrial background and studied the architecture of the target system. We designed the experiments based on suggestions from system experts and on the failure mode and effects analysis (FMEA) they performed on the system. Finally, we conducted the chaos experiments, observed the system, and reported the results.
Leveraging Chaos Engineering to enhance data collection infrastructures' resilience: an industrial proof-of-concept study
Fossati, Stefano
2023/2024
File | Description | Size | Format | Access |
---|---|---|---|---|
2024_07_Fossati_Executive Summary.pdf | Executive Summary | 692.76 kB | Adobe PDF | Authorized users only, from 21/06/2025 |
2024_07_Fossati_Tesi.pdf | Full text of the thesis | 3.52 MB | Adobe PDF | Not accessible |
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/222695