LLMs-powered terraform code generation: a study on model performance and prompt techniques
Yan, Ziqi; Soto Jaimes, Daniel
2024/2025
Abstract
Large Language Models (LLMs) have advanced rapidly in recent years, showing major improvements in reasoning, instruction-following, and code generation. However, their ability to generate correct and deployable Infrastructure-as-Code (IaC) remains underexplored. This thesis builds on the IaC-Eval benchmark to evaluate state-of-the-art (SOTA) LLMs, both commercial and open-source, across 150 infrastructure scenarios spanning six levels of architectural complexity. We compare prompting strategies and model families in terms of performance, cost-effectiveness, and characteristic error patterns. Our results reveal a significant maturation in capability: baseline pass@1 rates have more than doubled relative to the original benchmark, with Claude 3.7 Sonnet achieving 41% against the previous SOTA of 19%. Enhancement strategies push performance further, with Claude 4 Sonnet reaching a peak of 59% via iterative multi-turn correction. Retrieval-Augmented Generation (RAG) also proved highly effective, particularly for elevating smaller models such as GPT-4.1-mini into strong, cost-effective candidates, whereas Few-Shot and Chain-of-Thought approaches were often detrimental. While the relative ranking of these models aligns with other code-generation benchmarks such as SWE-bench, their absolute scores on this benchmark remain far from optimal, underscoring the unique challenges of the domain. The fundamental challenge remains a substantial "intent gap" between syntactically valid and semantically correct code. This gap defines the models' current role as powerful assisting tools that, when combined with human expertise for validation and correction, can greatly enhance DevOps workflow efficiency. Bridging this multifaceted gap, rooted in the difficulty of reasoning about complex inter-resource dependencies and of translating ambiguous natural language into precise specifications, is the critical frontier for making these models truly reliable for autonomous infrastructure automation.
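To illustrate the iterative multi-turn correction strategy mentioned in the abstract, the sketch below shows one way such a loop could be wired up; it is not the thesis's actual pipeline. The `generate` callable and the `max_turns` parameter are hypothetical stand-ins for an LLM API and an attempt budget, and the loop relies only on the real `terraform init` and `terraform validate` commands, so it catches syntax and provider-level errors rather than the benchmark's full semantic (intent) checks.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import Callable


def multi_turn_correction(
    generate: Callable[[str], str],  # hypothetical LLM call: prompt -> HCL text
    task_prompt: str,
    max_turns: int = 3,
) -> str:
    """Minimal sketch: generate Terraform, run `terraform validate`,
    and feed any reported errors back to the model for the next turn."""
    prompt = task_prompt
    code = ""
    for _ in range(max_turns):
        code = generate(prompt)
        with tempfile.TemporaryDirectory() as workdir:
            Path(workdir, "main.tf").write_text(code)
            # `terraform init -backend=false` fetches providers without backend/state setup;
            # `terraform validate` then checks syntax and internal consistency only.
            subprocess.run(
                ["terraform", "init", "-backend=false", "-input=false"],
                cwd=workdir, capture_output=True, text=True,
            )
            result = subprocess.run(
                ["terraform", "validate", "-no-color"],
                cwd=workdir, capture_output=True, text=True,
            )
        if result.returncode == 0:
            break  # validation passed; semantic checks against intent happen elsewhere
        # Append the validator's output so the next attempt sees its own mistakes.
        prompt = (
            f"{task_prompt}\n\nYour previous attempt:\n{code}\n\n"
            f"terraform validate reported:\n{result.stderr or result.stdout}\n"
            "Please return a corrected configuration."
        )
    return code
```

Feeding the validator output back into the prompt is what distinguishes multi-turn correction from simple resampling: each turn is conditioned on the concrete failure of the previous one.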
| File | Description | Size | Format | Access |
|---|---|---|---|---|
| 2025_10_Yan_Soto_Thesis_01.pdf | Main thesis | 13.83 MB | Adobe PDF | Online access restricted to authorized users |
| 2025_10_Yan_Soto_Executive Summary_02.pdf | Executive Summary | 1.4 MB | Adobe PDF | Online access restricted to authorized users |
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/243403