Automatic program repair for breaking dependency updates with large language models

External libraries are widely used to speed up software development. Like any software component, they evolve over time: new features are introduced and old ones are deprecated or removed. When a new library version contains breaking changes, every client that depends on it must be adapted to avoid disruptions. A dependency update that breaks its clients in this way is called a Breaking Dependency Update. Repairing such breakages is challenging and time-consuming because the cause lies in the dependency, while the fix must be applied to the client codebase. Automatic Program Repair (APR) is a research area that develops techniques for repairing code failures without human intervention. With the advent of Large Language Models (LLMs), learning-based APR techniques have improved significantly on software repair tasks. However, their effectiveness on Breaking Dependency Updates remains unexplored. This thesis investigates the efficacy of an LLM-based APR approach on Breaking Dependency Updates and examines the impact of different components on the model’s performance and efficiency. It focuses on two components: the API differences between the old and new versions of the dependency, and a set of error-type-specific repair strategies. Experiments on a subset of BUMP, a new benchmark for Breaking Dependency Updates with a strong focus on build failures, show that a naive approach to these client breakages is insufficient: additional context from the dependency changes is necessary. Furthermore, error-type-specific repair strategies are essential for repairing some blocking failures that otherwise prevent the tool from completely repairing the projects. Finally, GPT-4, Gemini, and Llama exhibit similar efficacy but differ significantly in cost-efficiency: GPT-4 has the highest cost per repaired failure among the tested models, almost 30 times that of Gemini.

External libraries are widely used to accelerate software development. However, like any software component, they are updated over time, introducing new features and deprecating or removing obsolete ones. When a library introduces non-backward-compatible changes (breaking changes), all its clients must be updated to avoid errors or bugs. Such an update is called a Breaking Dependency Update. Resolving these errors is complex and time-consuming, because the error originates in the library while the fix must be applied to the client code. Automatic Program Repair (APR) is a research area focused on developing techniques that fix code errors without human intervention. With the advent of Large Language Models (LLMs), learning-based APR techniques have improved considerably on software repair tasks. However, their effectiveness on Breaking Dependency Updates remains unexplored. This thesis investigates the effectiveness of an LLM-based APR approach on Breaking Dependency Updates and examines the impact of the different components on the model's performance and efficiency. Attention is directed at the API differences between the versions of the dependency and at a set of repair strategies specific to each error type. Experiments conducted on a subset of BUMP, a new benchmark for Breaking Dependency Updates with a strong focus on compile-time errors, show that the approach typically used in APR is insufficient for this kind of error. Additional context about the changes in the dependency is required. Moreover, error-type-specific repair strategies are essential for overcoming some obstacles that prevent the tool from fully repairing the projects. Finally, our research found that GPT-4, Gemini, and Llama show similar levels of effectiveness but differ significantly in cost-efficiency, with GPT-4 having the highest cost per repaired error among the tested models, almost 30 times that of Gemini.
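
The cost-efficiency comparison above relies on a cost per repaired failure. The following is a minimal sketch of how such a metric could be computed, assuming it is simply the total spend on model calls divided by the number of failures the tool repaired. The `RunSummary` structure, its field names, and both helper functions are hypothetical and introduced only for illustration; they are not taken from the thesis or from BUMP.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSummary:
    """Aggregate outcome of one model on a benchmark subset (hypothetical structure)."""
    model: str
    total_cost_usd: float    # assumed: total spend on model calls for the whole run
    repaired_failures: int   # assumed: number of failures the tool repaired


def cost_per_repaired_failure(run: RunSummary) -> float:
    """Return the spend per repaired failure; infinite if nothing was repaired."""
    if run.repaired_failures == 0:
        return float("inf")
    return run.total_cost_usd / run.repaired_failures


def cost_ratio(a: RunSummary, b: RunSummary) -> float:
    """Return how many times more expensive `a` is than `b` per repaired failure.

    Raises ValueError if `b` has zero cost per repaired failure, since the
    ratio is undefined in that case.
    """
    denominator = cost_per_repaired_failure(b)
    if denominator == 0:
        raise ValueError("reference run has zero cost per repaired failure")
    return cost_per_repaired_failure(a) / denominator


if __name__ == "__main__":
    # Placeholder numbers for illustration only, not results from the thesis.
    expensive = RunSummary(model="model-a", total_cost_usd=12.0, repaired_failures=4)
    cheap = RunSummary(model="model-b", total_cost_usd=2.0, repaired_failures=4)
    print(cost_per_repaired_failure(expensive))
    print(cost_per_repaired_failure(cheap))
    print(cost_ratio(expensive, cheap))
```

Normalising by repaired failures rather than by attempted failures means that two models with similar efficacy, as reported here, are separated almost entirely by their pricing and token usage.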