Self-Admitted Technical Debt Detection and Management in Issue Tracker Systems

In software development, the Technical Debt metaphor describes a compromise made to deliver short-term goals in a way that can negatively affect the health and maintainability of products in the long term. Issue-based Self-Admitted Technical Debt (SATD-I) is a branch of Technical Debt that refers to debt acknowledged and reported by software developers in issue tracker systems. The purpose of this work is to identify SATD-I (Code debt, Documentation debt, and Test debt) from text, using Natural Language Processing techniques, and to analyze how developers deal with it, by studying who resolves debt, how much debt is resolved, and how long it takes to resolve it. To do this, two different Machine Learning models were used and their results compared. Using 972 manually classified issues, we trained and validated an SVM model and a Logistic Regression model, obtaining F1 scores of 0.678 for the former and 0.7722 for the latter. Because the Logistic Regression model performed better, we extracted from it a set of 1500 weighted tokens that shows how SATD-I can be identified. We then used this model to classify a dataset of 2.3M issues covering more than 20 years of development of Mozilla and Apache projects, and used the resulting data to compare how developers approach SATD-I with how they approach the remaining issues. The results show that in most products between 14.8% and 23.3% of issues are Technical Debt issues, that Code debt is often paid within the first week after it is reported, that the percentage of paid debt is between 68.86% and 93.74%, and that the creator of an issue is also its resolver in between 43.54% and 93.08% of cases.
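As an illustration of how weighted tokens can be read out of a Logistic Regression classifier, the following is a minimal sketch and not the pipeline used in this work. It assumes a plain bag-of-words representation, whitespace tokenization, and a small hand-written gradient-descent trainer; the toy issues, the function names (`tokenize`, `train`, `top_tokens`), and the hyperparameters are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy data: (issue text, label), where 1 = SATD-I and 0 = other.
ISSUES = [
    ("temporary hack in parser needs refactoring", 1),
    ("missing unit tests for the cache module", 1),
    ("documentation is outdated for the config api", 1),
    ("clean up ugly workaround in build script", 1),
    ("crash when opening large file", 0),
    ("add dark mode to the settings page", 0),
    ("login button does not respond on mobile", 0),
    ("support export to csv format", 0),
]


def tokenize(text):
    """Lowercase and split on whitespace (an assumed, deliberately simple tokenizer)."""
    return text.lower().split()


def train(issues, epochs=200, lr=0.5):
    """Fit a bag-of-words logistic regression with stochastic gradient descent."""
    vocab = sorted({tok for text, _ in issues for tok in tokenize(text)})
    weights = {tok: 0.0 for tok in vocab}
    bias = 0.0
    for _ in range(epochs):
        for text, label in issues:
            counts = Counter(tokenize(text))
            z = bias + sum(weights[tok] * n for tok, n in counts.items())
            prob = 1.0 / (1.0 + math.exp(-z))  # sigmoid
            error = label - prob               # gradient of the log-likelihood
            bias += lr * error
            for tok, n in counts.items():
                weights[tok] += lr * error * n
    return weights, bias


def top_tokens(weights, k):
    """Return the k tokens with the largest positive weight."""
    return sorted(weights, key=weights.get, reverse=True)[:k]


weights, bias = train(ISSUES)
for tok in top_tokens(weights, 5):
    print(f"{tok}: {weights[tok]:.3f}")
```

Tokens with large positive weights push an issue towards the SATD-I class, while tokens with negative weights push it away. Ranking the vocabulary by weight is one way such a list of weighted tokens could be obtained from a trained linear model.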

In software development, the Technical Debt metaphor denotes a compromise made to achieve short-term goals in a way that can negatively affect the health and maintainability of software products in the long term. Issue-based Self-Admitted Technical Debt (SATD-I) is a sub-category of Technical Debt that denotes debt acknowledged by developers and reported in issue tracking systems. The goal of this work is to identify SATD-I (specifically code debt, documentation debt, and test debt) from text, using Natural Language Processing techniques, and to analyze how developers behave towards it, by studying who resolves technical debt issues, how many of these issues are resolved, and how much time is needed to resolve them. To answer these questions, two Machine Learning models were used and their results compared. With 972 manually classified issues, we trained and validated a model based on a Support Vector Machine and one based on Logistic Regression, obtaining an F1 score of 0.678 for the former and 0.7722 for the latter. Next, because of its better performance, we extracted a set of 1500 tokens from the trained Logistic Regression model, to show how SATD-I can be identified and to explain the model's results. We then classified 2.3M issues covering more than 20 years of development of Apache and Mozilla projects, using these data to understand how developers approach SATD-I compared with other types of issues. The results showed that in most products between 14.8% and 23.3% of issues identify Technical Debt, that code debt is often paid within the first week after it is reported, that the percentage of paid debt is between 68.86% and 93.74%, and that the users who create these issues are also the ones who resolve them in between 43.54% and 93.08% of cases.
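The repayment figures reported above (share of debt issues, share of paid debt, debt paid in the first week, and debt resolved by its creator) can be illustrated with a short sketch. This is not the analysis code of this work: the `Issue` record, its field names, the sample data, and the choice of denominators (first-week and self-resolved shares are taken over paid debt issues) are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Issue:
    """Hypothetical issue record; the field names are illustrative only."""
    creator: str
    is_debt: bool                      # label assigned by the classifier
    created: date
    resolver: Optional[str] = None     # None while the issue is unresolved
    resolved: Optional[date] = None


def percent(part, whole):
    """Percentage of `part` within `whole`, or 0.0 for an empty denominator."""
    return 100.0 * len(part) / len(whole) if whole else 0.0


def summarize(issues):
    """Compute repayment metrics over a list of classified issues."""
    debt = [i for i in issues if i.is_debt]
    paid = [i for i in debt if i.resolved is not None]
    # Assumption: "first week" means resolved at most 7 days after creation.
    first_week = [i for i in paid if (i.resolved - i.created).days <= 7]
    self_resolved = [i for i in paid if i.resolver == i.creator]
    return {
        "debt_pct": percent(debt, issues),
        "paid_pct": percent(paid, debt),
        "first_week_pct": percent(first_week, paid),
        "self_resolved_pct": percent(self_resolved, paid),
    }


# Hypothetical sample: three debt issues (two of them paid) and one other issue.
SAMPLE = [
    Issue("ann", True, date(2020, 1, 1), "ann", date(2020, 1, 4)),
    Issue("bob", True, date(2020, 1, 1), "eve", date(2020, 3, 1)),
    Issue("cat", True, date(2020, 2, 1)),
    Issue("dan", False, date(2020, 2, 1), "dan", date(2020, 2, 2)),
]

for name, value in summarize(SAMPLE).items():
    print(f"{name}: {value:.2f}%")
```

Computing the same summary separately per product would make it possible to compare projects, and computing it over non-debt issues would give a baseline for comparison.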