Explainable agglomerative clustering tree: ExACt
OUAHIDI, YASSINE
2023/2024
Abstract
As Machine Learning (ML) models continue to sit at the epicenter of many decision-making tasks, the need for explainable and human-comprehensible results has never been more pressing. In many real-world applications, such as healthcare, fraud detection, and regulatory compliance, human-interpretable results are of the utmost importance for building and maintaining trust. For instance, in finance, if a system blocks a transaction, there must be clear reasons behind the decision. While numerous studies have addressed the interpretability of supervised models such as artificial neural networks, explainability in clustering tasks is often left behind. In this thesis we present ExACt (Explainable Agglomerative Clustering Tree): a decision-tree-based algorithm that builds upon the dendrogram structure produced by agglomerative clustering and leverages decision-tree induction algorithms, which are interpretable by design, to provide meaningful explanations. ExACt builds a decision tree for every level of the dendrogram and then simplifies it with respect to the original clustering to obtain a more compact solution. The key idea is that, by traversing the dendrogram structure, clusters take on simpler shapes that allow for more meaningful axis-parallel splits; simplifying these splits then leads to more compact solutions. We also define an explainability metric, the "explainability index", that balances the fit and the compactness of the output tree and guides the algorithm in selecting the best tree among those produced at the various dendrogram levels. We test the final solution on both real and synthetic datasets, showing the benefits of our approach compared to standard tree-building algorithms. Finally, we apply ExACt to a real-world scenario of regulatory compliance and Anti-Money Laundering (AML): finding and explaining the "paper companies" phenomenon, i.e., companies that exist only on paper and present similar characteristics in their financial statements.
With ExACt, we offer an exploratory tool that helps explain cluster assignments and improves overall trust in the clustering framework.
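The pipeline the abstract describes (cut the hierarchy at each level, fit a decision tree to the resulting flat clustering, and pick the tree with the best fit/compactness trade-off) can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the thesis's implementation: the score below (training accuracy divided by leaf count) is only a hypothetical stand-in for the actual explainability index, and cutting the dendrogram is approximated by re-running agglomerative clustering at several cluster counts.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with a few blob-shaped clusters.
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

best_tree, best_score = None, -np.inf
# Approximate "one tree per dendrogram level" by cutting the hierarchy
# at several cluster counts and fitting a tree to each flat clustering.
for k in range(2, 8):
    labels = AgglomerativeClustering(n_clusters=k).fit_predict(X)
    tree = DecisionTreeClassifier(random_state=0).fit(X, labels)
    fit = tree.score(X, labels)              # agreement with the clustering
    compactness = 1.0 / tree.get_n_leaves()  # fewer leaves = simpler tree
    score = fit * compactness                # illustrative trade-off only
    if score > best_score:
        best_tree, best_score = tree, score

print(best_tree.get_n_leaves(), round(best_score, 3))
```

The selected tree's axis-parallel splits (feature thresholds on root-to-leaf paths) are what make the cluster assignments human-readable; the actual ExACt algorithm additionally simplifies each tree with respect to the original clustering.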
| File | Description | Size | Format | Access |
|---|---|---|---|---|
| ExACt_Executive_Summary.pdf | ExACt Executive Summary | 1.05 MB | Adobe PDF | Authorized users only from 12/03/2026 |
| ExACt_Thesis.pdf | ExACt Thesis | 2.3 MB | Adobe PDF | Authorized users only from 12/03/2026 |
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/235890