Exploring redundancy and shared representations for transformer models optimization
SIMIONATO, ENRICO
2023/2024
Abstract
Large language models have revolutionized natural language processing, enabling the deployment of advanced AI systems across a vast range of domains. Their rapid advances in language understanding and reasoning make them powerful tools, but their growing size leads to challenges in terms of efficiency, cost, and accessibility. Optimizing LLMs to minimize resource consumption is essential to ensure their sustainable use. This thesis explores structural and weight redundancies in Transformer-based architectures, aiming to identify inefficiencies and leverage them through targeted compression techniques. A central focus is assessing whether different modules perform overlapping functions. Although some degree of similarity is observed in the analyzed cases, redundancy proves to be lower than expected, challenging the assumption that weight matrices can be interchanged across layers without compromising performance. Additionally, an analysis of model matrices examines whether they exhibit an inherently low-rank structure. To further explore these aspects, three novel compression methods are introduced: MASS, which enforces weight aggregation and sharing; GlobaL Fact, which factorizes matrices using shared components; and ABACO, which provides low-rank approximations through a continuous compression process. Experimental results indicate that while these approaches reduce the number of parameters, their limited ability to preserve performance hinders their practical viability. The findings highlight the complexity of extracting redundancy from Transformer architectures, raising questions about its extent across layers and blocks. By addressing these challenges, this thesis aims to contribute to ongoing efforts to improve the efficiency of LLMs.
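One question the abstract raises, whether Transformer weight matrices are inherently low rank, can be probed with a simple spectral check. The sketch below is illustrative only and rests on its own assumptions (GPT-2 loaded through Hugging Face transformers, a 99% spectral-energy threshold, and a helper named `effective_rank`); it is not the pipeline behind MASS, GlobaL Fact, or ABACO.

```python
import torch
from transformers import AutoModel

# Load a small pretrained Transformer. GPT-2 is used here purely as a
# placeholder; the thesis may analyze different (and larger) models.
model = AutoModel.from_pretrained("gpt2")

def effective_rank(weight: torch.Tensor, energy: float = 0.99) -> int:
    """Smallest number of singular values capturing `energy` of the squared spectral mass."""
    s = torch.linalg.svdvals(weight.float())
    cumulative = torch.cumsum(s ** 2, dim=0) / torch.sum(s ** 2)
    # Index of the first cumulative value reaching the threshold (+1 to count it).
    return min(int(torch.searchsorted(cumulative, energy).item()) + 1, s.numel())

# Compare the effective rank of every 2-D weight matrix with its full rank:
# a large gap would indicate structure that low-rank factorization could exploit.
for name, param in model.named_parameters():
    if param.ndim == 2 and "weight" in name:
        r = effective_rank(param.detach())
        print(f"{name}: effective rank {r} / {min(param.shape)}")
```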
| File | Description | Size | Format | Access |
|---|---|---|---|---|
| 2025_04_Simionato.pdf | Thesis | 23.15 MB | Adobe PDF | Openly accessible online |
| 2025_04_Simionato_Executive_Summary.pdf | Executive Summary | 691.25 kB | Adobe PDF | Openly accessible online |
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/235609