Precision vs. efficiency: a bit-level variable-length floating-point approach to neural network quantization

In recent years, the availability of large datasets on which new and increasingly complex deep learning models can be trained has driven an exponential increase in the size of these models, measured in the number of weights. For Large Language Models (LLMs), which have recently had a disruptive impact, the number of model parameters has increased by about 10x each year since 2018. In this context, the question arises whether these models can be optimized to make them more compact, reducing the impact that training and usage have on the required computational resources, and consequently on power consumption and associated costs. This thesis investigates quantization, one of the most widely used techniques for this purpose. In particular, it investigates how effective a bit-level variable-length floating-point quantization can be in reducing the number of bits used to represent weight and activation tensors while maintaining the predictive accuracy of the model. A methodology oriented toward lossless quantization, which numerically studies the distribution of weight and activation tensors, is presented together with the implementation of a variable-length floating-point quantizer. To test its performance, the results of four experiments in which the methodology is applied to a recurrent neural network are presented. Leveraging post-training quantization (PTQ) and quantization-aware training (QAT) techniques, it is shown that floating-point quantization applied to a recurrent neural network can reduce the number of bits used for weight and activation tensors by about 3.5x, with a loss of top-1 prediction accuracy of only 0.5%.
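To make the idea of a variable-length floating-point format concrete, the following is a minimal sketch and not the quantizer implemented in the thesis. It assumes a hypothetical format with one sign bit, `exp_bits` exponent bits, and `man_bits` mantissa bits, an IEEE-style bias, subnormal values, saturation instead of infinities, and round-to-nearest-even. The names `quantize_float` and `smallest_format`, as well as the absolute-error criterion used to choose a format, are illustrative assumptions.

```python
import math


def quantize_float(x: float, exp_bits: int, man_bits: int) -> float:
    """Round x to the nearest value of a hypothetical 1/exp_bits/man_bits format."""
    if x == 0.0 or math.isnan(x):
        return x
    bias = 2 ** (exp_bits - 1) - 1           # IEEE-style bias (assumption)
    min_exp = 1 - bias                       # smallest normal exponent
    max_exp = (2 ** exp_bits - 1) - bias     # no codes reserved for inf/NaN (assumption)
    max_val = (2.0 - 2.0 ** -man_bits) * 2.0 ** max_exp
    sign = -1.0 if x < 0 else 1.0
    mag = abs(x)
    if mag >= max_val:
        return sign * max_val                # saturate out-of-range values
    # frexp gives mag = m * 2**e with m in [0.5, 1), so floor(log2(mag)) = e - 1.
    # Clamping to min_exp makes small values fall on the subnormal grid.
    e = max(math.frexp(mag)[1] - 1, min_exp)
    step = 2.0 ** (e - man_bits)             # spacing of representable values
    q = round(mag / step) * step             # Python's round() is ties-to-even
    return sign * min(q, max_val)


def smallest_format(values, max_abs_err, max_exp_bits=8, max_man_bits=23):
    """Search for the narrowest (width, exp_bits, man_bits) within max_abs_err."""
    best = None
    for e in range(1, max_exp_bits + 1):
        for m in range(0, max_man_bits + 1):
            ok = all(abs(v - quantize_float(v, e, m)) <= max_abs_err for v in values)
            if ok:
                width = 1 + e + m            # sign + exponent + mantissa bits
                if best is None or width < best[0]:
                    best = (width, e, m)
                break                        # first passing m is the narrowest for this e
    return best
```

For example, `quantize_float(0.1, 4, 3)` returns `0.1015625`, the nearest value on an 8-bit grid with 4 exponent and 3 mantissa bits. Calling `smallest_format` on the values of a tensor with a tolerance of `0.0` looks for a format that represents those values exactly, which is one simple way to read "lossless" here; the criterion actually used in the thesis may differ.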
