Privacy-preserving transformer encoder for automatic speech recognition

Nowadays, services based on Machine Learning and Deep Learning models have become pervasive, raising concerns about users’ privacy and the management of their data. This is particularly true when the processed data contain sensitive personal information that can be misused, such as acoustic features. This study aims to develop a privacy-preserving transformer-based encoder for Automatic Speech Recognition (ASR), used to extract an acoustic context from spoken audio recordings. The model ensures the user’s anonymity by concealing the input values and the results through homomorphic encryption techniques. The extracted acoustic context can then be passed to a decoder to complete the ASR process and derive the audio transcription. The work addresses two parallel challenges. On the one hand, due to the limitations of homomorphic encryption, it is first necessary to develop an encoder that operates solely with additions and products: the starting point is Whisper, the current state of the art for end-to-end speech recognition, whose encoder is modified and approximated while preserving as much of its accuracy as possible. On the other hand, the resulting encoder is integrated with a homomorphic encryption scheme, for which a new packing method is proposed, specifically designed to be compatible with transformers, with the aim of reducing memory requirements and significantly increasing the speed of the inference process. In addition to the final solution, a key result of this work is the successful application of the proposed packing method, which makes it possible to work with matrices of dimensions that are generally considered prohibitive when using homomorphic encryption.
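To illustrate what "operating solely with additions and products" can mean in practice, the following minimal sketch replaces a non-polynomial activation with a polynomial. It is not the approximation used in this work, which the abstract does not detail. It assumes Chebyshev interpolation of GELU (the activation found in Whisper's encoder), an input range of [-5, 5] and degree 16, all chosen purely for illustration. The function names are hypothetical, and plain floats stand in for encrypted values; no encryption is performed.

```python
import math


def gelu(x):
    """Exact GELU; it relies on erf, which cannot be evaluated directly under HE."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def chebyshev_coeffs(f, degree, a, b):
    """Chebyshev interpolation coefficients of f on [a, b].

    Computed once, in the clear, before any encrypted inference
    (illustrative assumption, not the method of the source).
    """
    n = degree + 1
    thetas = [math.pi * (k + 0.5) / n for k in range(n)]
    # Sample f at the Chebyshev nodes mapped from [-1, 1] to [a, b].
    values = [f(0.5 * (b - a) * math.cos(t) + 0.5 * (b + a)) for t in thetas]
    coeffs = [
        (2.0 / n) * sum(v * math.cos(j * t) for v, t in zip(values, thetas))
        for j in range(n)
    ]
    coeffs[0] *= 0.5
    return coeffs


def eval_poly(coeffs, x, a, b):
    """Clenshaw recurrence: x is only ever added, subtracted or multiplied."""
    # Affine map of x to [-1, 1]; scale and shift are plaintext constants.
    t = (2.0 / (b - a)) * x - (a + b) / (b - a)
    b1, b2 = 0.0, 0.0
    for c in reversed(coeffs[1:]):
        b1, b2 = 2.0 * t * b1 - b2 + c, b1
    return t * b1 - b2 + coeffs[0]


if __name__ == "__main__":
    lo, hi = -5.0, 5.0  # assumed input range
    coeffs = chebyshev_coeffs(gelu, 16, lo, hi)
    for x in (-2.0, 0.5, 3.0):
        print(x, gelu(x), eval_poly(coeffs, x, lo, hi))
```

In an encrypted setting, the multiplications and additions inside `eval_poly` would be carried out on ciphertexts, while the coefficients remain plaintext constants. The polynomial degree then trades accuracy against multiplicative depth, which is one reason the approximation has to be chosen with care.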

In this era of growing digitalization, services based on Machine Learning and Deep Learning models have become pervasive, raising concerns about users’ privacy and the management of their data. This is especially true when the processed data contain sensitive personal information that can be misused, such as voiceprints. This study sets out to develop a privacy-preserving transformer-based encoder that extracts the acoustic context for Automatic Speech Recognition (ASR). The encoder guarantees the user’s anonymity by concealing the input voiceprints and the results of the computations through homomorphic encryption techniques. The acoustic context obtained in this way can then be used by a decoder to complete the ASR process and derive the transcription of the audio. The work tackles two parallel challenges. On the one hand, due to the limitations of homomorphic encryption, it is first necessary to develop an encoder that works solely with additions and products: the starting architecture is Whisper, the current state of the art for end-to-end speech recognition, whose encoder is modified and approximated with a focus on losing as little accuracy as possible compared with the original. On the other hand, the encoder obtained in this way is integrated with a homomorphic encryption scheme, for which a new packing method is proposed, designed specifically to be compatible with transformers, with the aim of reducing memory requirements and drastically increasing the speed of the inference process. Beyond the final solution, a key result of this work is the successful application of the proposed packing method, which makes it possible to work on matrices whose dimensions are generally considered prohibitive when homomorphic encryption is used.