Robustness evaluation of language models through adversarial attack strategies

The last few years have seen the emergence of innovative models in Natural Language Processing (NLP) that achieve state-of-the-art results on many standard benchmarks, such as GLUE and SQuAD. Most of these successes belong to pre-trained language models, which reuse the same base structure, intended to capture the general characteristics of human language, across different tasks. This approach not only avoids the high computational cost of training from scratch, but also makes it possible to build models that solve a specific task effectively without large amounts of annotated data. However, their deep and complex architectures behave like black boxes and are hard to interpret, which makes in-depth analysis difficult: recent studies have revealed serious intrinsic vulnerabilities that can be exposed with strategically modified inputs, called adversarial examples, which alter the models' normal behavior. This research evaluates the robustness of two important NLP models, BERT from Google and RoBERTa from Facebook AI. Applying elementary transformations to the input sentences, such as changing the word order or inserting synonyms, makes it possible to better understand which features of the language the models rely on for their predictions. The study also proposes a simple evolutionary algorithm that generates sentences capable of fooling the target model in different scenarios and that, thanks to its flexibility, can be adopted for any NLP task. The experiments highlight the main differences between the two models and indicate the superiority of RoBERTa, which is nevertheless not immune to effective attacks.

The last few years have seen the emergence of state-of-the-art models in Natural Language Processing, with increasingly strong results on benchmarks such as GLUE and SQuAD. These successes have been achieved mainly by pre-trained language models, which share the same base structure, able to capture general characteristics of human language, across different tasks. This approach not only avoids the high computational cost of the training process, but also allows the creation of models that perform a specific function effectively without requiring a huge amount of annotated data. Their complex, highly indecipherable architectures, however, make in-depth analysis difficult: recent studies have brought to light serious vulnerabilities, identifiable only through strategically modified inputs that alter the models' normal behavior. This research aims to analyze the robustness of two important language models, BERT from Google and RoBERTa from Facebook AI. Through simple transformations of the input sentences, such as changing the word order or inserting synonyms, it is possible to better understand which characteristics of the language are used to predict the output. The study also proposes an evolutionary algorithm for generating sentences capable of fooling a language model in different scenarios, which can be used for any NLP task thanks to its flexibility. These experiments highlighted the major differences between the two models, marking the superiority of RoBERTa, which is nevertheless not immune to effective attacks.
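
The abstract does not specify how the evolutionary algorithm works, so the following is only a minimal sketch of the general idea, not the study's implementation. It assumes a population of sentence variants that are mutated with the two elementary transformations named above (word reordering, and synonym insertion read here as word-for-synonym substitution) and selected by how far they lower a model's score. The names `SYNONYMS`, `toy_positive_score`, `mutate`, and `evolve_attack` are hypothetical, and the toy scoring function merely stands in for a real model such as BERT or RoBERTa.

```python
import random

# Hypothetical synonym table; a real attack would draw on a lexical resource.
SYNONYMS = {
    "good": ["fine", "decent"],
    "great": ["notable", "big"],
    "movie": ["film", "picture"],
}

POSITIVE_WORDS = {"good", "great"}


def toy_positive_score(tokens):
    """Stand-in for the victim model: fraction of tokens in a positive-word list."""
    if not tokens:
        return 0.0
    return sum(t in POSITIVE_WORDS for t in tokens) / len(tokens)


def mutate(tokens, rng):
    """Apply one elementary transformation: substitute a synonym or swap two words."""
    child = list(tokens)
    replaceable = [i for i, t in enumerate(child) if t in SYNONYMS]
    if replaceable and rng.random() < 0.5:
        i = rng.choice(replaceable)
        child[i] = rng.choice(SYNONYMS[child[i]])
    elif len(child) > 1:
        i, j = rng.sample(range(len(child)), 2)
        child[i], child[j] = child[j], child[i]
    return child


def evolve_attack(sentence, score_fn, generations=30, population_size=20,
                  keep=5, seed=0):
    """Search for a variant of `sentence` that minimises `score_fn`."""
    rng = random.Random(seed)
    population = [sentence.split()]
    for _ in range(generations):
        # Mutate randomly chosen survivors, then keep the lowest-scoring candidates.
        children = [mutate(rng.choice(population), rng)
                    for _ in range(population_size)]
        population = sorted(population + children, key=score_fn)[:keep]
    best = population[0]
    return " ".join(best), score_fn(best)


if __name__ == "__main__":
    adversarial, score = evolve_attack("a good movie with a great cast",
                                       toy_positive_score)
    print(adversarial, score)
```

Because the survivors are carried over unchanged into the next generation, the best score can never get worse than that of the original sentence. The search only calls the scoring function and never inspects the model itself, so swapping `toy_positive_score` for the output probability of a real classifier would leave the loop unchanged, which is one way such an approach could stay task-agnostic.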