Transformers for question difficulty estimation from text

Question Difficulty Estimation (QDE), also referred to as question calibration, is an important task in education. The knowledge level of students, also called skill, can be estimated from the correctness of their answers to exam questions together with the difficulty of those questions. An accurate estimate of question difficulty can also be used to provide students with exercises suited to their skill level. The conventional approaches to question calibration are manual calibration and pretesting. In manual calibration, one or more domain experts assign each question a numerical value representing its difficulty, which is intrinsically subjective. In pretesting, questions are administered to students in a real test scenario, and the difficulty is then estimated from the correctness of their answers. Pretesting introduces a long delay between the time a question is generated and the time it can be used to score students. Recent research has tried to overcome this issue by estimating the difficulty of questions from their text alone, using Natural Language Processing (NLP) techniques such as neural models or bag of words. The aim is to reduce (or eliminate) the need for manual calibration and pretesting, since the text is available as soon as the question is created. Pre-trained language models, especially Transformers, have led to impressive gains on several NLP tasks, but no previous work has explored their use for question calibration. In this work, we study how Transformer models (specifically, BERT and DistilBERT) compare with the current state of the art in the task of QDE from text, and propose a model that outperforms previous approaches. Our model is trained on the text of questions and their difficulty, but can optionally take advantage of an additional corpus of domain-related documents to improve performance.
Tests on two different real-world datasets, one public and one private, show that our model reduces the Root Mean Square Error (RMSE) of previous baselines by up to 6.5% and confirm our intuition about the effectiveness of Transformer-based models for QDE from text. Furthermore, we analyze which characteristics of the questions (such as the length of the text and the presence of numbers) influence the prediction error.
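
As a reading aid for the evaluation metric, the following minimal sketch shows how RMSE and a relative RMSE reduction could be computed. It assumes that the reported reduction is relative to the baseline RMSE, which the text does not state explicitly. The function names and the difficulty values are hypothetical and chosen only for illustration; they are not taken from the datasets used in this work.

```python
import math


def rmse(true_difficulty, predicted_difficulty):
    """Root Mean Square Error between two equal-length sequences of numbers."""
    if len(true_difficulty) != len(predicted_difficulty):
        raise ValueError("sequences must have the same length")
    if len(true_difficulty) == 0:
        raise ValueError("sequences must not be empty")
    squared_errors = [
        (t - p) ** 2 for t, p in zip(true_difficulty, predicted_difficulty)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def relative_reduction(baseline_rmse, model_rmse):
    """Fraction by which model_rmse is lower than baseline_rmse (assumed definition)."""
    return (baseline_rmse - model_rmse) / baseline_rmse


# Made-up difficulties and predictions, for illustration only.
true_difficulty = [0.0, 1.0, 2.0, 3.0]
baseline_predictions = [1.0, 2.0, 3.0, 4.0]
model_predictions = [0.5, 1.5, 2.5, 3.5]

baseline_rmse = rmse(true_difficulty, baseline_predictions)
model_rmse = rmse(true_difficulty, model_predictions)
print(f"baseline RMSE: {baseline_rmse:.3f}")
print(f"model RMSE: {model_rmse:.3f}")
print(f"relative reduction: {relative_reduction(baseline_rmse, model_rmse):.1%}")
```

Under this assumed definition, a model with an RMSE of 0.935 against a baseline RMSE of 1.0 would correspond to a 6.5% reduction.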
