An assessment of recent techniques for question difficulty estimation from text

In the educational domain, question difficulty estimation is the task of estimating a numerical or categorical value that represents the difficulty of an exam question. It is traditionally performed with manual calibration or pretesting, both of which have limitations: manual calibration is subjective, and pretesting introduces a long delay between the creation of a question and the moment it can be used to assess students. Recent research has tried to overcome these shortcomings by leveraging Natural Language Processing techniques to estimate question difficulty from the text of the questions alone, which is the only information that is always available when a question is created. Specifically, research has proceeded along two main directions, supervised and unsupervised approaches, each with its own advantages and limitations. This thesis surveys previous literature in both directions and evaluates several models, including novel approaches, on real-world datasets from different educational domains. The experimental results show that model accuracy depends heavily on the characteristics of the questions under consideration and, most importantly, on the educational domain: simple models based on readability indexes and linguistic measures are generally fairly accurate on reading comprehension questions, whereas calibrating questions that assess domain knowledge requires more advanced models based on the attention mechanism and Transformers.

In education, question difficulty estimation consists of estimating a value, numerical or categorical, that represents the difficulty of a given exam question. Traditionally, this is done manually or with pretesting, and both approaches have drawbacks: the first is subjective, and the second introduces a long delay between the creation of the questions and the moment they can be used to assess students. In recent years, a body of research has tried to overcome these drawbacks by using Natural Language Processing techniques to perform question difficulty estimation with only the textual content of the questions as input, which is the only information that is always available when a new question is created. Specifically, research has followed two main directions, supervised and unsupervised approaches, each with specific advantages and drawbacks. In this thesis, we present the literature proposed in both directions and evaluate several models, some of which are proposed here for the first time, using experimental datasets from different domains (mathematics, information technology, and the English language). The experimental results show that model accuracy depends significantly on the characteristics of the specific questions and, above all, on the domain: simple models based on readability indexes and linguistic measures are generally accurate on reading comprehension questions, while estimating the difficulty of questions that assess domain knowledge requires more advanced models based on the attention mechanism and Transformers.