Don't overthink it: intermittent self-evaluation in reasoning language models playing textual games

Reasoning Language Models (RLMs) have remarkable problem-solving capabilities that bring them closer to human performance than standard LLMs, but they also acquire two traits typical of human agents: longer response times and a heightened risk of overthinking. We choose the text-based games of TextWorld as a comprehensive example of a complex task environment, and present two novel, related techniques that counteract high response time and overthinking: n-think and ephemerality. N-think models reason only once every n turns; in that turn, they follow a self-evaluation prompt that increases context awareness, recall, and performance, while in all other turns reasoning is deactivated and inference time is thus minimized. Ephemeral n-think models, in addition, do not retain their thought process in the context once the self-evaluation turn ends, but only their final response, so that the game content is not diluted by excessive thinking. These techniques curtail answer length and context length respectively, two critical factors that slow down inference and raise the risk of overthinking. The ephemeral 1-think configuration achieves the highest performance by drastically reducing overthinking, whereas at higher values of n the impact of ephemerality is either negative or negligible. Non-ephemeral n-think with low n (e.g. 4) is also a promising configuration that noticeably reduces execution time at the cost of a small decrease in score. We then implement two successful improvements to the n-think technique, namely random n-think and Chain-of-Thought-based self-evaluation; perform a qualitative analysis of the behaviors and patterns exhibited during self-evaluation turns with and without CoT; and finally identify future developments of the framework, such as ask-to-think, dynamic n-think, semi-ephemerality, and an application to real-time games.

Reasoning language models (RLMs) have remarkable problem-solving abilities that bring them even closer to human performance than standard language models (LLMs), while acquiring two traits typical of human agents themselves: a longer response time and a greater tendency to overthink. We choose the text adventures of TextWorld as a comprehensive example of a complex task, and present two novel, related techniques that counteract excessive response times and reasoning: n-think and ephemerality. N-think models reason only one turn in every n and, in that turn, follow a self-evaluation prompt that increases context awareness, memory, and performance; in all other turns, reasoning is disabled and inference time is therefore minimized. Ephemeral models, by contrast, do not keep their reasoning in the context once the self-evaluation turn is over, but only their final response, which avoids diluting the game context with excessive reasoning. These techniques reduce response length and context length respectively, two critical components that otherwise slow down inference and contribute to the risk of rumination. The ephemeral 1-think configuration shows the best performance because it drastically reduces excessive reasoning, whereas for larger values of n the impact of ephemerality is negative or negligible. Non-ephemeral n-think with low n (e.g. 4) is another promising configuration that markedly reduces execution time with a small decrease in score.
We then make two improvements to the n-think technique with good results, random n-think and self-evaluation with Chain-of-Thought; carry out a qualitative analysis of the behaviors and patterns that emerge during self-evaluation turns with and without CoT; and finally identify future developments for this framework, such as semi-ephemerality, dynamic or on-demand n-think, and application to real-time games.
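
The n-think schedule and the ephemerality rule described above can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the `Model` interface, the `Reply` type, the `stub_model` stand-in, the `GAME:`/`THOUGHT:`/`ACTION:` context labels, and the choice to reason on turns 0, n, 2n, ... are all assumptions made for illustration. Observations are scripted here, whereas in a real TextWorld episode each observation would depend on the previous action.

```python
from typing import Callable, NamedTuple


class Reply(NamedTuple):
    reasoning: str  # empty when reasoning is switched off
    action: str     # the game command the model settles on


# Hypothetical model interface: (context so far, reasoning enabled?) -> Reply
Model = Callable[[list[str], bool], Reply]


def should_think(turn: int, n: int) -> bool:
    """Enable reasoning on turns 0, n, 2n, ... (the starting phase is an assumption)."""
    return turn % n == 0


def play(model: Model, observations: list[str], n: int, ephemeral: bool) -> list[str]:
    """Run one episode over scripted observations and return the final context."""
    context: list[str] = []
    for turn, observation in enumerate(observations):
        context.append(f"GAME: {observation}")
        think = should_think(turn, n)
        reply = model(context, think)
        # Non-ephemeral: the thought process stays in the context.
        # Ephemeral: only the final response is kept.
        if think and not ephemeral:
            context.append(f"THOUGHT: {reply.reasoning}")
        context.append(f"ACTION: {reply.action}")
    return context


def stub_model(context: list[str], think: bool) -> Reply:
    """Stand-in for a real RLM call; always answers 'look'."""
    reasoning = "self-evaluation: review goal and progress" if think else ""
    return Reply(reasoning, "look")


if __name__ == "__main__":
    rooms = ["kitchen", "hallway", "garden", "cellar"]
    for ephemeral in (False, True):
        ctx = play(stub_model, rooms, n=2, ephemeral=ephemeral)
        print(f"ephemeral={ephemeral}: {len(ctx)} context entries")
```

In this sketch, a larger n reduces how often a long reasoning answer is generated, while the `ephemeral` flag only affects what is carried forward in the context; the two knobs are independent, which mirrors the distinction between answer length and context length drawn above.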