Do we really need GRPO? Rethinking reinforcement learning for reasoning in LLMs

Recent advancements in large language models (LLMs) have highlighted the critical role that post-training techniques play in enhancing reasoning and mathematical capabilities. Among these techniques, Group Relative Policy Optimization (GRPO) has demonstrated significant promise in encouraging complex reasoning behaviors. However, GRPO’s design, which combines group-relative advantage estimation, PPO-style clipping, and KL regularization, introduces considerable complexity. In this work, we investigate whether this complexity is indeed necessary for mathematical reasoning by systematically simplifying the GRPO loss function. Our findings emphasize the importance of including negative feedback during training, demonstrating that selectively reinforcing only actions that exceed a baseline can hinder learning. On the other hand, we observe that PPO-style constraints, such as policy ratio clipping, are not strictly necessary for improving mathematical performance or for facilitating the emergence of reasoning behaviors. Building upon these insights, we propose a simplified REINFORCE variant called RGRA, which omits the PPO-style clipping and policy ratio components, but retains GRPO’s group-relative advantage estimation. Experimental evaluations indicate that RGRA achieves enhanced performance on key mathematical benchmarks compared to the original GRPO method. Overall, our results suggest that simpler REINFORCE-based methods can effectively enhance mathematical capabilities and foster reasoning abilities in LLMs, reducing complexity without compromising performance.

Recent progress in large language models (LLMs) has highlighted the crucial role of post-training techniques in improving reasoning and mathematical capabilities. Among these techniques, Group Relative Policy Optimization (GRPO) has shown considerable potential in promoting the development of complex reasoning behaviors. However, GRPO's architecture, which combines group-relative advantage estimation, PPO-style clipping, and KL regularization, introduces considerable complexity. In this work, we investigate whether this complexity is actually necessary by systematically simplifying GRPO. Our results highlight the importance of including negative examples during training, showing that selectively reinforcing only the actions that exceed a baseline can hinder learning. Moreover, we observe that PPO-style constraints, such as clipping and the policy ratio, are not strictly necessary for improving mathematical performance or for facilitating the emergence of reasoning. Based on these observations, we propose a simplified REINFORCE-based variant called RGRA, which removes the clipping and policy ratio components typical of PPO but retains the advantage estimation introduced by GRPO. Experimental analyses indicate that RGRA achieves better performance than GRPO on mathematical benchmarks. Overall, our results suggest that simplified REINFORCE-based methods can effectively improve mathematical capabilities and foster the development of reasoning in LLMs, reducing complexity without compromising performance.
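
To make the combination described above concrete, the following is a minimal sketch of a REINFORCE-style loss with group-relative advantages and no policy ratio or clipping. It is an illustration under stated assumptions, not the exact RGRA objective: the function names, the standardization by the group standard deviation, the per-completion length normalization, and the omission of any KL term are all assumptions made here for brevity.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of completions for the same prompt.

    Assumption: advantage = (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def reinforce_group_loss(token_logprobs, rewards):
    """REINFORCE-style surrogate loss over one group of sampled completions.

    token_logprobs[i] holds the log-probabilities the current policy assigns
    to the tokens of completion i. There is no policy ratio and no clipping:
    each completion contributes its advantage times its mean token
    log-probability (length normalization is an assumption of this sketch).
    """
    advantages = group_relative_advantages(rewards)
    per_completion = [
        adv * mean(logps) for adv, logps in zip(advantages, token_logprobs)
    ]
    # Negate so that minimizing the loss raises the log-probability of
    # above-average completions and lowers that of below-average ones.
    return -mean(per_completion)


# Hypothetical group of four completions with binary correctness rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
token_logprobs = [[-0.1, -0.2], [-1.0, -2.0], [-0.5], [-0.3, -0.3]]
advantages = group_relative_advantages(rewards)
loss = reinforce_group_loss(token_logprobs, rewards)
```

In this sketch, completions with below-average reward receive negative advantages, so they still contribute a learning signal. Dropping those terms would correspond to reinforcing only the actions that exceed the baseline.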