Remarkably, the outputs generated by power sampling from the base model are on par with, if not better than, those produced by RL-posttraining, across a variety of reasoning tasks and base models. We use MATH500, HumanEval, and GPQA Diamond as benchmarks of difficult mathematics, coding, and science questions. We compare against a GRPO baseline (trained on the MATH dataset), the poster child for RL-posttraining, as well as the original base model itself. We also include AlpacaEval 2.0, a non-verifiable, general-helpfulness benchmark, to demonstrate that our approach applies beyond the verifiable regime.
In-domain (MATH500), power sampling comes surprisingly close to GRPO's performance without ever changing the base model's weights. Out-of-domain, power sampling can actually outperform GRPO, as demonstrated on HumanEval and AlpacaEval 2.0.
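For intuition, here is a minimal sketch of one way to approximate sampling from the power distribution p(y|x)^α over full completions: draw several completions from the base model, reweight each by p(y|x)^(α-1), and resample. This is a self-normalized importance-sampling approximation, not necessarily the exact procedure behind the numbers above; the model name, α, and candidate count below are illustrative assumptions.

```python
# Sketch: approximate sampling from the power distribution p(y|x)^alpha over
# full completions via self-normalized importance resampling.
# Assumptions (illustrative only): model choice, alpha=4.0, n_candidates=8.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B"  # hypothetical base model for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

def sequence_logprob(prompt_ids, completion_ids):
    """Sum of log p(token | prefix) over the completion, under the base model."""
    input_ids = torch.cat([prompt_ids, completion_ids]).unsqueeze(0).to(model.device)
    with torch.no_grad():
        logits = model(input_ids).logits[0].float()
    # logits[t] predicts token t+1, so score completion tokens with the preceding logits
    start = prompt_ids.shape[-1] - 1
    logprobs = torch.log_softmax(logits[start:-1], dim=-1)
    target = completion_ids.to(model.device)
    return logprobs.gather(-1, target.unsqueeze(-1)).sum().item()

def power_sample(prompt, alpha=4.0, n_candidates=8, max_new_tokens=512):
    """Draw n candidates y ~ p(.|x), reweight by p(y|x)^(alpha-1), resample one."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids[0]
    candidates, logps = [], []
    for _ in range(n_candidates):
        out = model.generate(prompt_ids.unsqueeze(0).to(model.device),
                             do_sample=True, max_new_tokens=max_new_tokens)
        completion_ids = out[0, prompt_ids.shape[-1]:].cpu()
        candidates.append(completion_ids)
        logps.append(sequence_logprob(prompt_ids, completion_ids))
    # Importance weights: w_i ∝ p(y_i|x)^alpha / p(y_i|x) = p(y_i|x)^(alpha-1)
    weights = torch.softmax((alpha - 1.0) * torch.tensor(logps), dim=0)
    idx = torch.multinomial(weights, 1).item()
    return tok.decode(candidates[idx], skip_special_tokens=True)
```

The key design point is that the exponent α acts at the level of whole completions rather than individual tokens, so it favors sequences the base model already assigns high overall likelihood, rather than simply lowering the sampling temperature.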
