PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models

1Fudan University, 2Soochow University, 3Shanghai AI Laboratory, 4Stony Brook University, 5The Chinese University of Hong Kong
Teaser Image

An overview of PRMBench. The left part illustrates our data curation procedure. The right part showcases demonstrations of our evaluation subjects and the relative performance of tested models, where green, yellow, and gray boxes indicate the simplicity, soundness, and sensitivity dimensions respectively, red circles mark erroneous steps, and green circles mark correct steps.

Tutorial Video

Leaderboard on PRMBench

PRM scores on PRMBench. Click on Simplicity, Soundness, or Sensitivity to expand detailed results for each category.

Interactive leaderboard table (columns: Model, Overall, Simplicity, Soundness, Sensitivity; filterable by PRMs, proprietary LLMs, and open LLMs).

Overall results of different models on the PRMBench leaderboard. The best-performing model in each category is highlighted in red, and the second best is underlined.


Task types: NR: Non-Redundancy, NCL: Non-Circular Logic, ES: Empirical Soundness, SC: Step Consistency, DC: Domain Consistency, CI: Confidence Invariance, PS: Prerequisite Sensitivity, DR: Deception Resistance, MS: Multi-Solution Consistency.
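
For reference, the sketch below shows one way step-level predictions could be aggregated into per-task-type scores like those in the table above. The tuple layout and the choice of F1 over erroneous steps are our own assumptions for illustration; PRMBench's official metric (PRMScore) may be computed differently.

# Illustrative aggregation of step-level predictions into per-task-type scores.
# The (task_type, predicted_label, gold_label) layout and the choice of F1 over
# erroneous steps are assumptions; the official PRMScore may differ.
from collections import defaultdict
from typing import Dict, List, Tuple

def f1(preds: List[int], golds: List[int], positive: int) -> float:
    """F1 score treating `positive` as the positive class."""
    tp = sum(p == positive and g == positive for p, g in zip(preds, golds))
    fp = sum(p == positive and g != positive for p, g in zip(preds, golds))
    fn = sum(p != positive and g == positive for p, g in zip(preds, golds))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def category_scores(results: List[Tuple[str, int, int]]) -> Dict[str, float]:
    """results holds (task_type, predicted_label, gold_label) per step,
    with label 0 marking an erroneous step. Returns F1 on erroneous
    steps for each task type (NR, NCL, ES, ...)."""
    by_cat: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for cat, pred, gold in results:
        by_cat[cat].append((pred, gold))
    return {
        cat: f1([p for p, _ in pairs], [g for _, g in pairs], positive=0)
        for cat, pairs in by_cat.items()
    }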

Analysis and Takeaways

We conducted an in-depth analysis of the results on PRMBench; here we present the most important findings and takeaways.

Detailed Classifications

The detailed classification of PRMBench. Here we display the hierarchical categories, corresponding descriptions, and illustrations of the PRMBench classification design.

Examples

An example of a data instance and error cases from PRMBench. The numbers reported are step-level validity scores generated by the models. Scores and labels in red indicate negative samples, while those in green indicate positive samples. “GT” represents the ground truth, while “ReasonE,” “MathS,” and “Gemini” correspond to ReasonEval-7B, Math-Shepherd-7B, and Gemini-2.0-flash-thinking, respectively.

Abstract

Process-level Reward Models (PRMs) are crucial for complex reasoning and decision-making tasks, where each intermediate step plays an important role in the reasoning process. Since language models are prone to various types of errors during reasoning, PRMs must possess nuanced capabilities for detecting the various implicit error types that arise in real-world scenarios. However, current benchmarks primarily focus on step correctness and fail to evaluate PRMs' performance systematically. To address this gap, we introduce PRMBench, a process-level benchmark specifically designed to assess the fine-grained error detection capabilities of PRMs. PRMBench comprises 6,216 carefully designed problems and 83,456 step-level labels, evaluating models across multiple dimensions, including simplicity, soundness, and sensitivity. In our experiments on 15 models, spanning both open-source PRMs and closed-source large language models prompted as critic models, we uncover significant weaknesses in current PRMs. These findings underscore the challenges inherent in process-level evaluation and highlight key directions for future research. We hope PRMBench can serve as a robust benchmark for advancing research on PRM evaluation and development.
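
As a rough illustration of the evaluation setup described above, the sketch below runs a PRM over benchmark-style instances and compares its step-level validity scores against the gold step labels. The field names and the `score_steps` interface are assumptions made for illustration, not the official PRMBench data schema or API, and the plain accuracy it returns is simpler than the metrics reported on the leaderboard.

# Minimal sketch of running a PRM over PRMBench-style instances.
# Field names and the `score_steps` interface are illustrative assumptions,
# not the official PRMBench data schema or API.
from dataclasses import dataclass
from typing import List

@dataclass
class PRMBenchInstance:
    question: str
    steps: List[str]         # candidate reasoning steps
    step_labels: List[int]   # 1 = correct step, 0 = erroneous step
    task_type: str           # e.g. "NR", "NCL", "ES", ...

def evaluate(prm, instances: List[PRMBenchInstance], threshold: float = 0.5) -> float:
    """Score every step with the PRM and compare against gold labels.
    Assumes `prm.score_steps(question, steps)` returns one validity
    score in [0, 1] per step. Returns step-level accuracy."""
    correct = total = 0
    for inst in instances:
        scores = prm.score_steps(inst.question, inst.steps)
        preds = [1 if s >= threshold else 0 for s in scores]
        correct += sum(p == g for p, g in zip(preds, inst.step_labels))
        total += len(inst.step_labels)
    return correct / total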

BibTeX

@article{song2025prmbench,
  title={PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models},
  author={Mingyang Song and Zhaochen Su and Xiaoye Qu and Jiawei Zhou and Yu Cheng},
  journal={arXiv preprint arXiv:2501.03124},
  year={2025},
  url={https://arxiv.org/pdf/2501.03124}
}