PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models

¹Fudan University, ²Soochow University, ³Shanghai AI Laboratory, ⁴Stony Brook University, ⁵The Chinese University of Hong Kong
[Teaser figure] An overview of PRMBench. The left part illustrates our data curation procedure; the right part shows demonstrations of our evaluation subjects and the relative performance of the tested models. Green, yellow, and gray boxes indicate the simplicity, soundness, and sensitivity categories, respectively; red circles mark erroneous steps and green circles mark correct steps.

Abstract

Process-level Reward Models (PRMs) are crucial for complex reasoning and decision-making tasks, where each intermediate step plays an important role in the reasoning process. Since language models are prone to various types of errors during reasoning, PRMs need nuanced capabilities for detecting the various implicit error types that arise in real-world scenarios. However, current benchmarks primarily focus on step correctness and fail to evaluate PRMs' performance systematically. To address this gap, we introduce PRMBench, a process-level benchmark specifically designed to assess the fine-grained error detection capabilities of PRMs. PRMBench comprises 6,216 carefully designed problems and 83,456 step-level labels, evaluating models across multiple dimensions, including simplicity, soundness, and sensitivity. In our experiments with 15 models, spanning both open-source PRMs and closed-source large language models prompted as critic models, we uncover significant weaknesses in current PRMs. These findings underscore the challenges inherent in process-level evaluation and highlight key directions for future research. We hope PRMBench can serve as a robust benchmark for advancing research on PRM evaluation and development.
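To give a concrete sense of what step-level evaluation involves, the minimal sketch below scores a PRM's per-step rewards against PRMBench-style step labels. The record fields, the 0.5 decision threshold, and the use of F1 on erroneous steps are illustrative assumptions; they are not the official PRMBench data format or the PRMScore metric.

```python
# Minimal sketch of step-level PRM evaluation, assuming a PRMBench-style
# record with per-step correctness labels. Field names and the metric below
# (F1 on the erroneous-step class) are illustrative, not the official PRMScore.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ProblemRecord:
    question: str
    steps: List[str]          # intermediate reasoning steps
    step_labels: List[int]    # 1 = correct step, 0 = erroneous step (assumed format)

def evaluate_prm(records: List[ProblemRecord],
                 prm_score_fn: Callable[[str, List[str]], List[float]]) -> float:
    """Flag a step as erroneous when the PRM's reward falls below a threshold,
    then report F1 on the erroneous class across all steps."""
    tp = fp = fn = 0
    threshold = 0.5  # assumed decision threshold on the PRM's step reward
    for rec in records:
        rewards = prm_score_fn(rec.question, rec.steps)  # one reward per step
        for reward, label in zip(rewards, rec.step_labels):
            predicted_error = reward < threshold
            actual_error = label == 0
            if predicted_error and actual_error:
                tp += 1
            elif predicted_error and not actual_error:
                fp += 1
            elif actual_error:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Toy usage with a dummy PRM that trusts every step:
dummy_prm = lambda question, steps: [1.0] * len(steps)
records = [ProblemRecord("2 + 2 * 3 = ?", ["2 * 3 = 6", "2 + 6 = 9"], [1, 0])]
print(evaluate_prm(records, dummy_prm))  # 0.0: the dummy PRM misses the erroneous step
```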

Leaderboard on PRMBench

| # | Model | Class | Source | Overall | NR. | NCL. | S1 | ES. | SC. | DC. | CI. | S2 | PS. | DR. | MS. | S3 |
|---|-------|-------|--------|---------|-----|------|----|-----|-----|-----|-----|----|-----|-----|-----|----|
| 1 | Gemini-2.0-thinking-exp-1219 🥇 | LM-C | Link | 68.8 | 68.5 | 63.8 | 66.2 | 72.9 | 71.3 | 71.0 | 71.8 | 71.8 | 60.3 | 65.7 | 99.8 | 75.3 |
| 2 | o1-mini 🥈 | LM-C | Link | 68.8 | 65.6 | 63.7 | 64.6 | 74.5 | 67.7 | 73.8 | 72.3 | 72.1 | 61.8 | 64.8 | 100.0 | 75.5 |
| 3 | GPT-4o 🥉 | LM-C | Link | 66.8 | 57.0 | 62.4 | 59.7 | 72.0 | 69.7 | 70.7 | 71.1 | 70.9 | 62.5 | 65.7 | 99.2 | 75.8 |
| 4 | Gemini-2.0-flash-exp | LM-C | Link | 66.0 | 67.2 | 58.1 | 62.7 | 70.4 | 65.7 | 66.0 | 67.3 | 67.3 | 61.8 | 66.2 | 98.2 | 75.4 |
| 5 | QwQ-Preview-32B | LM-O | Link | 63.6 | 57.2 | 55.6 | 56.4 | 67.4 | 72.3 | 66.2 | 66.9 | 68.2 | 57.8 | 62.7 | 100.0 | 73.5 |
| 6 | ReasonEval-34B | PRM | Link | 60.5 | 54.8 | 48.1 | 51.5 | 66.4 | 60.3 | 57.8 | 67.5 | 63.0 | 57.7 | 64.3 | 97.2 | 73.1 |
| 7 | ReasonEval-7B | PRM | Link | 60.0 | 61.0 | 50.1 | 55.6 | 62.1 | 65.9 | 61.5 | 65.9 | 63.8 | 55.6 | 57.9 | 99.5 | 71.0 |
| 8 | RLHFlow-PRM-Mistral-8B | PRM | Link | 54.4 | 46.1 | 47.3 | 46.7 | 56.6 | 55.1 | 54.4 | 63.8 | 57.5 | 51.5 | 56.2 | 97.9 | 68.5 |
| 9 | RLHFlow-PRM-Deepseek-8B | PRM | Link | 54.2 | 46.4 | 48.9 | 47.6 | 55.7 | 55.0 | 53.2 | 66.2 | 57.5 | 49.0 | 55.4 | 99.8 | 68.1 |
| 10 | MATHMinos-Mistral-7B | PRM | Link | 54.2 | 48.8 | 54.0 | 51.4 | 57.0 | 52.1 | 50.7 | 57.8 | 54.4 | 52.8 | 55.8 | 91.1 | 66.5 |
| 11 | Llemma-PRM800k-7B | PRM | Link | 52.0 | 49.3 | 53.4 | 51.4 | 56.4 | 47.1 | 46.7 | 53.3 | 50.9 | 51.0 | 53.5 | 93.6 | 66.0 |
| 12 | Llemma-MetaMath-7B | PRM | Link | 50.5 | 50.2 | 50.5 | 50.3 | 51.9 | 47.6 | 44.4 | 52.1 | 49.0 | 50.5 | 51.3 | 96.0 | 66.0 |
| 13 | Llemma-oprm-7B | PRM | Link | 50.3 | 48.7 | 49.3 | 49.0 | 54.2 | 46.8 | 44.5 | 53.5 | 49.8 | 49.2 | 51.3 | 91.8 | 64.1 |
| 14 | MathShepherd-Mistral-7B | PRM | Link | 47.0 | 44.0 | 50.3 | 47.1 | 49.4 | 44.5 | 41.3 | 47.7 | 45.7 | 47.2 | 48.6 | 86.1 | 60.7 |
| 15 | Skywork-PRM-7B | PRM | Link | 36.2 | 35.7 | 41.2 | 38.4 | 36.7 | 29.1 | 30.6 | 34.4 | 32.7 | 36.8 | 37.4 | 88.8 | 54.3 |
| 16 | Skywork-PRM-1.5B | PRM | Link | 31.7 | 31.4 | 35.8 | 33.6 | 32.4 | 25.7 | 26.0 | 30.2 | 28.6 | 33.1 | 32.3 | 81.1 | 48.8 |
Model Classes: PRM: Process-Level Reward Models, LM-C: Closed-Source Large Language Models, LM-O: Open-Source Large Language Models.
Task types: NR: Non-Redundancy, NCL: Non-Circular Logic, ES: Empirical Soundness, SC: Step Consistency, DC: Domain Consistency, CI: Confidence Invariance, PS: Prerequisite Sensitivity, DR: Deception Resistance, MS: Multi-Solution Consistency.
Evaluation Subjects (see the averaging sketch after these notes):
S1: Average PRMScore over the Simplicity evaluation subjects (NR., NCL.).
S2: Average PRMScore over the Soundness evaluation subjects (ES., SC., DC., CI.).
S3: Average PRMScore over the Sensitivity evaluation subjects (PS., DR., MS.).
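As a concrete reading of the notes above, the minimal sketch below averages per-subject PRMScores into the S1/S2/S3 columns, using the top row of the leaderboard as input. The grouping follows the notes; the Overall column is omitted because its aggregation is not specified here.

```python
# Minimal sketch: aggregate per-subject PRMScores into the S1/S2/S3 columns
# as defined in the notes above. Input values are taken from the table's top
# row; the Overall column is not reproduced because its aggregation is not
# defined in these notes.

SUBJECT_GROUPS = {
    "S1 (Simplicity)":  ["NR", "NCL"],
    "S2 (Soundness)":   ["ES", "SC", "DC", "CI"],
    "S3 (Sensitivity)": ["PS", "DR", "MS"],
}

def aggregate(scores: dict) -> dict:
    """Average the per-subject PRMScores within each evaluation-subject group."""
    return {
        group: round(sum(scores[s] for s in subjects) / len(subjects), 2)
        for group, subjects in SUBJECT_GROUPS.items()
    }

# Per-subject PRMScores for Gemini-2.0-thinking-exp-1219 (row 1 of the table).
gemini_scores = {
    "NR": 68.5, "NCL": 63.8,
    "ES": 72.9, "SC": 71.3, "DC": 71.0, "CI": 71.8,
    "PS": 60.3, "DR": 65.7, "MS": 99.8,
}
print(aggregate(gemini_scores))
# {'S1 (Simplicity)': 66.15, 'S2 (Soundness)': 71.75, 'S3 (Sensitivity)': 75.27}
# i.e. 66.2 / 71.8 / 75.3 in the table once rounded to one decimal.
```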

BibTeX

@article{song2025prmbench,
  title={PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models},
  author={Mingyang Song and Zhaochen Su and Xiaoye Qu and Jiawei Zhou and Yu Cheng},
  journal={arXiv preprint arXiv:2501.03124},
  year={2025},
  url={https://arxiv.org/pdf/2501.03124}
}