| # | Model | Class | Source | Overall | NR. | NCL. | S1 | ES. | SC. | DC. | CI. | S2 | PS. | DR. | MS. | S3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
1 | Gemini-2.0-thinking-exp-1219 🥇 | LM-C | Link | 68.8 | 68.5 | 63.8 | 66.2 | 72.9 | 71.3 | 71.0 | 71.8 | 71.8 | 60.3 | 65.7 | 99.8 | 75.3 |
2 | o1-mini 🥈 | LM-C | Link | 68.8 | 65.6 | 63.7 | 64.6 | 74.5 | 67.7 | 73.8 | 72.3 | 72.1 | 61.8 | 64.8 | 100.0 | 75.5 |
3 | Qwen2.5-Math-PRM-72B 🥉 | PRM | Link | 68.2 | 50.4 | 58.8 | 54.6 | 73.7 | 71.1 | 72.2 | 78.6 | 73.9 | 60.3 | 71.2 | 99.4 | 77.0 |
4 | GPT-4o | LM-C | Link | 66.8 | 57.0 | 62.4 | 59.7 | 72.0 | 69.7 | 70.7 | 71.1 | 70.9 | 62.5 | 65.7 | 99.2 | 75.8 |
5 | Gemini-2.0-flash-exp | LM-C | Link | 66.0 | 67.2 | 58.1 | 62.7 | 70.4 | 65.7 | 66.0 | 67.3 | 67.3 | 61.8 | 66.2 | 98.2 | 75.4 |
6 | Qwen2.5-Math-PRM-7B | PRM | Link | 65.5 | 49.0 | 55.1 | 52.1 | 71.8 | 67.3 | 66.3 | 78.5 | 71.0 | 57.6 | 69.1 | 99.7 | 75.5 |
7 | Pure-PRM-7B | PRM | Link | 65.3 | 49.2 | 55.2 | 52.2 | 71.1 | 68.8 | 64.0 | 76.9 | 70.2 | 60.3 | 69.2 | 98.0 | 75.8 |
8 | Skywork-PRM-7B | PRM | Link | 65.1 | 56.4 | 62.8 | 59.6 | 69.4 | 67.1 | 67.7 | 69.9 | 68.5 | 60.9 | 65.8 | 93.2 | 73.3 |
9 | QwQ-Preview-32B | LM-O | Link | 63.6 | 57.2 | 55.6 | 56.4 | 67.4 | 72.3 | 66.2 | 66.9 | 68.2 | 57.8 | 62.7 | 100.0 | 73.5 |
10 | Skywork-PRM-1.5B | PRM | Link | 61.1 | 52.0 | 56.4 | 54.2 | 64.8 | 64.9 | 63.3 | 66.5 | 64.9 | 57.5 | 63.3 | 91.1 | 70.7 |
11 | ReasonEval-34B | PRM | Link | 60.5 | 54.8 | 48.1 | 51.5 | 66.4 | 60.3 | 57.8 | 67.5 | 63.0 | 57.7 | 64.3 | 97.2 | 73.1 |
12 | R1-Distill-Qwen-32B | LM-O | Link | 60.2 | 57.2 | 51.9 | 54.5 | 66.1 | 68.4 | 69.3 | 64.8 | 67.2 | 53.3 | 54.6 | 99.9 | 69.3 |
13 | ReasonEval-7B | PRM | Link | 60.1 | 61.0 | 50.1 | 55.6 | 62.1 | 65.9 | 61.5 | 66.0 | 63.9 | 55.7 | 58.0 | 99.5 | 71.1 |
14 | R1-Distill-Llama3.1-70B | LM-O | Link | 57.5 | 49.5 | 48.1 | 48.8 | 61.4 | 65.5 | 65.8 | 61.1 | 63.4 | 48.8 | 54.1 | 100.0 | 67.6 |
15 | Qwen2.5-Math-72B | LM-O | Link | 57.4 | 55.3 | 54.9 | 55.1 | 55.5 | 71.6 | 58.1 | 59.1 | 61.1 | 47.4 | 53.8 | 100.0 | 67.1 |
16 | RLHFlow-PRM-Mistral-8B | PRM | Link | 54.4 | 46.1 | 47.3 | 46.7 | 56.6 | 55.1 | 54.4 | 63.8 | 57.5 | 51.5 | 56.2 | 97.9 | 68.5 |
17 | RLHFlow-PRM-Deepseek-8B | PRM | Link | 54.2 | 46.4 | 48.9 | 47.6 | 55.7 | 55.0 | 53.2 | 66.2 | 57.5 | 49.0 | 55.4 | 99.8 | 68.1 |
18 | MATHMinos-Mistral-7B | PRM | Link | 54.2 | 48.8 | 54.0 | 51.4 | 57.0 | 52.1 | 50.7 | 57.8 | 54.4 | 52.8 | 55.8 | 91.1 | 66.5 |
19 | R1-Distill-Llama3.1-8B | LM-O | Link | 52.7 | 49.9 | 46.6 | 48.2 | 52.9 | 63.2 | 54.5 | 54.9 | 56.4 | 46.7 | 48.5 | 100.0 | 65.0 |
20 | R1-Distill-Qwen-7B | LM-O | Link | 52.6 | 32.9 | 37.9 | 35.4 | 47.3 | 54.1 | 48.4 | 48.0 | 49.4 | 45.6 | 46.8 | 100.0 | 64.1 |
21 | Llemma-PRM800k-7B | PRM | Link | 52.0 | 49.3 | 53.4 | 51.4 | 56.4 | 47.1 | 46.7 | 53.3 | 50.9 | 51.0 | 53.5 | 93.6 | 66.0 |
22 | Llemma-MetaMath-7B | PRM | Link | 50.5 | 50.2 | 50.5 | 50.3 | 51.9 | 47.6 | 44.4 | 52.1 | 49.0 | 50.5 | 51.3 | 96.0 | 66.0 |
23 | Llemma-oprm-7B | PRM | Link | 50.3 | 48.7 | 49.3 | 49.0 | 54.2 | 46.8 | 44.5 | 53.5 | 49.8 | 49.2 | 51.3 | 91.8 | 64.1 |
24 | MetaMath-7B | LM-O | Link | 49.7 | 48.9 | 46.9 | 47.9 | 47.3 | 48.9 | 48.4 | 48.8 | 48.3 | 46.5 | 48.3 | 98.0 | 64.2 |
25 | MetaMath-13B | LM-O | Link | 49.4 | 50.3 | 44.4 | 47.3 | 47.8 | 47.4 | 49.4 | 48.1 | 48.2 | 49.0 | 48.1 | 99.5 | 65.5 |
26 | MathShepherd-Mistral-7B | PRM | Link | 47.0 | 44.0 | 50.3 | 47.1 | 49.4 | 44.5 | 41.3 | 47.7 | 45.7 | 47.2 | 48.6 | 86.1 | 60.7 |
Task types: NR: Non-Redundancy, NCL: Non-Circular Logic, ES: Empirical Soundness, SC: Step Consistency, DC: Domain Consistency, CI: Confidence Invariance, PS: Prerequisite Sensitivity, DR: Deception Resistance, MS: Multi-Solution Consistency.
Evaluation subjects: S1: average PRMScore over the Simplicity tasks (NR and NCL).
S2: average PRMScore over the Soundness tasks (ES, SC, DC, and CI).
S3: average PRMScore over the Sensitivity tasks (PS, DR, and MS).
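The aggregate columns can be reproduced from the per-task PRMScores: each of S1, S2, and S3 is the plain mean of its member task scores. A minimal sketch, assuming this simple-mean relationship (the function and dictionary names here are illustrative, not part of the benchmark's code); note that since the published per-task scores are themselves rounded to one decimal, recomputed aggregates can differ from the table by about ±0.1.

```python
def subject_averages(scores: dict[str, float]) -> dict[str, float]:
    """Compute the S1/S2/S3 aggregates as plain means of their member task scores."""
    groups = {
        "S1": ["NR", "NCL"],             # Simplicity
        "S2": ["ES", "SC", "DC", "CI"],  # Soundness
        "S3": ["PS", "DR", "MS"],        # Sensitivity
    }
    return {
        name: sum(scores[task] for task in tasks) / len(tasks)
        for name, tasks in groups.items()
    }

# Example: per-task scores from row 1 (Gemini-2.0-thinking-exp-1219).
row = {"NR": 68.5, "NCL": 63.8, "ES": 72.9, "SC": 71.3,
       "DC": 71.0, "CI": 71.8, "PS": 60.3, "DR": 65.7, "MS": 99.8}
```

Running `subject_averages(row)` reproduces that row's S1 ≈ 66.2, S2 ≈ 71.8, and S3 ≈ 75.3 to within rounding.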