| # | Model | Class | Source | Overall | NR. | NCL. | S1 | ES. | SC. | DC. | CI. | S2 | PS. | DR. | MS. | S3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Gemini-2.0-thinking-exp-1219 🥇 | LM-C | Link | 68.8 | 68.5 | 63.8 | 66.2 | 72.9 | 71.3 | 71.0 | 71.8 | 71.8 | 60.3 | 65.7 | 99.8 | 75.3 |
| 2 | o1-mini 🥈 | LM-C | Link | 68.8 | 65.6 | 63.7 | 64.6 | 74.5 | 67.7 | 73.8 | 72.3 | 72.1 | 61.8 | 64.8 | 100.0 | 75.5 |
| 3 | GPT-4o 🥉 | LM-C | Link | 66.8 | 57.0 | 62.4 | 59.7 | 72.0 | 69.7 | 70.7 | 71.1 | 70.9 | 62.5 | 65.7 | 99.2 | 75.8 |
| 4 | Gemini-2.0-flash-exp | LM-C | Link | 66.0 | 67.2 | 58.1 | 62.7 | 70.4 | 65.7 | 66.0 | 67.3 | 67.3 | 61.8 | 66.2 | 98.2 | 75.4 |
| 5 | QwQ-Preview-32B | LM-O | Link | 63.6 | 57.2 | 55.6 | 56.4 | 67.4 | 72.3 | 66.2 | 66.9 | 68.2 | 57.8 | 62.7 | 100.0 | 73.5 |
| 6 | ReasonEval-34B | PRM | Link | 60.5 | 54.8 | 48.1 | 51.5 | 66.4 | 60.3 | 57.8 | 67.5 | 63.0 | 57.7 | 64.3 | 97.2 | 73.1 |
| 7 | ReasonEval-7B | PRM | Link | 60.0 | 61.0 | 50.1 | 55.6 | 62.1 | 65.9 | 61.5 | 65.9 | 63.8 | 55.6 | 57.9 | 99.5 | 71.0 |
| 8 | RLHFlow-PRM-Mistral-8B | PRM | Link | 54.4 | 46.1 | 47.3 | 46.7 | 56.6 | 55.1 | 54.4 | 63.8 | 57.5 | 51.5 | 56.2 | 97.9 | 68.5 |
| 9 | RLHFlow-PRM-Deepseek-8B | PRM | Link | 54.2 | 46.4 | 48.9 | 47.6 | 55.7 | 55.0 | 53.2 | 66.2 | 57.5 | 49.0 | 55.4 | 99.8 | 68.1 |
| 10 | MATHMinos-Mistral-7B | PRM | Link | 54.2 | 48.8 | 54.0 | 51.4 | 57.0 | 52.1 | 50.7 | 57.8 | 54.4 | 52.8 | 55.8 | 91.1 | 66.5 |
| 11 | Llemma-PRM800k-7B | PRM | Link | 52.0 | 49.3 | 53.4 | 51.4 | 56.4 | 47.1 | 46.7 | 53.3 | 50.9 | 51.0 | 53.5 | 93.6 | 66.0 |
| 12 | Llemma-MetaMath-7B | PRM | Link | 50.5 | 50.2 | 50.5 | 50.3 | 51.9 | 47.6 | 44.4 | 52.1 | 49.0 | 50.5 | 51.3 | 96.0 | 66.0 |
| 13 | Llemma-oprm-7B | PRM | Link | 50.3 | 48.7 | 49.3 | 49.0 | 54.2 | 46.8 | 44.5 | 53.5 | 49.8 | 49.2 | 51.3 | 91.8 | 64.1 |
| 14 | MathShepherd-Mistral-7B | PRM | Link | 47.0 | 44.0 | 50.3 | 47.1 | 49.4 | 44.5 | 41.3 | 47.7 | 45.7 | 47.2 | 48.6 | 86.1 | 60.7 |
| 15 | Skywork-PRM-7B | PRM | Link | 36.2 | 35.7 | 41.2 | 38.4 | 36.7 | 29.1 | 30.6 | 34.4 | 32.7 | 36.8 | 37.4 | 88.8 | 54.3 |
| 16 | Skywork-PRM-1.5B | PRM | Link | 31.7 | 31.4 | 35.8 | 33.6 | 32.4 | 25.7 | 26.0 | 30.2 | 28.6 | 33.1 | 32.3 | 81.1 | 48.8 |

Task types: NR: Non-Redundancy; NCL: Non-Circular Logic; ES: Empirical Soundness; SC: Step Consistency; DC: Domain Consistency; CI: Confidence Invariance; PS: Prerequisite Sensitivity; DR: Deception Resistance; MS: Multi-Solution Consistency.
Evaluation subjects: S1: average PRMScore over the Simplicity tasks (NR. and NCL.).
S2: average PRMScore over the Soundness tasks (ES., SC., DC., and CI.).
S3: average PRMScore over the Sensitivity tasks (PS., DR., and MS.).
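
As a quick sanity check on how the S1, S2, and S3 columns relate to the task columns, the sketch below recomputes them for the top row. It assumes "average" means an unweighted mean of the listed task scores; the names `SUBJECT_TASKS` and `subject_averages` are illustrative and not part of any official tooling.

```python
from statistics import mean

# Task abbreviations grouped by evaluation subject, as given in the legend.
SUBJECT_TASKS = {
    "S1": ("NR", "NCL"),             # Simplicity
    "S2": ("ES", "SC", "DC", "CI"),  # Soundness
    "S3": ("PS", "DR", "MS"),        # Sensitivity
}


def subject_averages(task_scores):
    """Return the unweighted mean of each subject's task scores.

    Assumption: the leaderboard's "average" is a plain arithmetic mean.
    """
    return {
        subject: mean(task_scores[task] for task in tasks)
        for subject, tasks in SUBJECT_TASKS.items()
    }


# Task scores and reported averages copied from the rank-1 row
# (Gemini-2.0-thinking-exp-1219).
top_row = {
    "NR": 68.5, "NCL": 63.8,
    "ES": 72.9, "SC": 71.3, "DC": 71.0, "CI": 71.8,
    "PS": 60.3, "DR": 65.7, "MS": 99.8,
}
reported = {"S1": 66.2, "S2": 71.8, "S3": 75.3}

computed = subject_averages(top_row)
for subject, value in computed.items():
    print(f"{subject}: computed {value:.2f}, reported {reported[subject]}")
```

For this row the recomputed means agree with the reported S1, S2, and S3 to within rounding of the one-decimal task scores. The legend does not say how the Overall column is derived, so the sketch does not attempt to reproduce it.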