Shawn/Yuxuan Tong @ Tsinghua University | [email protected]
Related Works
Papers
Listed in reverse chronological order:
- Let's Verify Step by Step (Lightman et al.@OpenAI, 2023): trained the strongest PRM to date, using the largest set of human step-level annotations (PRM800K).
- Fine-Grained Human Feedback Gives Better Rewards for Language Model Training (Wu et al.@UWxAI2, 2023): proposed fine-grained RLHF for textual tasks, with rewards fine-grained along 2 dimensions: 1) granularity (sub-sentence, sentence, paragraph) and 2) error type (irrelevance, false facts, incomplete information, etc.). I think the biggest problem with the approach is that these fine-grained divisions are usually task-dependent and don't generalize easily.
- Solving math word problems with process- and outcome-based feedback (Uesato et al.@DeepMind, 2022): the first comprehensive comparison (and perhaps still the most comprehensive) of all kinds of process- and outcome-based approaches, though its results are not very strong.
- Making Large Language Models Better Reasoners with Step-Aware Verifier (Li et al.@SJTUxMSRA, 2022): proposed an interesting heuristic to replace human annotations, based on string-matching consistency among intermediate calculation results, but it also seems hard to generalize.
- Training Verifiers to Solve Math Word Problems (Cobbe et al.@OpenAI, 2021): although it uses no process supervision signals, it fully explores various factors in training the verifier and ways of using it.
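A recurring inference-time pattern in the verifier papers above is best-of-N reranking: sample several candidate solutions, score each with the trained verifier, and keep the highest-scoring one. A minimal sketch of that selection step, where `verifier_score` is a hypothetical stand-in for a learned model (names here are illustrative, not from any of the papers):

```python
# Best-of-N reranking with a verifier, in the spirit of Cobbe et al. (2021).
# `verifier_score` is a placeholder: in practice it would be a learned model's
# estimate of P(correct | problem, solution).

def verifier_score(solution: str) -> float:
    # Toy heuristic for illustration only (longer solution = higher score).
    return float(len(solution))

def best_of_n(candidates: list[str]) -> str:
    """Return the candidate solution the verifier scores highest."""
    return max(candidates, key=verifier_score)

samples = ["x = 2", "x = 2 because 1 + 1 = 2", "x = 3"]
print(best_of_n(samples))  # -> "x = 2 because 1 + 1 = 2"
```

The real systems differ mainly in what the scorer is (an ORM scoring whole solutions vs. a PRM aggregating step scores); the selection loop itself stays this simple.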
Blogs
- Supervise Process, not Outcomes (A. Stuhlmüller and J. Byun@AI Alignment Forum (a forum), 2022): emphasized that process supervision makes it possible to train for difficult tasks (e.g., scientific research) whose outcomes cannot be directly evaluated, and argued for its advantages in interpretability and safety.
Related Works of Process Supervision
Holistic Analysis
Sources cited in parentheses below indicate which work(s) studied the corresponding aspect. Points without citations either apply to all the works or reflect the author's own view.
Supervision signals
- Outcome Supervision: whether the final outcome is correct or not
- Process Supervision:
- whether the single step itself is correct or not (Lightman et al., 2023; Wu et al., 2023)
- whether the solution so far is correct or not (Uesato et al., 2022; Li et al., 2022)