  1. How to curate the prompts (initial states / user input)
    1. Manual curation: MATH, GSM8K, Numina-Math, etc.
    2. Construction / Synthesis: MetaMath, MMIQC, KPMath, Xwin-Math, Jiuzhang3.0, OpenMathInstruct-2, etc. (a generic rewriting sketch follows this item)
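
For the synthesis route, here is a minimal sketch of the generic idea shared by rewriting-based pipelines: ask an LLM to rewrite seed problems into new variants. The `complete` callable and the rewrite template are illustrative placeholders, not the recipe of any particular paper listed above.

```python
# Minimal prompt-synthesis sketch: bootstrap new problems by asking an LLM to
# rewrite seed problems. `complete` is a placeholder for any completion API call.

REWRITE_TEMPLATE = (
    "Rewrite the following math problem so that it tests the same concept "
    "but uses a different scenario and different numbers.\n\n"
    "Problem: {problem}\n\nRewritten problem:"
)

def synthesize_prompts(complete, seed_problems, variants_per_seed=3):
    synthetic = []
    for problem in seed_problems:
        for _ in range(variants_per_seed):
            new_problem = complete(REWRITE_TEMPLATE.format(problem=problem))
            synthetic.append(new_problem.strip())
    return synthetic
```

In practice the rewritten problems are usually filtered (e.g., for solvability or answer consistency) before being added to the prompt set.
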
  2. How to sample the responses (actions / model output)
    1. Meta-Generation: https://arxiv.org/abs/2406.16838
      1. Conditional context:
        1. Format / Content: the original problem plus additional context, e.g.:
          1. Special instructions: CoT, Least-to-Most, PaL/PoT, Tool-Integrated Reasoning, etc.
          2. ICL examples
          3. Reference documents: e.g., retrieved from math-related corpora
        2. Source of additional context:
          1. Manually curated: Complexity-CoT, etc.
          2. Optimized: Prompt tuning, Automatic Prompt Engineering, etc.
          3. Adaptively retrieved (RAG)
      2. Budget allocation (or distribution re-weighting?): DART, etc.
      3. Algorithms:
        1. Tree-search: DFS, BFS, MCTS, etc.
        2. Graph-search: GoT, etc.
        3. Agent frameworks
        4. Parallel sampling: majority voting, self-consistency, reward-based selection (see the sketch at the end of this section)
        5. Sequential sampling: [Self-]Correct, GLoRe, SCoRe, etc.
    2. Generation / Decoding: temperature, top-k, top-p, min-p, entropix, max new tokens, etc.
    3. Training-oriented generation:
      1. Reference-guided generation: STaR, GenRM, etc.
      2. Diversity / Deduplication: Deita, etc.
      3. Data selection: e.g., LESS, RM (Llama-3.1), Deita, etc.
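
To make the parallel-sampling and decoding items above concrete, here is a minimal self-consistency sketch: sample K responses per prompt with temperature / top-p decoding, extract each final answer, and take the majority vote. It assumes a Hugging Face causal LM; the checkpoint name, the answer-extraction regex, and the example prompt are illustrative placeholders rather than part of any cited method.

```python
# Minimal self-consistency sketch: K parallel samples per prompt with
# temperature / top-p decoding, then majority voting over extracted answers.
import re
from collections import Counter

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-Math-1.5B-Instruct"  # placeholder: any causal LM works

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def extract_final_answer(text):
    # Toy extraction: take the last number in the completion.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return numbers[-1] if numbers else None

def self_consistency(prompt, k=8):
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,          # stochastic decoding so the k samples differ
        temperature=0.8,
        top_p=0.95,
        max_new_tokens=512,
        num_return_sequences=k,  # parallel samples for the same prompt
    )
    completions = tokenizer.batch_decode(
        outputs[:, inputs["input_ids"].shape[1]:],  # drop the prompt tokens
        skip_special_tokens=True,
    )
    answers = [a for a in map(extract_final_answer, completions) if a is not None]
    return Counter(answers).most_common(1)[0][0] if answers else None

print(self_consistency("Natalia sold 48 clips in April and half as many in May. How many in total?"))
```

Reward-based selection replaces the `Counter` vote with an argmax over scores from a reward model or verifier.
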
  3. How to integrate feedback (environment dynamics)
    1. Tool calling (see the sketch after this section)
    2. Critics
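
As a concrete illustration of the tool-calling feedback loop, below is a minimal sketch of tool-integrated reasoning: the model emits a fenced Python block, the environment executes it, and the captured output is appended to the context before generation continues. The `generate` callable, the fence convention, and the turn budget are assumptions for illustration, not a specific framework's API.

```python
# Minimal tool-calling loop sketch: run model-emitted Python blocks and feed
# their stdout back into the context as an observation before continuing.
# `generate(context)` is a placeholder for any LLM completion call.
import contextlib
import io
import re

FENCE = "`" * 3  # fence marker, built programmatically to keep this example readable
CODE_BLOCK = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)

def run_python(code):
    # Toy sandbox: capture stdout of exec(); a real system would isolate this.
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})
    except Exception as exc:  # feed errors back too, so the model can revise
        return f"Error: {exc!r}"
    return buffer.getvalue().strip()

def tool_integrated_reasoning(generate, problem, max_turns=4):
    context = problem
    for _ in range(max_turns):
        completion = generate(context)          # model proposes reasoning + code
        context += completion
        match = CODE_BLOCK.search(completion)
        if match is None:                       # no tool call -> treat as final answer
            return context
        observation = run_python(match.group(1))
        context += f"\n{FENCE}output\n{observation}\n{FENCE}\n"  # environment feedback
    return context
```

Critic feedback fits the same loop: replace `run_python` with a critique model whose output is appended before the next generation turn.
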
  4. How to design the loss / assign per-token gradient weights
    1. Prompt masking
    2. Feedback masking?
    3. Reward-weighted: REINFORCE, DPO, etc.
    4. Advantage-weighted: PPO, etc. (both weighting schemes are sketched below)
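
These loss-design choices reduce to per-token weighting: prompt (and, optionally, feedback) tokens are masked out so they contribute no gradient, and the remaining token losses are scaled by a sequence-level reward (REINFORCE-style) or a per-token advantage (as in the policy-gradient term of PPO). Below is a minimal sketch, assuming per-token logits, labels, and a prompt mask are already available; it is not a full RL training loop.

```python
# Sketch of per-token loss weighting: prompt tokens are masked out, and the
# remaining token losses are scaled by a scalar reward (REINFORCE-style) or
# by per-token advantages (PPO-style policy-gradient term).
import torch
import torch.nn.functional as F

def weighted_token_loss(
    logits: torch.Tensor,       # (batch, seq_len, vocab)
    labels: torch.Tensor,       # (batch, seq_len) token ids
    prompt_mask: torch.Tensor,  # (batch, seq_len), 1 where the token belongs to the prompt
    weights: torch.Tensor,      # (batch,) sequence rewards, or (batch, seq_len) advantages
) -> torch.Tensor:
    # Standard causal-LM shift: position t predicts token t + 1.
    logits = logits[:, :-1]
    labels = labels[:, 1:]
    keep = 1.0 - prompt_mask[:, 1:].float()     # prompt masking: no gradient on prompt tokens
    nll = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        reduction="none",
    ).view(labels.shape)
    if weights.dim() == 1:                      # reward-weighted: one scalar per sequence
        weights = weights[:, None].expand_as(nll)
    else:                                       # advantage-weighted: one value per token
        weights = weights[:, 1:]
    return (weights * keep * nll).sum() / keep.sum().clamp(min=1.0)
```
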
  5. How often to sample:
    1. Once per iteration over the whole prompt set, several responses per prompt: STaR, ReST-EM, Iterative DPO, etc.
    2. Once per prompt batch during training: PPO, etc. (both loop structures are sketched after the comparison table below)
A comparison of representative training recipes along the axes above:

| Original Work | Data Samples | Loss / Gradient Weights | Sampling frequency |
| --- | --- | --- | --- |
| STaR / ReST-EM | Self-generated solutions, kept only if correct | Reward-filtered SFT (binary reward) | Whole prompt set per iteration |
| Iterative DPO | Preference pairs over sampled responses | Reward-weighted (DPO) | Whole prompt set per iteration |
| PPO | On-policy rollouts | Advantage-weighted | Per prompt batch |
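
The sampling-frequency axis (the last column of the table) comes down to where the generation call sits in the training loop: STaR / ReST-EM / iterative DPO resample the whole prompt set once per outer iteration and then train offline on that snapshot, whereas PPO samples fresh responses for every prompt batch and updates immediately. A schematic sketch with stub `sample`, `score`, and `update` helpers (placeholders, not real model calls):

```python
# Schematic contrast of the two sampling frequencies. The helpers below are
# stand-in stubs (random rewards, no real model) so the two loop structures
# are runnable and easy to compare.
import random

def sample(prompt, k):                  # stub: k sampled responses for one prompt
    return [f"{prompt} :: response {i}" for i in range(k)]

def score(prompt, response):            # stub: reward, e.g. final-answer correctness
    return random.random()

def update(batch):                      # stub: one gradient step on (prompt, response, reward)
    pass

def make_batches(data, batch_size=32):
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]

def star_style_training(prompts, num_iterations=3, k=4):
    """STaR / ReST-EM / iterative DPO: resample the whole prompt set, then train offline."""
    for _ in range(num_iterations):
        dataset = []
        for prompt in prompts:                          # one sweep over every prompt
            for response in sample(prompt, k):          # several responses per prompt
                dataset.append((prompt, response, score(prompt, response)))
        for batch in make_batches(dataset):             # train on the frozen snapshot
            update(batch)

def ppo_style_training(prompts, num_steps=100, batch_size=8):
    """PPO: sample fresh responses per prompt batch and update immediately (on-policy)."""
    for _ in range(num_steps):
        batch_prompts = random.sample(prompts, min(batch_size, len(prompts)))
        rollouts = [(p, r, score(p, r)) for p in batch_prompts for r in sample(p, k=1)]
        update(rollouts)                                # update right after sampling
```
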