  1. How to curate the prompts (initial states / user input)
    1. Manual curation: MATH, GSM8K, Numina-Math, etc.
    2. Construction / Synthesis: MetaMath, MMIQC, KPMath, Xwin-Math, Jiuzhang3.0, OpenMathInstruct-2, etc. (a generic rewriting sketch follows this item)
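
For the synthesis route, here is a minimal sketch of the generic idea shared by rewriting-based pipelines: ask an LLM to rewrite seed problems into new variants. The `complete` callable and the rewrite template are illustrative placeholders, not the recipe of any particular paper listed above.

```python
# Minimal prompt-synthesis sketch: bootstrap new problems by asking an LLM to
# rewrite seed problems. `complete` is a placeholder for any completion API call.

REWRITE_TEMPLATE = (
    "Rewrite the following math problem so that it tests the same concept "
    "but uses a different scenario and different numbers.\n\n"
    "Problem: {problem}\n\nRewritten problem:"
)

def synthesize_prompts(complete, seed_problems, variants_per_seed=3):
    synthetic = []
    for problem in seed_problems:
        for _ in range(variants_per_seed):
            new_problem = complete(REWRITE_TEMPLATE.format(problem=problem))
            synthetic.append(new_problem.strip())
    return synthetic
```

In practice the rewritten problems are usually filtered (e.g., for solvability or answer consistency) before being added to the prompt set.
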
  2. How to sample the responses (actions / model output)
    1. Meta-Generation: https://arxiv.org/abs/2406.16838
      1. Conditional context:
        1. Format / Content: the original problem plus additional context, e.g.:
          1. Special instructions: CoT, Least-to-Most, PaL/PoT, Tool-Integrated Reasoning, etc.
          2. ICL examples
          3. Reference documents: e.g., retrieved from math-related corpora
        2. Source of additional context:
          1. Manually curated: Complexity-CoT, etc.
          2. Optimized: Prompt tuning, Automatic Prompt Engineering, etc.
          3. Adaptively retrieved (RAG)
      2. Budget allocation (or distribution re-weighting?): DART, etc.
      3. Algorithms:
        1. Tree-search: DFS, BFS, MCTS, etc.
        2. Graph-search: GoT, etc.
        3. Agent frameworks
        4. Parallel sampling: majority voting, self-consistency, reward-based selection (see the sketch at the end of this section)
        5. Sequential sampling: [Self-]Correct, GLoRe, SCoRe, etc.
    2. Generation / Decoding: temperature, top-k, top-p, min-p, entropix, max new tokens, etc.
    3. Training-oriented generation:
      1. Reference-guided generation: STaR, GenRM, etc.
      2. Diversity / Deduplication: Deita, etc.
      3. Data selection: e.g., LESS, RM (Llama-3.1), Deita, etc.
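
To make the parallel-sampling and decoding items above concrete, here is a minimal self-consistency sketch: sample K responses per prompt with temperature / top-p decoding, extract each final answer, and take the majority vote. It assumes a Hugging Face causal LM; the checkpoint name, the answer-extraction regex, and the example prompt are illustrative placeholders rather than part of any cited method.

```python
# Minimal self-consistency sketch: K parallel samples per prompt with
# temperature / top-p decoding, then majority voting over extracted answers.
import re
from collections import Counter

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-Math-1.5B-Instruct"  # placeholder: any causal LM works

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def extract_final_answer(text):
    # Toy extraction: take the last number in the completion.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return numbers[-1] if numbers else None

def self_consistency(prompt, k=8):
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,          # stochastic decoding so the k samples differ
        temperature=0.8,
        top_p=0.95,
        max_new_tokens=512,
        num_return_sequences=k,  # parallel samples for the same prompt
    )
    completions = tokenizer.batch_decode(
        outputs[:, inputs["input_ids"].shape[1]:],  # drop the prompt tokens
        skip_special_tokens=True,
    )
    answers = [a for a in map(extract_final_answer, completions) if a is not None]
    return Counter(answers).most_common(1)[0][0] if answers else None

print(self_consistency("Natalia sold 48 clips in April and half as many in May. How many in total?"))
```

Reward-based selection replaces the `Counter` vote with an argmax over scores from a reward model or verifier.
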
  3. How to integrate feedback (environment dynamics)
    1. Tool calling (see the sketch after this section)
    2. Critics
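
As a concrete illustration of the tool-calling feedback loop, below is a minimal sketch of tool-integrated reasoning: the model emits a fenced Python block, the environment executes it, and the captured output is appended to the context before generation continues. The `generate` callable, the fence convention, and the turn budget are assumptions for illustration, not a specific framework's API.

```python
# Minimal tool-calling loop sketch: run model-emitted Python blocks and feed
# their stdout back into the context as an observation before continuing.
# `generate(context)` is a placeholder for any LLM completion call.
import contextlib
import io
import re

FENCE = "`" * 3  # fence marker, built programmatically to keep this example readable
CODE_BLOCK = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)

def run_python(code):
    # Toy sandbox: capture stdout of exec(); a real system would isolate this.
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})
    except Exception as exc:  # feed errors back too, so the model can revise
        return f"Error: {exc!r}"
    return buffer.getvalue().strip()

def tool_integrated_reasoning(generate, problem, max_turns=4):
    context = problem
    for _ in range(max_turns):
        completion = generate(context)          # model proposes reasoning + code
        context += completion
        match = CODE_BLOCK.search(completion)
        if match is None:                       # no tool call -> treat as final answer
            return context
        observation = run_python(match.group(1))
        context += f"\n{FENCE}output\n{observation}\n{FENCE}\n"  # environment feedback
    return context
```

Critic feedback fits the same loop: replace `run_python` with a critique model whose output is appended before the next generation turn.
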
  4. How to design the loss / assign per-token gradient weights
    1. Prompt masking
    2. Feedback masking?
    3. Reward-weighted: REINFORCE, DPO, etc.
    4. Advantage-weighted: PPO, etc. (both weighting schemes are sketched below)
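
These loss-design choices reduce to per-token weighting: prompt (and, optionally, feedback) tokens are masked out so they contribute no gradient, and the remaining token losses are scaled by a sequence-level reward (REINFORCE-style) or a per-token advantage (as in the policy-gradient term of PPO). Below is a minimal sketch, assuming per-token logits, labels, and a prompt mask are already available; it is not a full RL training loop.

```python
# Sketch of per-token loss weighting: prompt tokens are masked out, and the
# remaining token losses are scaled by a scalar reward (REINFORCE-style) or
# by per-token advantages (PPO-style policy-gradient term).
import torch
import torch.nn.functional as F

def weighted_token_loss(
    logits: torch.Tensor,       # (batch, seq_len, vocab)
    labels: torch.Tensor,       # (batch, seq_len) token ids
    prompt_mask: torch.Tensor,  # (batch, seq_len), 1 where the token belongs to the prompt
    weights: torch.Tensor,      # (batch,) sequence rewards, or (batch, seq_len) advantages
) -> torch.Tensor:
    # Standard causal-LM shift: position t predicts token t + 1.
    logits = logits[:, :-1]
    labels = labels[:, 1:]
    keep = 1.0 - prompt_mask[:, 1:].float()     # prompt masking: no gradient on prompt tokens
    nll = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        reduction="none",
    ).view(labels.shape)
    if weights.dim() == 1:                      # reward-weighted: one scalar per sequence
        weights = weights[:, None].expand_as(nll)
    else:                                       # advantage-weighted: one value per token
        weights = weights[:, 1:]
    return (weights * keep * nll).sum() / keep.sum().clamp(min=1.0)
```
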
  5. How often to sample:
    1. Once per iteration over the whole prompt set, several responses per prompt: STaR, ReST-EM, Iterative DPO, etc.
    2. Once per prompt batch during training: PPO, etc. (both loop structures are sketched after the comparison table below)
A comparison of representative training recipes along the axes above:

| Original Work | Data Samples | Loss / Gradient Weights | Sampling frequency |
| --- | --- | --- | --- |
| STaR / ReST-EM | Self-generated solutions, kept only if correct | Reward-filtered SFT (binary reward) | Whole prompt set per iteration |
| Iterative DPO | Preference pairs over sampled responses | Reward-weighted (DPO) | Whole prompt set per iteration |
| PPO | On-policy rollouts | Advantage-weighted | Per prompt batch |
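
The sampling-frequency axis (the last column of the table) comes down to where the generation call sits in the training loop: STaR / ReST-EM / iterative DPO resample the whole prompt set once per outer iteration and then train offline on that snapshot, whereas PPO samples fresh responses for every prompt batch and updates immediately. A schematic sketch with stub `sample`, `score`, and `update` helpers (placeholders, not real model calls):

```python
# Schematic contrast of the two sampling frequencies. The helpers below are
# stand-in stubs (random rewards, no real model) so the two loop structures
# are runnable and easy to compare.
import random

def sample(prompt, k):                  # stub: k sampled responses for one prompt
    return [f"{prompt} :: response {i}" for i in range(k)]

def score(prompt, response):            # stub: reward, e.g. final-answer correctness
    return random.random()

def update(batch):                      # stub: one gradient step on (prompt, response, reward)
    pass

def make_batches(data, batch_size=32):
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]

def star_style_training(prompts, num_iterations=3, k=4):
    """STaR / ReST-EM / iterative DPO: resample the whole prompt set, then train offline."""
    for _ in range(num_iterations):
        dataset = []
        for prompt in prompts:                          # one sweep over every prompt
            for response in sample(prompt, k):          # several responses per prompt
                dataset.append((prompt, response, score(prompt, response)))
        for batch in make_batches(dataset):             # train on the frozen snapshot
            update(batch)

def ppo_style_training(prompts, num_steps=100, batch_size=8):
    """PPO: sample fresh responses per prompt batch and update immediately (on-policy)."""
    for _ in range(num_steps):
        batch_prompts = random.sample(prompts, min(batch_size, len(prompts)))
        rollouts = [(p, r, score(p, r)) for p in batch_prompts for r in sample(p, k=1)]
        update(rollouts)                                # update right after sampling
```
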