- How to curate the prompts (initial states / user input)
  - Manual curation: MATH, GSM8K, Numina-Math, etc. (see the curation sketch below)
  - Construction / Synthesis: MetaMath, MMIQC, KPMath, Xwin-Math, Jiuzhang3.0, OpenMathInstruct-2, etc.
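A minimal sketch of the manual-curation path, assuming the Hugging Face `datasets` library and the public GSM8K release (whose reference solutions end in `#### <answer>`); the `extract_gold_answer` / `curate_prompts` helpers are illustrative names, not a fixed recipe:

```python
# Minimal prompt-curation sketch: load an existing dataset, extract the gold
# answer, drop unparseable items, and deduplicate prompts before they enter the pool.
import re
from datasets import load_dataset

def extract_gold_answer(solution: str) -> str | None:
    """GSM8K-style reference solutions end with '#### <answer>'."""
    match = re.search(r"####\s*(.+)", solution)
    return match.group(1).strip().replace(",", "") if match else None

def curate_prompts(split: str = "train") -> list[dict]:
    seen, pool = set(), []
    for row in load_dataset("gsm8k", "main", split=split):
        question = row["question"].strip()
        gold = extract_gold_answer(row["answer"])
        if gold is None or question in seen:   # skip unparseable or duplicate items
            continue
        seen.add(question)
        pool.append({"prompt": question, "gold_answer": gold})
    return pool
```

Synthesis pipelines (MetaMath, OpenMathInstruct-2, etc.) would add rewriting / augmentation steps on top of such a seed pool.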
- How to sample the responses (actions / model output)
  - Meta-Generation: https://arxiv.org/abs/2406.16838
  - Conditional context:
    - Format / Content: the original problem plus additional context, e.g.:
      - Special instructions: CoT, Least-to-Most, PaL/PoT, Tool-Integrated Reasoning, etc.
      - ICL examples
      - Reference documents: e.g., retrieved from math-related corpora
    - Source of additional context:
      - Manually curated: Complexity-CoT, etc.
      - Optimized: Prompt tuning, Automatic Prompt Engineering, etc.
      - Adaptively retrieved (RAG)
  - Budget allocation (or distribution re-weighting?): DART, etc.
  - Algorithms:
    - Tree-search: DFS, BFS, MCTS, etc.
    - Graph-search: GoT, etc.
    - Agent frameworks
    - Parallel sampling: majority voting, self-consistency, reward-based selection (see the selection sketch after this list)
    - Sequential sampling: [Self-]Correct, GLoRe, SCoRe, etc.
  - Generation / Decoding: temperature, top-k, top-p, min-p, entropix, max new tokens, etc. (see the decoding sketch after this list)
  - Training-oriented generation:
    - Reference guiding: STaR, GenRM, etc.
    - Diversity / Deduplication: Deita, etc.
    - Data selection: e.g., LESS, RM (Llama-3.1), Deita, etc.
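To make the decoding knobs above concrete, here is a minimal sketch of temperature, top-k, and top-p (nucleus) sampling applied to a single next-token distribution; `logits` is assumed to come from the model, and the cut-off conventions are one common variant rather than any particular library's implementation:

```python
# Decoding sketch: temperature rescaling, top-k truncation, and nucleus (top-p)
# filtering over one next-token distribution, then sampling from what remains.
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float = 0.8,
                      top_k: int = 0, top_p: float = 0.95,
                      rng: np.random.Generator | None = None) -> int:
    rng = rng or np.random.default_rng()
    logits = logits / max(temperature, 1e-6)      # temperature rescaling
    if top_k > 0:                                 # keep only the k largest logits
        cutoff = np.sort(logits)[-top_k]
        logits = np.where(logits < cutoff, -np.inf, logits)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    if top_p < 1.0:                               # nucleus filtering
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        keep = order[cumulative <= top_p]
        keep = order[: max(len(keep), 1)]         # always keep at least one token
        mask = np.zeros_like(probs, dtype=bool)
        mask[keep] = True
        probs = np.where(mask, probs, 0.0)
        probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```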
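And a sketch of the two parallel-sampling selection rules, self-consistency (majority vote over extracted final answers) and reward-based best-of-N; `generate`, `extract_answer`, and `reward_model` are hypothetical stand-ins for the actual sampler, answer parser, and reward model:

```python
# Parallel-sampling sketch: sample n responses per prompt (at temperature > 0)
# and select either the modal answer or the highest-reward response.
from collections import Counter
from typing import Callable

def self_consistency(prompt: str, generate: Callable[[str], str],
                     extract_answer: Callable[[str], str], n: int = 16) -> str:
    """Majority vote over the final answers of n independent samples."""
    answers = [extract_answer(generate(prompt)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

def best_of_n(prompt: str, generate: Callable[[str], str],
              reward_model: Callable[[str, str], float], n: int = 16) -> str:
    """Return the sampled response the reward model scores highest."""
    responses = [generate(prompt) for _ in range(n)]
    return max(responses, key=lambda r: reward_model(prompt, r))
```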
- How to integrate feedback (environment dynamics)
  - Tool calling (see the sketch below)
  - Critics
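A sketch of one way tool calling can be folded into a rollout: whenever the sampled response contains a fenced `python` block, execute it and append its stdout as an observation before letting the model continue. `generate` is a hypothetical callable, the `output` fence is an illustrative convention, and a real system would sandbox the execution:

```python
# Tool-calling sketch: alternate between model generation and code execution,
# feeding the tool output back into the context as environment feedback.
import contextlib, io, re
from typing import Callable

FENCE = "`" * 3  # fenced-code delimiter, built here to avoid nesting literal fences
CODE_BLOCK = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)

def run_tool(code: str) -> str:
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})                            # UNSAFE outside a sandbox; sketch only
    return buffer.getvalue().strip()

def tool_integrated_rollout(prompt: str, generate: Callable[[str], str],
                            max_turns: int = 4) -> str:
    transcript = prompt
    for _ in range(max_turns):
        response = generate(transcript)
        transcript += response
        match = CODE_BLOCK.search(response)
        if match is None:                         # no tool call: the rollout is finished
            break
        observation = run_tool(match.group(1))
        transcript += f"\n{FENCE}output\n{observation}\n{FENCE}\n"
    return transcript
```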
- How to design the loss / assign weights to gradients for each token
  - Prompt masking
  - Feedback masking?
  - Reward-weighted: REINFORCE, DPO, etc. (see the loss sketch after this list)
  - Advantage-weighted: PPO, etc.
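A minimal PyTorch sketch of token-level loss weighting that combines prompt masking with a per-sequence reward weight (REINFORCE-style); swapping the scalar rewards for per-token advantages gives the PPO-style variant. Shapes and the helper name are assumptions, not a reference implementation:

```python
# Loss-weighting sketch: per-token cross-entropy, prompt tokens masked out,
# each sequence scaled by its scalar reward.
import torch
import torch.nn.functional as F

def weighted_token_loss(logits: torch.Tensor,       # [batch, seq, vocab]
                        targets: torch.Tensor,      # [batch, seq] next-token ids
                        prompt_mask: torch.Tensor,  # [batch, seq], 1 = prompt/pad, 0 = response
                        rewards: torch.Tensor       # [batch] scalar reward per sequence
                        ) -> torch.Tensor:
    per_token = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")
    weights = (1.0 - prompt_mask.float()) * rewards[:, None]  # zero weight on prompt tokens
    return (per_token * weights).sum() / weights.abs().sum().clamp_min(1.0)
```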
- Frequency to sample (see the schedule sketch after this list):
  - Several responses per prompt across the whole prompt set: STaR, ReST-EM, Iterative DPO, etc.
  - One prompt batch per update: PPO, etc.
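A rough sketch of the two schedules, with all callables (`sample`, `is_correct`, `finetune`, `rollout`, `update`) as hypothetical stand-ins:

```python
# Sampling-frequency sketch: STaR/ReST-EM-style loops sweep the whole prompt set,
# filter, then fine-tune once per iteration; PPO-style loops update per prompt batch.
from typing import Callable, Iterable

def star_style_iteration(prompts: Iterable[dict], sample: Callable, is_correct: Callable,
                         finetune: Callable, k: int = 4) -> None:
    kept = []
    for item in prompts:                              # whole prompt set, every iteration
        for response in (sample(item["prompt"]) for _ in range(k)):
            if is_correct(response, item["gold_answer"]):
                kept.append((item["prompt"], response))
    finetune(kept)                                    # one fine-tuning run per iteration

def ppo_style_loop(prompt_batches: Iterable[list], rollout: Callable, update: Callable) -> None:
    for batch in prompt_batches:                      # one on-policy batch per update
        update(rollout(batch))
```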
| Original Work | Data Samples | Loss / Gradient Weights | Sampling frequency |
| --- | --- | --- | --- |
| STaR / ReST-EM | Self-generated solutions, kept only when the final answer matches the reference | SFT cross-entropy on kept responses, prompt tokens masked | Several responses per prompt across the whole prompt set, once per iteration |
| Iterative DPO | Preference pairs (chosen vs. rejected) built from sampled responses | Reward-weighted pairwise DPO loss over response tokens | Several responses per prompt across the whole prompt set, once per iteration |
| PPO | On-policy rollouts from the current policy | Advantage-weighted clipped policy-gradient loss per token | One prompt batch per policy update |