Dataset: MATH500 (following Let’s Verify Step by Step)
pass@k
Pass@k accuracy of different DeepSeekMath (DSMath) models and temperatures ($t$) on MATH500 (Lightman et al., 2024), a 500-problem subset of the MATH test set. With enough samples, the models can produce answer-correct responses for most (>99%) queries.
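For k smaller than the full sample budget, pass@k is typically reported with the unbiased estimator of Chen et al. (2021); below is a minimal sketch (function name and example numbers are illustrative, not results from this page):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    from n generations is correct, given that c of the n are correct."""
    if n - c < k:
        return 1.0  # not enough incorrect samples to fill k slots
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# With k == n this reduces to "any sampled answer correct" (the `any` ensemble below).
print(pass_at_k(n=64, c=4, k=8))   # illustrative numbers
print(pass_at_k(n=64, c=4, k=64))  # 1.0 whenever c > 0
```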
adapter
explanation: the “adapter” is the alignment method that turns a base text-completion model into a question-answering model on MATH500: few-shot prompting, instruction SFT, or RL (see the model list and the sampling sketch below).
model_seqs = [ # Adapter / Alignment method
[ # DeepSeekMath-7B
"deepseek-ai/deepseek-math-7b-base", # Few-shot (8-shot)
"deepseek-ai/deepseek-math-7b-instruct", # Instruct (SFT, 0-shot)
"deepseek-ai/deepseek-math-7b-rl", # RL (0-shot)
],
[ # Mistral-7B
"mistralai/Mistral-7B-v0.1", # Few-shot (8-shot)
"peiyi9979/mistral-7b-sft", # Instruct (SFT, 0-shot)
"peiyi9979/math-shepherd-mistral-7b-rl", # RL (0-shot)
],
]
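A minimal sketch of sampling these checkpoints with vLLM, assuming an 8-shot prefix for the base models and plain 0-shot prompts for the SFT/RL models; the prompt template, few-shot prefix, temperature, and token budget are placeholders rather than the exact settings used here:

```python
from vllm import LLM, SamplingParams

FEW_SHOT_PREFIX = "..."  # placeholder: 8 worked MATH examples for the base models

def build_prompt(question: str, is_base_model: bool) -> str:
    # Base checkpoints get the 8-shot prefix; instruct/RL checkpoints are 0-shot.
    if is_base_model:
        return f"{FEW_SHOT_PREFIX}\nProblem: {question}\nSolution:"
    return f"Problem: {question}\nSolution:"

def sample_responses(model_name: str, questions: list[str],
                     is_base_model: bool, n: int = 64, t: float = 0.7):
    # n samples per prompt at temperature t (both placeholders here).
    llm = LLM(model=model_name)
    params = SamplingParams(n=n, temperature=t, max_tokens=1024)
    prompts = [build_prompt(q, is_base_model) for q in questions]
    outputs = llm.generate(prompts, params)
    # One list of n completion strings per query.
    return [[o.text for o in out.outputs] for out in outputs]
```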
ensemble_type
explanation:
avg
: mean of the pass ratios (#correct samples / #all samples for the same problem) over all test samples (similar to Pass@1). Each query gets a float pass ratio in $[0,1]$.

maj
: pass rate by majority voting. A query counts as passed if the majority-voted answer among its sampled answers is correct.

any
: the common Pass@k. A query counts as passed if any of its sampled answers is correct. (A computation sketch for all three follows the aside below.)

<aside> 🔑 Sampling 64 samples per prompt in MATH500 with DeepSeekMath-7B-RL using vLLM on an A800-PCIe (80 GB) takes ~1.5 hr (≈170 ms/sample)
</aside>
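A minimal sketch of computing the three ensemble scores from per-query sampled answers; `is_correct` (answer matching) is an assumed helper, not code from this page:

```python
from collections import Counter

def ensemble_scores(samples_per_query, gold_answers, is_correct):
    """samples_per_query: list (one per query) of extracted final answers.
    gold_answers: reference answers. is_correct(pred, gold) -> bool."""
    avg_scores, maj_scores, any_scores = [], [], []
    for answers, gold in zip(samples_per_query, gold_answers):
        correct = [is_correct(a, gold) for a in answers]
        # avg: pass ratio for this query, a float in [0, 1].
        avg_scores.append(sum(correct) / len(correct))
        # maj: the majority-voted answer must be correct.
        voted, _ = Counter(answers).most_common(1)[0]
        maj_scores.append(float(is_correct(voted, gold)))
        # any: common Pass@k, at least one sampled answer is correct.
        any_scores.append(float(any(correct)))
    n = len(gold_answers)
    return {"avg": sum(avg_scores) / n,
            "maj": sum(maj_scores) / n,
            "any": sum(any_scores) / n}
```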
Framework: vLLM