Dataset: MATH500 (following Let’s Verify Step by Step)
pass@k
Pass@k accuracy of different DeepSeekMath (DSMath) models and temperatures ($t$) on MATH500 (Lightman et al., 2024), a 500-problem subset of the MATH test set. With enough samples, the models can produce answer-correct responses for most (>99%) queries.
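For k smaller than the full sample budget, pass@k is typically reported with the unbiased estimator of Chen et al. (2021); below is a minimal sketch (function name and example numbers are illustrative, not results from this page):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    from n generations is correct, given that c of the n are correct."""
    if n - c < k:
        return 1.0  # not enough incorrect samples to fill k slots
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# With k == n this reduces to "any sampled answer correct" (the `any` ensemble below).
print(pass_at_k(n=64, c=4, k=8))   # illustrative numbers
print(pass_at_k(n=64, c=4, k=64))  # 1.0 whenever c > 0
```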
adapter
explanation: the “adapter” is the alignment method that turns a base text-completion model into a question-answering model on MATH500: few-shot prompting, instruction SFT, or RL (see the model list and the sampling sketch below).
model_seqs = [ # Adapter / Alignment method
[ # DeepSeekMath-7B
"deepseek-ai/deepseek-math-7b-base", # Few-shot (8-shot)
"deepseek-ai/deepseek-math-7b-instruct", # Instruct (SFT, 0-shot)
"deepseek-ai/deepseek-math-7b-rl", # RL (0-shot)
],
[ # Mistral-7B
"mistralai/Mistral-7B-v0.1", # Few-shot (8-shot)
"peiyi9979/mistral-7b-sft", # Instruct (SFT, 0-shot)
"peiyi9979/math-shepherd-mistral-7b-rl", # RL (0-shot)
],
]
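A minimal sketch of sampling these checkpoints with vLLM, assuming an 8-shot prefix for the base models and plain 0-shot prompts for the SFT/RL models; the prompt template, few-shot prefix, temperature, and token budget are placeholders rather than the exact settings used here:

```python
from vllm import LLM, SamplingParams

FEW_SHOT_PREFIX = "..."  # placeholder: 8 worked MATH examples for the base models

def build_prompt(question: str, is_base_model: bool) -> str:
    # Base checkpoints get the 8-shot prefix; instruct/RL checkpoints are 0-shot.
    if is_base_model:
        return f"{FEW_SHOT_PREFIX}\nProblem: {question}\nSolution:"
    return f"Problem: {question}\nSolution:"

def sample_responses(model_name: str, questions: list[str],
                     is_base_model: bool, n: int = 64, t: float = 0.7):
    # n samples per prompt at temperature t (both placeholders here).
    llm = LLM(model=model_name)
    params = SamplingParams(n=n, temperature=t, max_tokens=1024)
    prompts = [build_prompt(q, is_base_model) for q in questions]
    outputs = llm.generate(prompts, params)
    # One list of n completion strings per query.
    return [[o.text for o in out.outputs] for out in outputs]
```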
ensemble_type
explanation:
avg
: mean of the pass ratios (#correct samples / #all samples for the same problem) over all test samples (similar to Pass@1). Each query gets a float pass ratio in $[0,1]$.

maj
: pass rate by majority voting. A query counts as passed if the majority-voted answer among its sampled answers is correct.

any
: the common Pass@k. A query counts as passed if any of its sampled answers is correct. (A computation sketch for all three follows the aside below.)

<aside> 🔑 Sampling 64 samples per prompt in MATH500 with DeepSeekMath-7B-RL using vLLM on an A800-PCIe (80 GB) takes ~1.5 hr (≈170 ms/sample)
</aside>
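A minimal sketch of computing the three ensemble scores from per-query sampled answers; `is_correct` (answer matching) is an assumed helper, not code from this page:

```python
from collections import Counter

def ensemble_scores(samples_per_query, gold_answers, is_correct):
    """samples_per_query: list (one per query) of extracted final answers.
    gold_answers: reference answers. is_correct(pred, gold) -> bool."""
    avg_scores, maj_scores, any_scores = [], [], []
    for answers, gold in zip(samples_per_query, gold_answers):
        correct = [is_correct(a, gold) for a in answers]
        # avg: pass ratio for this query, a float in [0, 1].
        avg_scores.append(sum(correct) / len(correct))
        # maj: the majority-voted answer must be correct.
        voted, _ = Counter(answers).most_common(1)[0]
        maj_scores.append(float(is_correct(voted, gold)))
        # any: common Pass@k, at least one sampled answer is correct.
        any_scores.append(float(any(correct)))
    n = len(gold_answers)
    return {"avg": sum(avg_scores) / n,
            "maj": sum(maj_scores) / n,
            "any": sum(any_scores) / n}
```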
Framework: vLLM