Dataset: MATH500 (following Let’s Verify Step by Step)

Pass@1024 of DeepSeekMath-7B models from DART-Math

Pass@k accuracy of different DeepSeekMath (DSMath) models and temperatures ($t$) on MATH500 (Lightman et al., 2024), a subset of the MATH test set. With enough trials, the models sample at least one answer-correct response for most (>99%) of the queries.
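
For reference, a minimal numpy sketch of the standard unbiased pass@k estimator (Chen et al., 2021), where n responses are sampled per query and c of them are answer-correct; whether the curves above use exactly this estimator is an assumption:

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimate of P(at least one of k draws from the n samples is correct)
    # = 1 - C(n-c, k) / C(n, k), computed in a numerically stable product form.
    if n - c < k:  # every size-k subset must contain a correct sample
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

print(pass_at_k(n=1024, c=3, k=1))  # per-query pass@1 estimate from 1024 samples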

Pass@128/256 of Mistral-7B & DeepSeekMath-7B

adapter explanation: the “adapter” is the alignment method used to turn a base text-completion model into a QA model for MATH500 (few-shot prompting, SFT, or RL); see the model list below and the sampling sketch after it.

model_seqs = [ # Adapter / Alignment method
    [ # DeepSeekMath-7B
        "deepseek-ai/deepseek-math-7b-base", # Few-shot (8-shot)
        "deepseek-ai/deepseek-math-7b-instruct", # Instruct (SFT, 0-shot)
        "deepseek-ai/deepseek-math-7b-rl", # RL (0-shot)
    ],
    [ # Mistral-7B
        "mistralai/Mistral-7B-v0.1", # Few-shot (8-shot)
        "peiyi9979/mistral-7b-sft", # Instruct (SFT, 0-shot)
        "peiyi9979/math-shepherd-mistral-7b-rl", # RL (0-shot)
    ],
]
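
A minimal sampling sketch with vLLM (the framework noted below). The model name comes from the list above; the prompt template, prompt content, and sampling settings are illustrative assumptions, not the exact evaluation setup:

from vllm import LLM, SamplingParams

# 0-shot QA-style prompt for the instruct/RL adapters; for the base models,
# an 8-shot prefix of worked examples would be prepended instead (assumed template).
prompts = ["Problem:\nWhat is 1 + 1?\n\nSolution:"]

llm = LLM(model="deepseek-ai/deepseek-math-7b-rl")  # RL adapter from the list above
params = SamplingParams(n=64, temperature=1.0, max_tokens=1024)  # assumed settings
for request in llm.generate(prompts, params):
    responses = [o.text for o in request.outputs]  # 64 sampled responses per prompt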

ensemble_type explanation:

Cost of large-scale sampling

🔑 Sampling 64 samples per prompt on MATH500 with DeepSeekMath-7B-RL using vLLM on an A800-PCIe (80 GB) takes ~1.5 hours (~170 ms/sample).
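
A quick sanity check of the quoted throughput, assuming all 500 MATH500 prompts are sampled:

n_prompts, samples_per_prompt = 500, 64  # MATH500 size x samples per prompt
ms_per_sample = 170                      # quoted per-sample latency
hours = n_prompts * samples_per_prompt * ms_per_sample / 1000 / 3600
print(f"{hours:.2f} h")                  # ~1.51 h, consistent with the ~1.5-hour figure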

Dataset: MATH500

Framework: vLLM