Yuxuan Tong [email protected] / [email protected]

All content is mainly based on published, first-hand resources: OpenAI posts, works by o1 contributors, public evaluations, reproduction works with similar performance, …

o1 hidden CoT case studies

Questions

Why do the plotted points for o1 and o1-mini show a roughly 2× difference in inference cost, while the o1-preview point sits at almost the same cost as o1's?

https://openai.com/index/openai-o1-mini-advancing-cost-efficient-reasoning/

OpenAI posts

Learning self-refinement through RL

Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses. #
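OpenAI has not disclosed the training algorithm behind this. Purely as an illustration of the mechanism, the toy REINFORCE loop below learns a categorical policy over three hypothetical "reasoning strategies" from an outcome-only reward; the strategies and their success rates are made up, and nothing here claims to mirror o1's actual recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy REINFORCE: a categorical policy over 3 hypothetical reasoning
# strategies, rewarded 1 only when the (simulated) final answer is correct.
SUCCESS_P = np.array([0.2, 0.4, 0.9])  # made-up per-strategy success rates
logits = np.zeros(3)
lr = 0.5

for step in range(500):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    a = rng.choice(3, p=probs)                    # pick an approach
    reward = float(rng.random() < SUCCESS_P[a])   # outcome-only correctness check
    # REINFORCE: grad of log softmax(a) w.r.t. logits is onehot(a) - probs
    grad = -probs
    grad[a] += 1.0
    logits += lr * reward * grad

print("learned strategy distribution:", np.round(probs, 3))
```

With only a pass/fail signal, probability mass drifts toward the strategy that most often yields a correct answer, which is the bare-bones version of "refining the strategies it uses".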

Meta-abilities

It learns to recognize and correct its mistakes. It learns to break down tricky steps into simpler ones. It learns to try a different approach when the current one isn’t working. #

The models use these reasoning tokens to "think", breaking down their understanding of the prompt and considering multiple approaches to generating a response. After generating reasoning tokens, the model produces an answer as visible completion tokens, and discards the reasoning tokens from its context (so in a multi-step conversation between a user and an assistant, earlier turns' reasoning tokens are not carried forward). #
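The billing side of this is visible in the API: reasoning tokens are counted as completion tokens but never returned as text. A minimal sketch using the OpenAI Python SDK and the usage.completion_tokens_details field from the reasoning guide (model choice and prompt are arbitrary):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

resp = client.chat.completions.create(
    model="o1-mini",  # any o1-family model with hidden reasoning tokens
    messages=[{"role": "user", "content": "How many primes are below 100?"}],
)

u = resp.usage
hidden = u.completion_tokens_details.reasoning_tokens
# Reasoning tokens are billed under completion_tokens but never shown,
# and are discarded from context before the next turn.
print("visible completion tokens:", u.completion_tokens - hidden)
print("hidden reasoning tokens:  ", hidden)
```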

Exploration

o1-mini can explore more thought chains compared to o1-preview #
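The post does not say what "explore more thought chains" means mechanically. One common reading is parallel sampling plus answer voting (self-consistency), where a cheaper model affords more samples per question; the toy below uses a made-up noisy solver (60% accurate) just to show how accuracy grows with the number of sampled chains.

```python
import random
from collections import Counter

def sample_chain(question: str) -> str:
    # Stand-in for one sampled chain of thought: a noisy solver that
    # returns the right answer ("42") only 60% of the time.
    return "42" if random.random() < 0.6 else random.choice(["41", "43"])

def answer_by_self_consistency(question: str, k: int) -> str:
    # Majority vote over k independently sampled chains: more chains
    # trade extra inference compute for a more reliable final answer.
    answers = [sample_chain(question) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]

random.seed(0)
for k in (1, 8, 64):
    acc = sum(answer_by_self_consistency("q", k) == "42" for _ in range(200)) / 200
    print(f"k={k:2d} chains -> accuracy ~ {acc:.2f}")
```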

Coding specialization (o1-ioi)