Yuxuan Tong [email protected] / [email protected]

All content is mainly based on published, first-hand resources: OpenAI posts, works by o1 contributors, public evaluations, reproduction works with similar performance, …

o1 hidden CoT case studies

Questions

Why do the plotted points for o1 and o1-mini show a roughly 2× difference in inference cost, while the o1-preview point sits at almost the same cost as o1's?

https://openai.com/index/openai-o1-mini-advancing-cost-efficient-reasoning/

OpenAI posts

Learning self-refinement through RL

Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses. #
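OpenAI has not disclosed the training algorithm behind this. Purely as an illustration of the mechanism, the toy REINFORCE loop below learns a categorical policy over three hypothetical "reasoning strategies" from an outcome-only reward; the strategies and their success rates are made up, and nothing here claims to mirror o1's actual recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy REINFORCE: a categorical policy over 3 hypothetical reasoning
# strategies, rewarded 1 only when the (simulated) final answer is correct.
SUCCESS_P = np.array([0.2, 0.4, 0.9])  # made-up per-strategy success rates
logits = np.zeros(3)
lr = 0.5

for step in range(500):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    a = rng.choice(3, p=probs)                    # pick an approach
    reward = float(rng.random() < SUCCESS_P[a])   # outcome-only correctness check
    # REINFORCE: grad of log softmax(a) w.r.t. logits is onehot(a) - probs
    grad = -probs
    grad[a] += 1.0
    logits += lr * reward * grad

print("learned strategy distribution:", np.round(probs, 3))
```

With only a pass/fail signal, probability mass drifts toward the strategy that most often yields a correct answer, which is the bare-bones version of "refining the strategies it uses".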

Meta-abilities

It learns to recognize and correct its mistakes. It learns to break down tricky steps into simpler ones. It learns to try a different approach when the current one isn’t working. #

The models use these reasoning tokens to "think", breaking down their understanding of the prompt and considering multiple approaches to generating a response. After generating reasoning tokens, the model produces an answer as visible completion tokens, and discards the reasoning tokens from its context (so in a multi-step conversation between a user and an assistant, earlier turns' reasoning tokens are not carried forward). #
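The billing side of this is visible in the API: reasoning tokens are counted as completion tokens but never returned as text. A minimal sketch using the OpenAI Python SDK and the usage.completion_tokens_details field from the reasoning guide (model choice and prompt are arbitrary):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

resp = client.chat.completions.create(
    model="o1-mini",  # any o1-family model with hidden reasoning tokens
    messages=[{"role": "user", "content": "How many primes are below 100?"}],
)

u = resp.usage
hidden = u.completion_tokens_details.reasoning_tokens
# Reasoning tokens are billed under completion_tokens but never shown,
# and are discarded from context before the next turn.
print("visible completion tokens:", u.completion_tokens - hidden)
print("hidden reasoning tokens:  ", hidden)
```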

Exploration

o1-mini can explore more thought chains compared to o1-preview #
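The post does not say what "explore more thought chains" means mechanically. One common reading is parallel sampling plus answer voting (self-consistency), where a cheaper model affords more samples per question; the toy below uses a made-up noisy solver (60% accurate) just to show how accuracy grows with the number of sampled chains.

```python
import random
from collections import Counter

def sample_chain(question: str) -> str:
    # Stand-in for one sampled chain of thought: a noisy solver that
    # returns the right answer ("42") only 60% of the time.
    return "42" if random.random() < 0.6 else random.choice(["41", "43"])

def answer_by_self_consistency(question: str, k: int) -> str:
    # Majority vote over k independently sampled chains: more chains
    # trade extra inference compute for a more reliable final answer.
    answers = [sample_chain(question) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]

random.seed(0)
for k in (1, 8, 64):
    acc = sum(answer_by_self_consistency("q", k) == "42" for _ in range(200)) / 200
    print(f"k={k:2d} chains -> accuracy ~ {acc:.2f}")
```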

Coding specialization (o1-ioi)