Open-R1: a fully open reproduction of DeepSeek-R1

If you have ever struggled with a tough math problem, you know how useful it is to think a little longer and work through it carefully. OpenAI's o1 model showed that when LLMs are trained to do the same, by using more compute at inference time, they get significantly better at solving reasoning tasks such as mathematics and coding.
However, the recipe behind OpenAI's reasoning models has been a well-kept secret. That is, until last week, when DeepSeek released their DeepSeek-R1 model and promptly broke the internet (and the stock market!).
Besides performing as well as or better than o1, the DeepSeek-R1 release was accompanied by a detailed tech report that outlined the main steps of their training recipe. This recipe involved several innovations, most notably the application of pure reinforcement learning to teach a base language model how to reason without any human supervision. As shown in the figure below, building a powerful reasoning model is now simple as long as you have access to a capable base model and a high-quality data mixture:
However, the DeepSeek-R1 release leaves open several questions about:
- Data collection: How were the reasoning-specific datasets curated?
- Model training: No training code was released by DeepSeek, so it is unknown which hyperparameters work best and how they differ across model families and scales.
- Scaling laws: What are the compute and data trade-offs in training reasoning models?
These questions prompted us to launch the Open-R1 project, an initiative to systematically reconstruct DeepSeek-R1's data and training pipeline, validate its claims, and push the boundaries of open reasoning models. By building Open-R1, we aim to provide transparency on how reinforcement learning can enhance reasoning and to share reproducible insights with the open-source community so that future models can build on these techniques.
In this blog post we take a look at the key components behind DeepSeek-R1, which parts we plan to replicate, and how to contribute to the Open-R1 project.
Let's dive in 🚀!
How did they do it?
DeepSeek-R1 is a reasoning model built on the foundation of DeepSeek-V3. Like any good reasoning model, it starts with a strong base model, and DeepSeek-V3 is exactly that. This 671B Mixture of Experts (MoE) model performs on par with heavyweights such as Sonnet 3.5 and GPT-4o. What is even more impressive is its training cost of roughly $5.5M, thanks to architectural innovations.
DeepSeek also introduced two models: DeepSeek-R1-Zero and DeepSeek-R1, each with a different training approach. DeepSeek-R1-Zero skips supervised fine-tuning entirely and relies purely on reinforcement learning, using Group Relative Policy Optimization (GRPO) to make the process more efficient. A simple reward system guides the model, providing feedback based on the accuracy and structure of its answers. This approach helped the model develop useful reasoning skills, such as breaking problems into steps and verifying its own outputs. However, its answers often lacked clarity and were difficult to read.
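To make the reward idea concrete, here is a minimal sketch of what such a rule-based reward could look like: one signal for answer accuracy and one for whether the completion follows a `<think>…</think><answer>…</answer>` structure. The tag names, weights, and exact-match check are our own illustrative assumptions, not DeepSeek's released code.

```python
import re

# Illustrative rule-based rewards in the spirit of the R1-Zero recipe:
# one signal for correctness, one for output structure. Tags and weights
# below are assumptions made for this sketch.
THINK_ANSWER_PATTERN = re.compile(
    r"^<think>.*?</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)

def format_reward(completion: str) -> float:
    """1.0 if the completion follows the <think>...</think><answer>...</answer> template."""
    return 1.0 if THINK_ANSWER_PATTERN.match(completion.strip()) else 0.0

def accuracy_reward(completion: str, gold_answer: str) -> float:
    """1.0 if the extracted final answer matches the reference (exact string match here)."""
    match = THINK_ANSWER_PATTERN.match(completion.strip())
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0

def total_reward(completion: str, gold_answer: str) -> float:
    """Combine both signals; the 0.8/0.2 weighting is arbitrary for illustration."""
    return 0.8 * accuracy_reward(completion, gold_answer) + 0.2 * format_reward(completion)

sample = "<think>2 + 2 is 4 because ...</think><answer>4</answer>"
print(total_reward(sample, "4"))  # 1.0
```

In practice one would verify math answers with a symbolic checker and run unit tests for code rather than exact string matching, but the shape of the signal is the same.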
That is where DeepSeek-R1 comes in: it goes through additional RL and refinement stages, including rejecting low-quality outputs with human-preference based rewards, to produce a model that not only reasons well but also gives clear and consistent answers.
That all sounds great, but what is actually missing? Let's take a look at the missing pieces of the puzzle.
Open-R1: the missing pieces
The release of DeepSeek-R1 is a remarkable boon for the community, but they did not release everything. Although the model weights are open, the datasets and code used to train the model are not 😢.
The goal of Open-R1 is to build these missing pieces so that the whole research and industry community can build similar or better models using these recipes and datasets. And by doing this in the open, everybody in the community can contribute!
As shown in the figure below, here is our plan of attack:
- Step 1: Replicate the R1-Distill models by distilling a high-quality reasoning dataset from DeepSeek-R1 (a rough sketch of this step follows the list below).
- Step 2: Replicate the pure RL pipeline that DeepSeek used to create R1-Zero. This will involve curating new, large-scale datasets for math, reasoning, and code.
- Step 3: Show that we can go from base model → SFT → RL via multi-stage training.
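As a rough sketch of what Step 1 could look like, the snippet below queries a hosted DeepSeek-R1 endpoint for reasoning traces and collects them into a dataset that a smaller model can later be fine-tuned on. The endpoint URL, model name, prompts, and sampling settings are placeholders we chose for illustration; treat this as an outline under those assumptions, not the project's actual pipeline.

```python
from openai import OpenAI          # any OpenAI-compatible client works against a hosted endpoint
from datasets import Dataset

# Placeholder endpoint and model name -- swap in whatever serves DeepSeek-R1 for you.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

prompts = [
    "Solve: if 3x + 5 = 20, what is x?",
    "Write a Python function that checks whether a string is a palindrome.",
]

records = []
for prompt in prompts:
    response = client.chat.completions.create(
        model="deepseek-r1",        # whatever name the serving stack exposes
        messages=[{"role": "user", "content": prompt}],
        max_tokens=2048,
        temperature=0.6,
    )
    # Keep the full completion, including the chain of thought, as the SFT target.
    records.append({"prompt": prompt, "completion": response.choices[0].message.content})

# The resulting dataset can be pushed to the Hub and used to fine-tune a smaller student model.
dataset = Dataset.from_list(records)
dataset.save_to_disk("r1-distill-traces")
```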
The distilled datasets will allow everybody to turn existing or new LLMs into reasoning models simply by fine-tuning on them. The training recipes involving RL will serve as a starting point for anybody to build similar models from scratch and will allow researchers to build even more advanced methods on top.
Note that we don't want to stop at math datasets. There is a lot of potential in exploring other areas, most obviously code, but also scientific fields such as medicine, where reasoning models could have a significant impact.
This initiative is not just about replicating results, it is about sharing insights with the community. By documenting what works, what doesn't, and why, we hope to save others from wasting time and compute on unproductive paths.
If this sounds interesting, we'd love your help! Whether it's contributing code or joining the discussions on Hugging Face, there are many ways to get involved. Let's build this together! 🚀