

Is DeepSeek-R1 really a cost-effective model?


<Is DeepSeek-R1 really a cost-effective model?>

Now it doesn't really matter what the truth is, since DeepSeek already seems to have become a successful model... but I'm writing down a summary of what I think. Since the announcement of DeepSeek-R1, Nvidia's stock price has plummeted by nearly 20%, and other big tech stocks have dropped one after another. The narrative is that you can now build a good model without pouring in massive amounts of GPUs and compute up front, so anyone can build the kind of foundation model that used to be exclusive to big tech... Let me go through it point by point (some of this may be wrong).

1. First of all, we need to be clear about what "cost-effective" means here: cost-effective in training, in inference, or in serving? In short, two things are drawing attention: the claim that DeepSeek trained its model for $5.576 million (about 8 billion won) on H800 GPUs (Figure 1), and that the API costs less than 1/10 of o1 (Figure 2).
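
For reference, if I remember the DeepSeek-V3 technical report correctly, the $5.576 million figure is just GPU-hours multiplied by an assumed rental price. A quick sanity check (the GPU-hour count and the $2/hour price are the report's own numbers and assumption, not measured spending):

```python
# Rough reproduction of the headline training-cost figure for DeepSeek-V3
# (pre-training + context extension + post-training, as reported).
h800_gpu_hours = 2_788_000        # total H800 GPU-hours reported for one training run
rental_usd_per_gpu_hour = 2.0     # the report's assumed rental price, not a measured cost

one_run_cost = h800_gpu_hours * rental_usd_per_gpu_hour
print(f"Reported one-run training cost: ${one_run_cost:,.0f}")   # ~$5,576,000
```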

2. Let's start with training. There are two things to keep in mind. First, the model that supposedly cost only 8 billion won (?) to train is DeepSeek-V3, not DeepSeek-R1. So how much did R1 cost to train? Nobody knows, but R1 starts from the premise that DeepSeek-V3 already exists. So first there is the cost of training V3, and then additional costs on top of it. If you look at the technical report on how R1 was trained:
1) First, take V3-base and train it to reason via reinforcement learning with GRPO (a toy illustration of GRPO follows below). This model is DeepSeek-R1-Zero.
2) Next, use DeepSeek-R1-Zero to generate Chain of Thought (CoT) data, have people clean and refine it by hand, and use this data to do supervised fine-tuning (SFT) on the DeepSeek-V3-base model (the cold start).
3) Train the model from 2) again with reinforcement learning, in the same way as in 1).
4) Use the checkpoint from 3) to generate reasoning data (~600k samples), reuse the SFT pipeline used for DeepSeek-V3 to create non-reasoning data (~200k samples), and run SFT again on the combined data.
5) Finally, use reinforcement learning once more to improve helpfulness and harmlessness while refining the reasoning ability.
Isn't it complicated? On top of the 8 billion won of training on H800s, all of these extra stages are needed to get DeepSeek-R1. Is this really cost-effective training? It's a relative comparison, so I can't say it's right or wrong, but you can see that it cost quite a bit more than people say.
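
By the way, the text above just name-drops GRPO (Group Relative Policy Optimization), so here is my understanding of its core idea: there is no separate value model (critic); you sample a group of answers for the same prompt, score each with a reward (e.g. rule-based correctness), and use the group-normalized reward as the advantage, which is part of why the RL stages can be run relatively cheaply. A toy sketch with made-up rewards, not DeepSeek code:

```python
import numpy as np

# One prompt, a group of 8 sampled answers, rule-based pass/fail rewards (made up).
group_rewards = np.array([1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0])

# Group-relative advantage: normalize against the group instead of a learned critic.
advantages = (group_rewards - group_rewards.mean()) / (group_rewards.std() + 1e-8)
print(advantages)  # answers above the group average get a positive advantage
```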

3. Nevertheless, last year's DeepSeek-V3 is a good enough model on its own. So you might think, "Isn't it great that it was trained at such a low cost?" But I think this deserves a second look too, because a low cost for a single training run doesn't mean the model was actually cheap to create. If training succeeds in one shot, it could indeed cost 8 billion won, but getting a good result naturally takes a lot of trial and error and a search for the best training recipe. Nobody knows how much GPU and compute went into that process. There are even rumors that DeepSeek actually has 50,000 H100s... I suspect there was a phase of experimenting at enormous cost. Still, I think cutting the cost of a single training run is a great achievement. What worries me is that more people now believe this can be built for 8 billion won.
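
To make the point concrete, here is a back-of-the-envelope illustration. Apart from the reported $5.576M single-run figure, every number below (how many small ablation runs, how big they are, how many full-scale runs fail) is purely hypothetical:

```python
# Illustrative only: the development cost of a model is not one training run.
final_run_cost = 5_576_000            # reported single-run cost, USD
ablation_runs = 30                    # hypothetical small-scale recipe experiments
ablation_cost = 0.05 * final_run_cost # hypothetical: each at ~5% of a full run
failed_full_runs = 2                  # hypothetical abandoned full-scale runs

total_dev_cost = final_run_cost + ablation_runs * ablation_cost + failed_full_runs * final_run_cost
print(f"Illustrative development cost: ${total_dev_cost:,.0f}")  # several times the headline figure
```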

4. Now let's move on to inference. In fact, it's not easy to judge how much an inference service actually costs from API prices alone. Each company applies a different profit margin, and some companies can serve at a loss to win market share. In fact, if you go to together.ai, which currently serves a variety of open models, you can see in Figure 3 that DeepSeek-R1 is quite expensive; it's about twice the price of the Llama 405B model. If a company is serving many models at once, isn't it reasonable to assume that each model carries a similar margin over its actual cost, so the relative prices roughly reflect the relative serving costs?
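
To see why the API price alone says little about the underlying cost, here is roughly how a provider's serving cost per token comes together. The function and all numbers are my own hypothetical placeholders, not together.ai's (or anyone's) real figures:

```python
def cost_per_million_tokens(gpu_usd_per_hour: float, num_gpus: int, tokens_per_sec: float) -> float:
    """Hardware cost per 1M generated tokens for one deployment (hypothetical model)."""
    cost_per_second = gpu_usd_per_hour * num_gpus / 3600
    return cost_per_second / tokens_per_sec * 1_000_000

# Hypothetical deployment: 8 GPUs at $2.5/GPU-hour sustaining 2,000 tokens/sec in aggregate.
print(f"${cost_per_million_tokens(2.5, 8, 2000):.2f} per 1M tokens")  # ~$2.78

# A provider can price above this (profit) or below it (loss-leader for market share),
# which is why list prices and actual costs can diverge.
```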

5. The reason most often cited for why DeepSeek-R1 can be served cheaply is its Mixture of Experts (MoE) structure. The DeepSeek model has 671B parameters, of which only 37B are active for a single inference. However, if you are handling many requests at once in the cloud, you still have to keep the entire 671B split across multiple GPUs, because each request activates different experts. Moreover, with experts scattered across multiple GPUs, there is extra work exchanging data between GPUs. So the MoE structure itself is relatively inefficient for inference, and for the same reason the relative inference speed (tokens/sec) is also quite slow: Figure 4 shows that R1 is more than 40% slower than o1.
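
A rough weight-memory estimate makes this concrete. It is a big simplification (assuming roughly 1 byte per parameter, i.e. FP8 weights, and ignoring KV cache, activations, and framework overhead), but it shows why "only 37B active" doesn't shrink the deployment:

```python
import math

def min_gpus_for_weights(params_billion: float, bytes_per_param: float = 1.0, gpu_mem_gb: int = 80):
    """Minimum GPUs needed just to hold the weights (ignores KV cache and activations)."""
    weight_gb = params_billion * bytes_per_param
    return weight_gb, math.ceil(weight_gb / gpu_mem_gb)

for name, params in [("DeepSeek-R1 (MoE, 671B total / 37B active)", 671),
                     ("hypothetical dense 37B model", 37)]:
    gb, gpus = min_gpus_for_weights(params)
    print(f"{name}: ~{gb:.0f} GB of weights -> at least {gpus} x 80GB GPUs")
```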
