Kimi K2 Thinking, a SOTA open-source trillion-parameter reasoning model
moonshotai.github.io
923 points by nekofneko 4 days ago
As a Chinese user, I can say that many people use Kimi, even though I personally don’t use it much. China’s open-source strategy has many significant effects—not only because it aligns with the spirit of open source. For domestic Chinese companies, it also prevents startups from making reckless investments to develop mediocre models. Instead, everyone is pushed to start from a relatively high baseline. Of course, many small companies in the U.S., Japan, and Europe are also building on Qwen. Kimi is similar: before DeepSeek and others emerged, their model quality was pretty bad. Once the open-source strategy was set, these companies had no choice but to adjust their product lines and development approaches to improve their models.
Moreover, the ultimate competition between models will eventually become a competition over energy. China’s open-source models have major advantages in energy consumption, and China itself has a huge advantage in energy resources. They may not necessarily outperform the U.S., but they probably won’t fall too far behind either.
One thing to add: I think the most popular AI products in China are not Kimi; they are Doubao by ByteDance (TikTok's owner) and Yuanbao by Tencent. They have a better UI and feature set, and you can also select the DeepSeek model from them. Kimi still has a lot of users, but in the long term I think it still may not do well. So is that still a win for closed models?
There’s a lot of indications that we’re currently brute forcing these models. There’s honestly not a reason they have to be 1T parameters and cost an insane amount to train and run on inference.
What we’re going to see is as energy becomes a problem; they’ll simply shift to more effective and efficient architectures on both physical hardware and model design. I suspect they can also simply charge more for the service, which reduces usage for senseless applications.
There are also elements of stock price hype and geopolitical competition involved. The major U.S. tech giants are all tied to the same bandwagon — they have to maintain this cycle: buy chips → build data centers → release new models → buy more chips.
It might only stop once the electricity problem becomes truly unsustainable. Of course, I don’t fully understand the specific situation in the U.S., but I even feel that one day they might flee the U.S. altogether and move to the Middle East to secure resources.
> There’s honestly not a reason they have to be 1T parameters and cost an insane amount to train and run on inference.
Kimi K2 Thinking is rumored to have cost $4.6m to train - according to "a source familiar with the matter": https://www.cnbc.com/2025/11/06/alibaba-backed-moonshot-rele...
I think the most interesting recent Chinese model may be MiniMax M2, which is just 200B parameters but benchmarks close to Sonnet 4, at least for coding. That's small enough to run well on ~$5,000 of hardware, as opposed to the 1T models which require vastly more expensive machines.
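To make that hardware gap concrete, here's a back-of-envelope sketch (my numbers, not from the article) of weight memory at 4-bit quantization; it ignores KV cache and runtime overhead, so real requirements are somewhat higher:

    # Weight memory for a quantized model: params * bits_per_weight / 8 bytes
    def weight_gb(params_billions, bits_per_weight=4):
        # params_billions * 1e9 params * bits/8 bytes = params_billions * bits/8 GB
        return params_billions * bits_per_weight / 8

    print(weight_gb(200))    # ~100 GB: fits in a 256 GB Mac Studio's unified memory
    print(weight_gb(1000))   # ~500 GB: needs a far more expensive machine

Real quantized files run a bit larger than this (mixed-precision tensors, metadata), which is consistent with the ~130 GB Q4 GGUF mentioned downthread.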
That number is as real as the $5.5 million to train DeepSeek. Maybe it's real if you're only counting the literal final training run, but with total costs accounted for, including the huge number of failed runs and everything else, it's several hundred million to train a model that's usually still worse than Claude, Gemini, or ChatGPT. It took $1B+ ($500 million on energy and chips ALONE) for Grok to get into the "big 4".
By that logic, one could even argue that the real cost needs to include the infrastructure: total investment in the semiconductor industry, the national electricity grid, education, and even defence.
Correct! You do have to account for all of these things! Unironically correct! :)
> That's small enough to run well on ~$5,000 of hardware...
Honestly curious where you got this number, unless you're talking about extremely small quants. Even just a Q4 quant GGUF is ~130GB. Am I missing out on a relatively cheap way to run models this large well?
I suppose you might be referring to a Mac Studio, but (while I don't have one to be a primary source of information) it seems like there is some argument to be made about whether they run models "well".
Yes, I mean a Mac Studio with MLX.
An M3 Ultra with 256GB of RAM is $5599. That should just about be enough to fit MiniMax M2 at 8bit for MLX: https://huggingface.co/mlx-community/MiniMax-M2-8bit
Or maybe run a smaller quantized one to leave more memory for other apps!
Here are performance numbers for the 4bit MLX one: https://x.com/ivanfioravanti/status/1983590151910781298 - 30+ tokens per second.
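For anyone who wants to try it, here's a minimal sketch using the mlx-lm package (assuming the mlx-community repo linked above; the prompt is a placeholder and exact keyword arguments vary a bit between mlx-lm versions):

    # pip install mlx-lm   (Apple Silicon only; first run downloads a couple hundred GB of weights)
    from mlx_lm import load, generate

    model, tokenizer = load("mlx-community/MiniMax-M2-8bit")

    text = generate(
        model,
        tokenizer,
        prompt="Write a Python function that merges two sorted lists.",
        max_tokens=512,
        verbose=True,  # prints tokens/sec, handy for checking the 30+ tok/s claim
    )
    print(text)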
It’s kinda misleading to omit the generally terrible prompt processing speed on Macs
30 tokens per second looks good until you have to wait minutes for the first token
Running in CPU RAM works fine. It's not hard to build a machine with a terabyte of RAM.
Admittedly I've not tried running on system RAM often, but every time I have, it's been abysmally slow (< 1 T/s) on something like KoboldCPP or Ollama. Is there any particular method required to run them faster? Or is it just "get faster RAM"? I fully admit my DDR3 system has quite slow RAM...
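For what it's worth, token generation on CPU is usually memory-bandwidth bound: every generated token has to stream the needed weights through RAM. A rough upper bound, with illustrative bandwidth numbers rather than measurements:

    def max_tokens_per_sec(ram_bandwidth_gb_s, gb_read_per_token):
        # upper bound only: ignores compute, cache effects, and prompt processing
        return ram_bandwidth_gb_s / gb_read_per_token

    # Old DDR3 box (~25 GB/s) reading a dense ~130 GB model every token:
    print(max_tokens_per_sec(25, 130))   # ~0.2 tok/s, i.e. the "< 1 T/s" experience
    # Multi-channel DDR5 server (~200 GB/s) with a MoE touching ~5 GB of active experts:
    print(max_tokens_per_sec(200, 5))    # ~40 tok/s upper bound

So "get faster RAM" is part of it, but the bigger lever is a model whose active weights per token are small (a MoE) or a smaller quant.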
I assume that $4.6 mil is just the cost of the electricity?
Hard to be sure because the source of that information isn't known, but generally when people talk about training costs like this they include more than just the electricity but exclude staffing costs.
Other reported training costs tend to include rental of the cloud hardware (or equivalent if the hardware is owned by the company), e.g. NVIDIA H100s are sometimes priced out in cost-per-hour.
Citation needed on "generally when people talk about training costs like this they include more than just the electricity but exclude staffing costs".
It would be simply wrong to exclude the staffing costs. When each engineer costs well over 1 million USD in total costs year over year, you sure as hell account for them.
If you have 1,000 researchers working for your company and you constantly have dozens of different training runs on the go, overlapping each other, how would you split those salaries between those different runs?
Calculating the cost in terms of GPU-hours is a whole lot easier from an accounting perspective.
The papers I've seen that talk about training cost all do it in terms of GPU hours. The gpt-oss model card said 2.1 million H100-hours for gpt-oss:120b. The Llama 2 paper said 3.31M GPU-hours on A100-80G. They rarely give actual dollar costs and I've never seen any of them include staffing hours.
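For a rough sense of how those GPU-hour figures map to dollars (the rental rates below are my assumptions, not from the papers):

    # Illustrative on-demand cloud prices; negotiated rates can be much lower
    H100_PER_HOUR = 2.50  # USD
    A100_PER_HOUR = 1.50  # USD

    print(2.10e6 * H100_PER_HOUR)   # gpt-oss-120b: 2.1M H100-hours  -> ~$5.3M
    print(3.31e6 * A100_PER_HOUR)   # Llama 2:      3.31M A100-hours -> ~$5.0M

Which is why a single-run figure like $4.6M can be plausible while saying very little about total program cost.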
Do they include the costs of dead-end runs?
No, they don't! That's why the "$5.5 million" DeepSeek V3 number as read by American investors was total bullshit (investors ignored the asterisk saying "only final training run").