Kimi K2.7-Code: open-source coding model with better token efficiency
huggingface.co328 points by nekofneko 8 hours ago
328 points by nekofneko 8 hours ago
I just had Kimi K2.7-code rebase my Fil-C OpenSSH patch from 3.3.1 to 3.5.7 with quite bare bones instructions and it seems to have worked.
177KB patch, so it's not a small change. The patch did not apply cleanly initially; the agent had to do nontrivial work.
I just showed it the patch against 3.3.1, what command to use to build, and the path to 3.5.7 along with a link to the documentation of the change (https://fil-c.org/constant_time_crypto).
Note, I use my own coding agent (T800, which isn't public, and was previously well tested and tuned for K2.5).
I think this cost me between $5 and $10 in API usage.
Reading their modified license terms, it cracks me up, because they've basically remade the MIT to be the MIT + the one clause that the BSD used to have, which didn't care about MAU or revenue, if you used it in a product, they asked you to 'advertise' them basically. Honestly, its a reasonable request.
This is the cursor callout.
Don't make us shame you into disclosure
Cursor had a specific licensing agreement that allowed them to brand it how they want.
> Cursor had a specific licensing agreement...
Cursor had an "agreement" with Fireworks.ai, which apparently allowed them to RL train Composer 2 Kimi Base 2.5 without attribution: https://x.com/Kimi_Moonshot/status/2035074972943831491 / https://archive.vn/CcdkI
In fairness, Composer 2 performed differently on evals than Kimi's own coding models, claimed to be better than Claude Opus 4.6: https://x.com/fynnso/status/2034706304875602030 / https://archive.vn/bVtik. And, per Lee Robinson (Cursor employee), it is very likely Cursor builds its own foundational model for Composer 3.
Wasn't the end of that story that Cursor had a non-disclosure licence, so they had not done anything wrong towards Moonshot?
Ah is that what it is? I don't use Cursor, never saw it as being relevant to me, but would not surprise me.
Cursor's composer models are finetuned kimi
They are unusable (unless you want to deliberately destroy your codebase). So if Cursor's models are Kimi based, then well. I'll skip them altogether.
Kimi works great in their CLI, but their CLI has a number of workarounds for quirks of their models, including detecting when the model gets into a loop, and reverting to a checkpoint but letting the model compose a "message" to its past self (search their CLI for "BackToTheFuture"...) It doesn't work so well in a harness that doesn't take those quirks into account.
I'm using Composer extensively, and it works great for me. Your experiences are not universal.
I wouldn't skip at least testing the original. Model distilling done by Cursor could be the culprit.
They are far from unusable. They aork great for 80-90% of a typical full stack dev. Alot less useful for more noche stuff
Composer 1.x was poor. The new one is a totally different beast and absolutely fine for day to day.
They're not unusable, they're just bad when compared with all the real frontier models.
Shaming others when all AI is trained off scraped content and code huh? Many of those sources either breaking ToS or being illegal, such as Anna’s Archive. Bold move. And Chinese models in particular have been accused of distilling off American models.
Don’t you know there’s no honor among thieves?
It seems tacked on pretty quickly - I would have expected they try a little more legalese regarding what counts as a "user interface".
> they asked you to 'advertise' them basically.
To be clear, the “advertising” clause just requires you to disclose that you use the thing somewhere in the product, such as credits in an “About” section.
Personally, when I use open code or routers, I feel that beyond a certain level, the models don't make a huge difference to me. Except for expensive and mediocre models like Gemini. In that sense, Chinese models are pretty good. I usually write code in function or method units and then design and assemble them together.
GPT series models are more thorough and better, but I'm not sure if the difference is enormous. It seems to depend on the workflow, but in my opinion, if you are thorough enough, I wonder if there really is a big difference
I've kind of given up on the routers for "free" inference, as you would expect, they tend to give you sub-par thinking because they are obviously trying to conserve as much inference as possible.
I've had some success turning my macbook M1 pro into a heating pad with Qwen 3.6 35B A3B MTP. Trying to use Gemini models "locally" resulted in a similar "short shrift" of effort resulting in mistakes and lots of turns. The reports of Fable being relentlessly "proactive" shows you can go the other direction as well, if you have strong enough branding and effective invoicing.
> The reports of Fable being relentlessly "proactive"
For the curious: https://news.ycombinator.com/item?id=48498573 - “Claude Fable is relentlessly proactive”.
Tangent: did the MTP help you at all? I’ve tested that model back to back on my M1 Max MBP and the MTP version was actually marginally worse. I wonder if I didn’t use the right settings, although I tried several based on the obvious sources.
In my experience, there's little difference between implementing individual functions between frontier models and SotA ~30B param models.
Once you have a coherent design (the hard part), you can feed it to a pretty small model and get basically the same quality.
They'll not one-shot, but they're faster and cheaper, so it still works out in your favor.
Plus you can do it locally...
I have a similar experience. However, when including code review, I think the GPT model is the most impressive
The difference in outcome isn't that big but yes, you need to be more rigorous. For instance I've found that the Kimi K2.5 and K2.6 models will comment out failing tests rather than fix a problem they just caused (mistaking them for "pre-existing failures"), so you need to specifically make commented-out tests break the build. I've not personally had that problem with any of the Anthropic or OpenAI models.
I wonder why it's the natural tendency of models to BS or do stuff like this when they don't have the correct answer - it's clear that they can program refusal into them, but for some reason, refusal has to be injected after the fact, and models can't really arrive at the conclusion that they can't answer properly.
I assume it's a lack of care when RLing them.
RL has a tendency to reinforce cheating when the cheats are easier to find than the final solution.
So when making your RL environment, you need to spend a lot of effort on finding ways the model can cheat and penalizing them.
I really hope we stop using the term "Chinese models". It has this air of Negative connotation. It's the equivalent of calling cars Japanese, which people used to do but now is almost entirely meaningless. You just call them Toyota, Honda, Lexus etc.
I don't know, I tried using one of the Chinese models and it was VERY quick to scan my entire home dir, so maybe your threat surface is a little different than mine
Models can't scan anything.
They return instructions for you to do something, and you or a script you permit chooses to execute what the model tells you and return the result to the model.
For me, it has a positive connotation! In my experience, Chinese Model means cheaper, but still quite effective model you can use for millions of tokens without burning your entire wallet in seconds. That's why I get more excited over a Chinese model release over American models.
Japanese cars is actually a positive qualifier. I'd say anything Japanese motor-powered.
I don't think "Chinese" is pejorative in this context any more than "American" is. They are one of the two ecosystems. What's wrong with saying "Japanese cars" today?
> What's wrong with saying "Japanese cars" today?
Only that it’s a fairly meaningless grouping. When japan first entered the car market in north america there might have been some commonality, but now what characteristics do they share that some american cars don’t have? They’re not even imported a lot of the time.
Given that, it does start to feel tinged with racism if someone insists on grouping things together that don’t really belong together.
As for Chinese LLMs, the term doesn’t “feel” pejorative to me - but i also don’t see a totally clear set of attributes they share. Not all are open-weight. Some are small and can be run on consumer hardware, some are huge. They even have a variety of answers to what happened june 3rd 1989
> now what characteristics do they share that some american cars don’t have?
Typically the answer is "reliability", which is a positive trait, which makes the original callout about negative connotations very odd to me.
Chinese AI models also share a positive trait: they offer more bang for the buck.
Sadly there is a pejorative context. The constant us, the free world vs China, the evil Soviets rhetoric from every major news establishment and executive creates that negative view
On the other hand the Trump administration has successfully managed to make Chinese seem better than American, so there might not be that much of a pejorative context any more..
They are all funded and owned by the same entity, the CCP, so it probably would be better to call them CCP models.
Edit: Downvoting something doesn't make it false.
For those that don't like calling them CCP models, may I remind you, the CCP won't let Chinese AI researchers out of the country any more without securing approval first[1].
[1] https://www.tomshardware.com/tech-industry/artificial-intell...
I tend to agree with the comment in my reply thread about whether we really need to add biased modifiers to the essence of a good product. I think every national system in this world is flawed. And in this context, 'China or Chinese' is often used in a negative sense, like 'Made in China'. But KIMI is a good model, and I think the comment that pointed this out to me correctly identified my unconscious bias.
And even if the Chinese Communist Party provided funding, the result is still transparently released. So even if it is some kind of propaganda, I don't see what the problem is.
Is the monopolistic greed of American companies 'good', and China's greed 'bad'? I do have that question.
The question is not whether it is a good model, it is whether the model can be trusted to not act intentionally maliciously against certain topics or certain users.
We live in a time of a great geopolitical rivalry and high tensions with an emergent technology with tons of national security implications. To pretend otherwise is silly, and to fail to ask the question, dangerous.
Whether or not it's propaganda is different from the fact that it is owned by the CCP.
Doesn't matter, because they're open-weight, so I can just download them to my PC and... hey, look, now they're owned by me! Unlike the "good" Western counterparts which are all fully proprietary. (Except Mistral, but they're nowhere near SOTA.)
What is hidden in the weights matters.
Ah yes, those pesky Chinese backdoors that no single instance was ever found, even though Chinese open-weight model are a thing for many years now. Many people burn through millions of tokens on these models every day - surely someone would have triggered one of those backdoors, right?
Or that pesky CCP censorship and propaganda baked into the model, which any random guy can remove from whichever model they want as a single weekend side project with an off-the-shelf tool[1]. (Try it. It's fun. I've done it myself.)
I agree it is an empirical question. I do not know if that research has been done in the open sphere. But please, do not pretend that there isn't a real geopolitical rivalry going on that makes such questions a legitimate, non-fruity concern.