Claude Code: connect to a local model when your quota runs out
(boxc.net)
317 points by fugu2 4 days ago
> Reduce your expectations about speed and performance!
Wildly understating this part.
Even the best local models (ones you run on beefy 128GB+ RAM machines) get nowhere close to the sheer intelligence of Claude/Gemini/Codex. At worst these models will move you backwards and just increase the amount of work Claude has to do when your limits reset.
Yeah this is why I ended up getting Claude subscription in the first place.
I was using GLM on the ZAI coding plan (jerry-rigged into Claude Code for $3/month), but finding myself asking Sonnet to rewrite 90% of the code GLM was giving me. At some point I was like "what the hell am I doing" and just switched.
To clarify, the code I was getting before mostly worked, it was just a lot less pleasant to look at and work with. Might be a matter of taste, but I found it had a big impact on my morale and productivity.
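(For anyone wondering how that kind of jerry-rigging works: Claude Code reads ANTHROPIC_BASE_URL and ANTHROPIC_AUTH_TOKEN from the environment, so you can point it at any Anthropic-compatible endpoint or proxy. A minimal sketch below; the URL and key are placeholders, so check your provider's or proxy's docs for the real values.)

    import os
    import subprocess

    # Point Claude Code at a non-Anthropic backend via environment variables.
    # ANTHROPIC_BASE_URL / ANTHROPIC_AUTH_TOKEN are the settings Claude Code
    # reads; the URL and token below are placeholders.
    env = os.environ.copy()
    env["ANTHROPIC_BASE_URL"] = "https://your-provider.example/api/anthropic"
    env["ANTHROPIC_AUTH_TOKEN"] = "your-api-key"

    # Launch Claude Code against the overridden backend.
    subprocess.run(["claude"], env=env)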
> but finding myself asking Sonnet to rewrite 90% of the code GLM was giving me. At some point I was like "what the hell am I doing" and just switched.
This is a very common sequence of events.
The frontier hosted models are so much better than everything else that it's not worth messing around with anything lesser if doing this professionally. The $20/month plans go a long way if context is managed carefully. For a professional developer or consultant, the $200/month plan is peanuts relative to compensation.
Until last week, you would've been right. Kimi K2.5 is absolutely competitive for coding.
Unless you include it in "frontier", but that term has usually been used to refer to the "Big 3".
I've been using MiniMax-M2.1 lately. Although benchmarks show it as comparable to Kimi 2.5 and Sonnet 4.5, I find it more pleasant to use.
I still have to occasionally switch to Opus in Opencode planning mode, but not having to rely on Sonnet anymore makes my Claude subscription last much longer.
Looks like you need at least a quarter terabyte or so of RAM to run that though?
(At today's RAM prices, upgrading to that would pay for a _lot_ of tokens for me...)
> Kimi K2.5 is absolutely competitive for coding.
Kimi K2.5 is good, but it's still behind the main models like Claude's offerings and GPT-5.2. Yes, I know what the benchmarks say, but the benchmarks for open weight models have been overpromising for a long time and Kimi K2.5 is no exception.
Kimi K2.5 is also not something you can easily run locally without investing $5-10K or more. There are hosted options you can pay for, but like the parent commenter observed: By the time you're pinching pennies on LLM costs, what are you even achieving? I could see how it could make sense for students or people who aren't doing this professionally, but anyone doing this professionally really should skip straight to the best models available.
Unless you're billing hourly and looking for excuses to generate more work I guess?
I disagree, based on having used it extensively over the last week. I find it to be at least as strong as Sonnet 4.5 and 5.2-Codex on the majority of tasks, often better. Note that even among the big 3, each of them has a domain where they're better than the other two. It's not better than Codex (x-)high at debugging non-UI code - but neither is Opus or Gemini. It's not better than Gemini at UI design - but neither is Opus or Codex. It's not better than Opus at tool usage and delegation - but neither is Gemini or Codex.
Yeah Kimi-K2.5 is the first open weights model that actually feels competitive with the closed models, and I've tried a lot of them now.
For many companies, it'd be better to pay the $200/month and lay off 1% of the workforce to cover it.
My very first tests of local Qwen-coder-next yesterday found it quite capable of acceptably improving Python functions when given clear objectives.
I'm not looking for a vibe coding "one-shot" full project model. I'm not looking to replace GPT 5.2 or Opus 4.5. But having a local instance running some Ralph loop overnight on a specific aspect for the price of electricity is alluring.
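To make the overnight idea concrete, here's a minimal sketch of such a loop against a local OpenAI-compatible server (llama.cpp, LM Studio, and vLLM all expose one). The endpoint, model name, task, and file names are illustrative, and a real Ralph-style setup would run an agent that applies edits and runs tests rather than a bare chat call:

    import time
    from openai import OpenAI

    # Local OpenAI-compatible server; URL and key are placeholders.
    client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

    TASK = "Improve error handling in utils/parse.py and explain each change."

    # Crude overnight loop: keep re-asking the same task and log the output.
    for attempt in range(100):
        response = client.chat.completions.create(
            model="qwen-coder-next",  # whatever your local server is serving
            messages=[{"role": "user", "content": TASK}],
        )
        with open("overnight_log.md", "a") as log:
            log.write(f"\nAttempt {attempt}:\n")
            log.write(response.choices[0].message.content + "\n")
        time.sleep(5)  # be gentle on the local box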
Similar experience for me. I tend to let glm-4.7 have a go at the problem, and if it keeps struggling I'll switch to Sonnet or Opus to solve it. GLM is good for the low-hanging fruit and planning.
Same. I messed around with a bunch of local models on a box with 128GB of VRAM and the code quality was always meh. Local AI is a fun hobby though. But if you want to just get stuff done it’s not the way to go.
Did you eventually move to the $20/mo Claude plan, the $100/mo plan, the $200/mo plan, or API-based usage? If API-based, how much are you averaging a month?
The $20 one, but it's hobby use for me, would probably need the $200 one if I was full time. Ran into the 5 hour limit in like 30 minutes the other day.
I've also been testing OpenClaw. It burned 8M tokens during my half hour of testing, which would have been like $50 with Opus on the API. (Which is why everyone was using it with the sub, until Anthropic apparently banned that.)
I was using GLM on Cerebras instead, so it was only $10 per half hour ;) Tried to get their Coding plan ("unlimited" for $50/mo) but it was sold out...
(My fallback is I got a whole year of GLM from ZAI for $20 for the year, it's just a bit too slow for interactive use.)
I now have 3 x $100 plans. Only then am I able to use it full time; otherwise I hit the limits. I am a heavy user, often working on 5 apps at the same time.
Shouldn't the $200 plan give you 4x? Why 3 x $100 then?
Good point. Need to look into that one. Pricing is also changing constantly with Claude
The best open models such as Kimi 2.5 are about as smart today as the big proprietary models were one year ago. That's not "nothing" and is plenty good enough for everyday work.
> The best open models such as Kimi 2.5 are about as smart today as the big proprietary models were one year ago
Kimi K2.5 is a trillion parameter model. You can't run it locally on anything other than extremely well equipped hardware. Even heavily quantized you'd still need 512GB of unified memory, and the quantization would impact the performance.
Also the proprietary models a year ago were not that good for anything beyond basic tasks.
Which takes a ~$20k Thunderbolt cluster of two 512GB Mac Studio Ultras to run at full quality…
Most benchmarks show very little improvement from "full quality" over a quantized lower-bit model. You can shrink the model to a fraction of its "full" size and keep 92-95% of the performance, with less VRAM use.
> You can shrink the model to a fraction of its "full" size and keep 92-95% of the performance, with less VRAM use.
Are there a lot of options for how far you quantize? How much VRAM does it take to get the 92-95% you are speaking of?
> Are there a lot of options for how far you quantize?
So many: https://www.reddit.com/r/LocalLLaMA/comments/1ba55rj/overvie...
> How much VRAM does it take to get the 92-95% you are speaking of?
For inference, it's heavily dependent on the size of the weights (plus context). Quantizing an f32 model down to q4/mxfp4 cuts the weights by roughly 8x (~87% smaller), and an f16 model by roughly 4x (~75%), so you won't necessarily see 92-95% less VRAM, but at smaller contexts the savings do track the bit reduction fairly closely.
Thank you. Could you give a tl;dr on "the full model needs ____ this much VRAM and if you do _____ the most common quantization method it will run in ____ this much VRAM" rough estimate please?
It’s a trivial calculation to make (+/- 10%).
Number of params == “variables” in memory
VRAM footprint ~= number of params * size of a param
A 4B model at 8 bits will take about 4GB of VRAM, give or take, the same number as the param count. At 4 bits it's ~2GB, and so on. Kimi is about 512GB at 4 bits.
And that's at unusable speeds - it takes about triple that amount to run it decently fast at int4.
Now as the other replies say, you should very likely run a quantized version anyway.
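Putting that napkin math into code, here's a rough sketch that counts only the weights and ignores KV cache and runtime overhead (so real numbers run somewhat higher); the function name and decimal-GB convention are just for illustration:

    def weight_vram_gb(params_billion: float, bits_per_param: float) -> float:
        """Rough VRAM needed for the weights alone (no KV cache, no overhead)."""
        bytes_total = params_billion * 1e9 * (bits_per_param / 8)
        return bytes_total / 1e9  # decimal GB, close enough for napkin math

    print(weight_vram_gb(4, 8))     # 4B model at 8 bits  -> ~4 GB
    print(weight_vram_gb(4, 4))     # 4B model at 4 bits  -> ~2 GB
    print(weight_vram_gb(1000, 4))  # ~1T-param Kimi at 4 bits -> ~500 GB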
Depending on what your usage requirements are, Mac Minis pooling their unified memory over RDMA is becoming a feasible option. At roughly 1/10 of the cost you're getting much, much more than 1/10 the performance. (YMMV)
https://buildai.substack.com/i/181542049/the-mac-mini-moment
I did not expect this to be a limiting factor in the Mac Mini RDMA setup!
> Thermal throttling: Thunderbolt 5 cables get hot under sustained 15GB/s load. After 10 minutes, bandwidth drops to 12GB/s. After 20 minutes, 10GB/s. Your 5.36 tokens/sec becomes 4.1 tokens/sec. Active cooling on cables helps but you’re fighting physics.
Thermal throttling of network cables is a new thing to me…
"Full quality" being a relative assessment, here. You're still deeply compute constrained, that machine would crawl at longer contexts.
[flagged]
70B dense models are way behind SOTA. Even the aforementioned Kimi 2.5 has fewer active parameters than that, and then quantized at int4. We're at a point where some near-frontier models may run out of the box on Mac Mini-grade hardware, with perhaps no real need to even upgrade to the Mac Studio.
>may
I'm completely over these hypotheticals and 'testing grade'.
I know Nvidia VRAM works, not some marketing about 'integrated RAM'. Heck, look at /r/LocalLLaMA/. There is a reason it's entirely Nvidia.
> Heck, look at /r/LocalLLaMA/. There is a reason it's entirely Nvidia.
That's simply not true. Nvidia may be relatively popular, but people use all sorts of hardware there. Just a random couple of recent self-reported hardware setups from comments:
- https://www.reddit.com/r/LocalLLaMA/comments/1qw15gl/comment...
- https://www.reddit.com/r/LocalLLaMA/comments/1qw0ogw/analysi...
- https://www.reddit.com/r/LocalLLaMA/comments/1qvwi21/need_he...
- https://www.reddit.com/r/LocalLLaMA/comments/1qvvf8y/demysti...
I specifically mentioned "hypotheticals and 'testing grade'."
Then you sent over links describing such.
In real world use, Nvidia is probably over 90%.
Mmmm, not really. I have both a 4x 3090 box and a Mac M1 with 64GB. I find that the Mac performs about the same as a 2x 3090 setup. That's nothing stellar, but you can run 70B models at decent quants with moderate context windows. Definitely useful for a lot of stuff.
>quants
>moderate context windows
Really had to modify the problem to make it seem equal? Not that quants are that bad, but the context windows thing is the difference between useful and not useful.
Are you an NVIDIA fanboy?
This is a _remarkably_ aggressive comment!
Not at all. I don't even know why someone would be incentivized to promote Nvidia, outside of holding large amounts of stock. Although I did stick my neck out suggesting we buy A6000s after the Apple M series didn't work. To 0 people's surprise, the 2x A6000s did work.
Which, while expensive, is dirt cheap compared to a comparable Nvidia or AMD system.
It's still very expensive compared to using the hosted models which are currently massively subsidised. Have to wonder what the fair market price for these hosted models will be after the free money dries up.
I wonder if the "distributed AI computing" touted by some of the new crypto projects [0] works and is relatively cheaper.
Inference is profitable. Maybe we hit a limit and we don't need as many expensive training runs in the future.
Inference APIs are probably profitable, but I doubt the $20-$100 monthly plans are.
For sure Claude Code isn’t profitable
Neither was Uber and … and …
Businesses will desire me for my insomnia once Anthropic starts charging congestion pricing.