A few words on DS4

182 points by caust1c 5 hours ago

DwarfStar4 is a small LLM inference runtime that can run DeepSeek 4. The blog post implies that it currently requires 96GB of VRAM.

For others who are lacking context :-)

foresto - an hour ago

Thanks. Outside of LLM circles, DS4 is usually a video game controller.
- artyom - 38 minutes ago
  
  Well, I was sitting here expecting the Redis creator to have an opinion on the still-unannounced Dark Souls 4.
- jofzar - 26 minutes ago
  
  I am actually kind of disappointed it wasn't a deep dive on the DualShock 4.

karmakaze - 2 hours ago

Great to find such a narrowly focused project:

> We support the following backends:

    Metal is our primary target. Starting from MacBooks with 96GB of RAM.
    NVIDIA CUDA with special care for the DGX Spark.
    AMD ROCm is only supported in the rocm branch. It is kept separate from main
    since I (antirez) don't have direct hardware access, so the community rebases
    the branch as needed.

> This project would not exist without llama.cpp and GGML, make sure to read the acknowledgements section, a big thank you to Georgi Gerganov and all the other contributors.

Edit: aww, doesn't seem to support offloading to system RAM[0] (yet)

[0] https://github.com/antirez/ds4/issues/108

Guess I'll have to keep watching the llama.cpp issue[1]

[1] https://github.com/ggml-org/llama.cpp/issues/22319

zmmmmm - 37 minutes ago

I'm very curious where we will saturate the curve on "enough" intelligence for coding. At some point, you can let a less smart model hammer at a problem for longer and get to the same result, and as long as you are not involved it comes to the same thing. I feel like DeepSeek V4 Pro is nearly there. Maybe Flash is too.

Once we hit that point, I am curious how much of Anthropic's current business model falls apart? So far it's always been clear that you just pay for the most intelligent model you can get because it is worth it. It now seems clear to me that there is limited runway on that concept. It is just a question of how long that runway is. I honestly wonder how much of their frantic push to broaden out into enterprise / productivity is because they see this writing on the wall already.

jofzar - 22 minutes ago

> I'm very curious where we will saturate the curve on "enough" intelligence for coding. At some point, you can let a less smart model hammer at a problem for longer and get to the same result, and as long as you are not involved it comes to the same thing. I feel like DeepSeek V4 Pro is nearly there. Maybe Flash is too.
It's always going to come down to cost: developer time vs. developer cost vs. AI cost vs. developer productivity.
With 4.6 it looks like we are at the upper limit of the appetite for cost (for "regular" businesses), so the other levers will probably need to change.

somewhatrandom9 - 2 hours ago

With "intelligence" (or whatever you want to call it) and speed both seeming to ramp up quickly with local models, I wonder what the growth rate and ceiling(?) might be in this space. Will this kind of IQ and performance work with just, e.g., 16GB of RAM in a couple of years? Is there a new kind of Moore's law to be defined here?

lwansbrough - an hour ago

The people working at the leading edge of this stuff seem to believe that there is a need for parallel models that solve different problems.
A crow exhibits some degree of intelligence in what is a very small brain compared to humans. There is overlap in the problem solving skills of the dumbest humans and the smartest crows.
So the question is: what is that? Yann LeCun seems to think it’s what we now call world models. World models predict behaviour as opposed to predicting structured data (like language).
If your model can predict how some world works (how you define world largely depends on the size of your training data), then in theory it is able to reason about cause and effect.
If you can combine cause and effect reasoning with language, you might get something truly intelligent.
That’s where things seem to be going. Once we have a prototype of that system, there will be many questions about how much data you really need. We’ve seen how even shrinking LLMs with 1-bit quantization can lead to models that exhibit a fairly strong understanding of language.
I don’t think it’s unreasonable to expect to see some very intelligent, relatively low-memory AI systems in the next couple of years.

FuckButtons - 2 hours ago

It’s shocking how close this feels to Claude. Obviously it's much slower, but I don’t know that it’s significantly dumber. Interestingly, the imatrix quantization seems to be better than whatever quant the ZDR inference backends on OpenRouter are using. Yesterday it was self-aware enough to realize, without me telling it, that its own server process was itself, which is not something I’ve ever observed a local model doing before.

stavros - 2 hours ago

In my (obviously anecdotal) testing, DeepSeek V4 Pro was better than Sonnet at coding. It is much slower, but also many times cheaper, especially with the promotion right now.

0xbadcafebee - 3 hours ago

I don't see an explanation of why they would make a model-specific inference engine vs. just using llama.cpp. There are already lots of people working on the llama.cpp integration. This is a lot of effort spent on a single model, which is likely to become obsolete when a different model comes out that does better. In some discussions, people are now making PRs against both the llama.cpp branches and ds4, so it's taking a rare commodity (people investing development time in this model) and fragmenting it.

zozbot234 - 3 hours ago

The author has mentioned many times that the llama.cpp maintainers don't want code that's predominantly written by AI with no human revision. If anyone wants to try to get the support upstreamed into that project, they're quite free to do that: the code is MIT licensed.
- kristianp - 2 hours ago
  
  Also, Antirez has been able to use GPT to iterate on the code and performance. He and the other DS4 contributors have a set of result files to ensure that correctness is maintained, and benchmarks to verify performance, and the LLM is able to iterate within that framework. Having a small, focussed codebase helps here.
  Antirez explained the dev process when he posted a pure C implementation of the Flux 2 Klein image gen model, at https://news.ycombinator.com/item?id=46670279
flakiness - 3 hours ago

I believe the assumption is: The code is cheap. The collaboration (e.g. upstreaming) is expensive.
Is it true? We'll see, in a few years.
fgfarben - 2 hours ago

At a certain point the level of abstraction / genericization necessary for a big flexible project (like llama.cpp or Linux) blows things up into a huge number of files. Something newer and smaller can move faster.

easythrees - an hour ago

I thought for a moment there was a Dark Souls 4

JavierFlores09 - an hour ago

Glad I wasn't the only one. My second thought was the DualShock controller, but that wasn't it either lol
NDlurker - an hour ago

I was thinking DualShock 4.

minimaxir - 3 hours ago

A relevant recent tweet from antirez: https://x.com/antirez/status/2054854124848415211

> Gentle reminder on how, in the recent DS4 fiesta, not just me but every other contributor found GPT 5.5 able to help immensely and Opus completely useless.

I've noticed the same for lower level squeezing-as-much-performance-as-possible code work.

- 36 minutes ago

[deleted]
throwaway041207 - 2 hours ago

Assuming we are talking about Code/Codex, are you on API billing or subscription? I have essentially unlimited API billing at my disposal and I haven't noticed any degradation of quality across Opus versions.
- chatmasta - an hour ago
  
  Same here, the enterprise version of Claude has been great. Luckily I’m not the one paying for it. We also have Copilot, and when GPT-5.4 came out at 1x request cost I was very impressed, but I haven’t had much time to compare the two.
  I also don’t have time to do much personal coding outside of work, so I haven’t subscribed to a personal one yet. But I intend to go for Codex just to balance the Claude at work and also because of the hostile moves from Anthropic toward their consumer business.
sanxiyn - 2 hours ago

There is a benchmark for performance work, and I don't think model vendors are optimizing for it. The latest result from GSO is that both Opus 4.6 and 4.7 slightly outperform GPT 5.5. This also matches my experience.
https://gso-bench.github.io/
- vitorsr - an hour ago
  
  Tasks are taken from commit histories in public Git repositories, which defeats the purpose.

simonw - 4 hours ago

I got this running on a 128GB M5 the other day - pretty painless, model runs in about 80GB of RAM and it seemed to be very capable at writing code and tool execution.

chatmasta - an hour ago

So you’re saying I should buy the M5? :) I’ve been resisting, thinking I’ll never use it… it’ll be better in a year… I’ll wait for the Studio (do we still think that’s coming in June?)… etc.
- simonw - 21 minutes ago
  
  I expect this to be my main machine for the next 3-4 years (which is how I justified the 128GB one). It's a beast of a machine - I love that I can run an 80GB model and still have 48GB left for everything else.
  Can't say that it wouldn't be a better idea to spend that cash on tokens from the frontier hosted models though.
  I'm an LLM nerd so running local models is worth it from a research perspective.
perfmode - 4 hours ago

How’s the token throughput / response time?
- simonw - 4 hours ago
  Healthy!
  prefill: 30.91 t/s, generation: 29.58 t/s
  From https://gist.github.com/simonw/31127f9025845c4c9b10c3e0d8612...
  - embedding-shape - 3 hours ago
    
    Comparison with an RTX Pro 6000, with DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf:
    prefill: 121.76 t/s, generation: 47.85 t/s
    Main target seems to be Apple's Metal, so makes sense. Might be fun to see how fast one could make it go though :) The model seems really good too, even though it's in IQ2.
  - rtpg - an hour ago
    
    What are token speeds like for frontier models, if that gives a rough idea of how much "slower" slow is?
  - xienze - 3 hours ago
    
    I don't want to be a jerk, but 31 t/s prefill is basically unusable in an agentic situation. With a mere 10k tokens of context, you're sitting there for 5+ minutes before the first token is generated.
    
    fgfarben - 2 hours ago
    
    That prefill number isn't right. M4 Max hits 200-300: https://github.com/antirez/ds4/blob/main/speed-bench/m4_max_...
    
    aiscoming - 3 hours ago
    
    if it's just the coding agent system prompt and tools, you can cache that
    
    xienze - 3 hours ago
    
    Yeah the problem is that's just the start of the context. There's, you know, all the tool call results and file reads and stuff.
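
As a rough check on the prefill numbers in this subthread: time to first token is approximately the uncached prompt length divided by the prefill rate. A minimal Python sketch, assuming prefill dominates the latency and runs at a constant rate. The function name and the 2,000-token cached prefix are hypothetical, chosen only for illustration; the rates are the ones quoted in the comments above.

```python
def prefill_seconds(context_tokens: int, prefill_tps: float, cached_tokens: int = 0) -> float:
    """Approximate seconds spent on prefill before the first generated token.

    Assumes a constant prefill rate and that cached tokens cost nothing,
    which is a simplification.
    """
    if prefill_tps <= 0:
        raise ValueError("prefill_tps must be positive")
    uncached = max(context_tokens - cached_tokens, 0)
    return uncached / prefill_tps


# Prefill rates quoted in the thread (tokens per second).
rates = {
    "128GB M5 (simonw)": 30.91,
    "RTX Pro 6000 (embedding-shape)": 121.76,
    "M4 Max, low end of the 200-300 range (fgfarben)": 200.0,
}

for label, tps in rates.items():
    minutes = prefill_seconds(10_000, tps) / 60
    print(f"{label}: {minutes:.1f} min for a 10k-token context")

# Hypothetical: a 2,000-token system prompt and tool list is already cached.
cached = prefill_seconds(10_000, 30.91, cached_tokens=2_000) / 60
print(f"128GB M5 with 2k tokens cached: {cached:.1f} min")
```

At 30.91 t/s a 10k-token context works out to a bit over five minutes, which is consistent with the "5+ minutes" estimate. Caching a fixed prefix only helps in proportion to its share of the total context, so it matters less as tool results and file reads accumulate.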

sbinnee - 3 hours ago

It is a big thing for sure to have a competitive local agentic model. I've replaced Gemini 3 Flash Preview with DeepSeek V4 Flash for all of my personal use cases: chat, language learning, and even hobby coding. For coding, I couldn't get decent results before, no matter which of the latest SOTA models I used. It's not close to Opus or Codex models. It's a flash model and makes mistakes here and there (I just saw `from opentele while import trace`, new Python syntax!)

But I found its tool calling is more reliable than that of the other OSS models I tried. I assume that is attributable to interleaved thinking. Its reasoning effort is adjusted automatically per query. I enjoy reading these reasoning traces from open models because you can't see them from proprietary models.

I would love to try DS4, but I don't have a machine for it, so I will just stick to OpenRouter. I wish I can run a competitive oss model on 32GB machine in 3 years.

zozbot234 - 3 hours ago

> I wish I can run a competitive oss model on 32GB machine in 3 years.
You could try DS4 on that machine anyway and see how gracefully it degrades (assuming that it runs and doesn't just OOM immediately). Experimenting with 36GB/48GB/64GB would also be nice; they might be able to gain some compute throughput back by batching multiple sessions together (though obviously at the expense of speed for any single session).
kristianp - 2 hours ago

> I wish I can run a competitive oss model on 32GB machine in 3 years.
It's so hard to predict what size the open-weight models will be, even in 6 months' time. Will a 96GB machine turn out to be a complete waste of money? Who knows.
thegeomaster - 3 hours ago

> `from opentele while import trace`
FYI, this to me points to an inference bug, bad sampling, or a non-native quant. OpenRouter is known to route requests to absolutely terrible, borked implementations. A model like DeepSeek V4 Flash shouldn't be making syntax errors like this.

bjconlan - 4 hours ago

This is great! I feel the same way about the DeepSeek V4 architecture for commodity hardware.

Also have enjoyed playing with https://huggingface.co/HuggingFaceTB/nanowhale-100m-base (but early days for me understanding this space)

kamranjon - 3 hours ago

Very cool! I had no idea that HF was doing this - I really love their small model experiments.

kamranjon - 4 hours ago

Just want to mention that I've been pulling down and using DwarfStar locally and it's incredible. I actually have it running on my personal MacBook M4 Max with 128GB of RAM, and I am running the server to share it through Tailscale with my work laptop and just have pi running there.

The long context reasoning is something I haven't even seen in frontier models - I was running at 124k tokens earlier and it was still just buzzing along with no issues or fatigue.

I am amazed at how well it works. I'm using it right now for some pretty complex frontend work, and for me it is much, much faster than, for example, running a dense 27B or 31B model like Qwen or Gemma (the benefits of MoE). But the long-context capabilities are what have been absolutely flooring me.

Super excited about this project, and I hope Antirez can keep himself from burning out. I've been following the repo pretty closely; there are a ton of PRs flooding in, and it seems like he's had to do a lot of filtering out of slop code.

le-mark - 3 hours ago

Is DS4 DwarfStar 4 or DeepSeek 4?
- kamranjon - 3 hours ago
  
  Just updated! Sorry, I meant DwarfStar. It's the only way I've actually managed to run DeepSeek Flash on my local hardware.
  - zackify - an hour ago
    
    Are you on q2?
    
    kamranjon - 11 minutes ago
    
    Yea I'm on the imatrix q2 version now
- wolttam - 3 hours ago
  
  DwarfStar 4 is DeepSeek 4 (check the repo)

- 3 hours ago

[deleted]

brcmthrowaway - 3 hours ago

This guy is falling deep into Yegge-tier psychosis.

linkregister - 2 hours ago

Empirically, DS4 is hosting the DeepSeek v4 Flash model with good performance on home hardware. I'm curious how you came to this conclusion.
- dakolli - 2 hours ago
  
  "Empirically", have you tested this yourself?
  - linkregister - 3 minutes ago
    
    It's trivial to find reviews and benchmarks of DS4 online. Also, there are benchmarks in the article.
    Here's one of the top hits: https://forums.developer.nvidia.com/t/fully-custom-cuda-nati...
    Bizarre comment; sounds like "How do you know Porsches are fast? Did you drive one?"
fgfarben - 2 hours ago

Nope.

codedokode - 3 hours ago

I thought DeepSeek was closed-weights and proprietary? I wonder how it compares against Western open-weight models. The Hugging Face page contains the comparison only with proprietary models for some reason.

itishappy - 3 hours ago

DeepSeek has always been open-weight, and the DeepSeek HuggingFace page does not contain any comparisons. Where did you form these opinions?
- codedokode - 3 hours ago
  
  It contains comparisons: https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash
  - itishappy - 2 hours ago
    
    Just the first one then...
    Apologies. Where did I form my opinions?
- - 2 hours ago
  
  [deleted]
zozbot234 - 3 hours ago

Nemotron would be a comparable Western open model AIUI.