DeepSeek 4 Flash local inference engine for Metal

github.com

223 points by tamnd 6 hours ago


kgeist - 4 hours ago

Heh, I made something very similar for the Qwen3 models a while back. It only runs Qwen3, supports only some quants, loads from GGUF, and has inference optimized by Claude (in a loop). The whole thing is compact (just a couple of files) and easy to reason about. I made it for my students so they could tinker with it and learn (add different decoding strategies, add abliteration, etc.). Popular frameworks are large, complex, and harder to hack on, while educational projects usually focus on something outdated like GPT-2.
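Decoding strategies are exactly the kind of thing students can bolt on in a few lines. A minimal sketch (not from kgeist's project; the function name and NumPy-based approach are my own illustration) of greedy vs. temperature/top-k sampling over raw logits:

```python
import numpy as np

def sample_token(logits, temperature=1.0, top_k=None, rng=None):
    """Pick the next token id from raw logits.

    temperature=0 degenerates to greedy decoding; top_k, if set,
    restricts sampling to the k most likely tokens.
    """
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64)
    if temperature == 0:            # greedy: just take the argmax
        return int(np.argmax(logits))
    logits = logits / temperature
    if top_k is not None:           # mask everything outside the top k
        cutoff = np.sort(logits)[-top_k]
        logits = np.where(logits < cutoff, -np.inf, logits)
    probs = np.exp(logits - logits.max())   # stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```

Swapping this single function is enough to compare greedy, sampled, and truncated-sampling output side by side, which is the pedagogical point.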

Even though the project was meant to be educational, it gave me an idea I can't get out of my head: what if we started building ultra-optimized inference engines tailored to an exact GPU+model combination? GPUs are expensive and harder to get by the day. If you remove enough abstractions and code directly to the exact hardware/model, you can probably optimize things quite a lot (I hope). Maybe run an agent that tries to optimize inference in a loop (like autoresearch), empirically testing speed/quality.

The only problem with this is that once a model becomes outdated, you have to do it all again from scratch.
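The empirical loop here can be as dumb as a grid search over kernel parameters, keeping whatever measures fastest on the real hardware. A toy sketch (the `tile`/`simdgroups` parameters and the benchmark are made up for illustration; a real harness would run actual inference and return measured tokens/sec):

```python
import itertools

def autotune(benchmark, search_space):
    """Try every kernel configuration and keep the fastest.

    `benchmark(config)` is expected to run inference with that
    config and return a tokens/sec figure.
    """
    best_cfg, best_tps = None, float("-inf")
    keys = list(search_space)
    for values in itertools.product(*(search_space[k] for k in keys)):
        cfg = dict(zip(keys, values))
        tps = benchmark(cfg)
        if tps > best_tps:
            best_cfg, best_tps = cfg, tps
    return best_cfg, best_tps

# Stand-in benchmark: pretend 128-wide tiles with 4 simdgroups is optimal.
def fake_benchmark(cfg):
    return 100 - abs(cfg["tile"] - 128) / 8 - abs(cfg["simdgroups"] - 4)

best, tps = autotune(fake_benchmark,
                     {"tile": [64, 128, 256], "simdgroups": [2, 4, 8]})
```

An agent-driven version would just replace the exhaustive product with something smarter, but the measure-and-keep-the-winner loop is the same.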

maherbeg - 6 hours ago

This is so sick. I'm really curious to see what focused effort on optimizing a single open source model can look like over many months. Not only on the inference serving side, but also on the harness optimization side: building custom workflows to narrow the gap between what frontier models can infer and deduce and what open source models natively lack due to size, training, etc.

antirez - 4 hours ago

A random, funny, interesting and telling data point: while DS4 is generating tokens at full speed, my MacBook M3 Max peaks at 50W of power draw...

visarga - 4 hours ago

Large LLMs on a MacBook produce tokens at an acceptable speed, but the problem is reading context. Not incremental reading, like in a chat session, since that reuses the KV cache, but bulk reading, like when you paste a big file. It can take minutes.
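The asymmetry comes from the cache being append-only: a chat turn just adds a few rows, while a pasted file forces one attention pass per prompt token before the first output token appears (the "prefill"). A toy single-head sketch of that mechanism (my own illustration, not this engine's code):

```python
import numpy as np

def attend(q, K, V):
    """Scaled dot-product attention for one query vector."""
    scores = K @ q / np.sqrt(q.shape[0])
    w = np.exp(scores - scores.max())   # stable softmax
    w /= w.sum()
    return w @ V

class KVCache:
    """Append-only key/value store. Each new token attends over
    everything cached so far, so a short chat turn is cheap, but a
    big pasted prompt means running step() once per prompt token
    before any output token can be produced."""
    def __init__(self, dim):
        self.K = np.empty((0, dim))
        self.V = np.empty((0, dim))

    def step(self, k, v, q):
        self.K = np.vstack([self.K, k])   # cache grows by one row
        self.V = np.vstack([self.V, v])
        return attend(q, self.K, self.V)
```

That linear-in-prompt-length prefill pass, compute-bound rather than memory-bound, is the minutes-long wait on Apple Silicon.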

Havoc - 42 minutes ago

Was excited until I realized DS flash is still enormous. Oh well...glad it exists anyway & happy to see antirez still doing fun stuff

amunozo - 4 hours ago

I am curious about it producing fewer tokens except in max mode. I love DeepSeek V4 Flash and use it extensively; it's so cheap I can use it all day and still not exhaust my $10 OpenCode Go subscription. I always use it in max mode because of this, but now I wonder whether I should use high instead.

sourcecodeplz - 4 hours ago

Great project!

This is also a fine example of a vibe-coded project with purpose, as you acknowledged.

nazgulsenpai - 4 hours ago

I keep seeing DS4, and my brain interprets it, in that order, as Dark Souls 4 (sadface), DualShock 4, DeepSeek 4.

brcmthrowaway - 2 hours ago

How does this compare with oMLX?

happyPersonR - 4 hours ago

So just gonna ask a question, probably will get downvoted

I know this is flash, but….

But other than this guy, did our whole society seriously never flamegraph this stuff before we started requesting nuclear reactors colocated at data centers and, like, more than 10% of GDP?

Someone needs to answer, because this isn't even an M4 or M5... WHAT THE FUCK

Sidenote: shout out antirez love my redis :)