Why DeepSeek is cheap at scale but expensive to run locally
seangoedecke.com
328 points by ingve 10 months ago
I run Deepseek V3 locally as my daily driver and I find it affordable, fast and effective. The article assumes a GPU, which in my opinion is not the best way to serve large models like this locally. I run a mid-range EPYC 9004 series based home server on a Supermicro mobo, which cost all-in around $4000. It's a single-CPU machine with 384GB RAM (you could get 768GB using 64GB sticks but this costs more). No GPU means power draw is less than a gaming desktop. With the RAM limitation I run an Unsloth Dynamic GGUF which, quality-wise, performs very close to the original in real-world use. It is around 270GB, which leaves plenty of room for context - I run 16k context normally as I use the machine for other things too, but can up it to 24k if I need more. I get about 9-10 tokens per second, dropping to 7 tokens/second with a large context. There are plenty of people running similar setups with 2 CPUs who run the full version at similar tokens/second.
> Unsloth Dynamic GGUF which, quality wise in real-world use performs very close to the original
How close are we talking?
I’m not calling you a liar OP, but in general I wish people perpetuating such broad claims would be more rigorous.
Unsloth does amazing work, however as far as I’m aware even they themselves do not publish head to head evals with the original unquantized models.
I have sympathy here because very few people and companies can afford to run the original models, let alone engineer rigorous evals.
However I felt compelled to comment because my experience does not match. For relatively simple usage the differences are hard to notice, but they become much more apparent in high complexity and long context tasks.
Oh hey :) Thanks for the kind words - we did provide benchmarks (MMLU, KLD, Perplexity) for Llama 4 Scout, Gemma 3 27B using our methodology - https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs and https://x.com/UnslothAI/status/1915476692786962441
For R1 specifically, we did an internal benchmark on the original model - https://unsloth.ai/blog/deepseekr1-dynamic
For R1-0528 specifically on evals - we're still running them :)) It's quite expensive to run, so we first do "vibe check" on some internal test cases, and they do pretty well!
But we generally stress the bug fixes that we do, which objectively increase performance by +1 to sometimes +10% accuracy - for example Llama 4 bug fixes, Gemma bug fixes - https://news.ycombinator.com/item?id=39671146 etc are much more important :)
We also provide Q8_0 and Q8_K_XL quants, which are mostly equivalent to FP8 - you can also use the magical `-ot ".ffn_.*_exps.=CPU"` incantation to offload MoE layers to RAM!
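For example, a llama-server invocation with that offload might look roughly like this (a sketch only: the model filename and context size are placeholders, and the flags assume a recent llama.cpp build):

```shell
# Sketch: serve a dynamic GGUF with llama.cpp, keeping the MoE expert
# tensors in system RAM while everything else goes to the GPU.
# The model filename and context size are placeholders, not recommendations.
./llama-server \
  -m DeepSeek-R1-UD-IQ2_XXS.gguf \
  -ngl 99 \
  -ot ".ffn_.*_exps.=CPU" \
  -c 16384
```

`-ngl 99` tries to put all layers on the GPU, and the `-ot` override then pins just the expert tensors back to CPU RAM, which is where most of a MoE model's weight bytes live.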
> All Distilled and the original R1 versions seem to have accidentally assigned the padding token to <|endofsentence|>, which is mostly not a good idea, especially if you want to further finetune on top of these reasoning models. This will cause endless infinite generations, since most frameworks will mask the EOS token out as -100.
I couldn't tell if this was an error in the code running the model or in the model weights themselves; if/assuming the former, are these fixes being upstreamed to anywhere?
You are right that I haven't been rigorous - it's easy to benchmark tokens/second but quality of output is more difficult to nail down. I couldn't find any decent comparisons for Unsloth either. So I just tried a few of their models out, looking for something that was 'good enough' i.e. does all I need: coding, summarizing documents, troubleshooting anything and everything. I would like to see head to head comparisons too - maybe I will invest in more RAM at some stage but so far I have no need for it. I ran some comparisons between the smaller and larger versions of the Unsloth models and interestingly (for me anyway) didn't notice a huge amount of difference in quality between them. But, the smaller models didn't run significantly faster so I settled for the biggest model I could fit in RAM with a decent context. For more complex coding I use Deepseek R1 (again the Unsloth) but since it's a reasoning model it isn't real-time so no use as my daily driver.
Thanks for using our quants and appreciate it :) - We're still doing internal benchmarks since they're very slow to do - but they definitely pass our internal benchmarks :)
Thank you for making the dynamic quantisations! My setup wouldn't be possible without them and for my personal use, they do exactly what I need and are indeed excellent.
How do you find the quality of the output compares to that of, say, o3 or Sonnet 4?
To be honest I haven't used o3 or Sonnet as the code I work with is my own proprietary code which I like to keep private, which is one reason for the local setup. For troubleshooting day to day things I have found it at least as good as the free in-browser version of ChatGPT (not sure which model it uses).
I am impressed. Your personal website is down. HN doesn't allow private messages.
I'm Jeff Carr. I co-founded DigitalOcean. I assume I can't post email addresses here, but I will try. Let's see how smart things are about banning me. I am: wit AT wit com
The state of the art for local models is even further along.
For example, look into https://github.com/kvcache-ai/ktransformers, which achieve >11 tokens/s on a relatively old two socket Xeon servers + retail RTX 4090 GPU. Even more interesting is prefill speed at more than 250 tokens/s. This is very useful in use cases like coding, where large prompts are common.
The above is achievable today. In the meantime, Intel guys are working on something even more impressive. In https://github.com/sgl-project/sglang/pull/5150 they claim that they achieve >15 tokens/s generation and >350 tokens/s prefill. They don't share what exact hardware they run this on, but from various bits and pieces over various PRs I reverse-engineered that they use 2x Xeon 6980P with MRDIMM 8800 RAM, without GPU. Total cost of such a setup will be around $10k once cheap engineering samples hit eBay.
It's neither impressive nor efficient once you consider batch sizes > 1.
All of this is for batch size 1.
I know. That was my point.
Throughput doesn't scale on CPU as well as it does on GPU.
We both agree. Batch size 1 is only relevant to people who want to run models on their own private machines. Which is the case of OP.
Pretty sure you can post email addresses here, this is mine: saagar@saagarjha.com. It's more about avoiding spam.
You can post emails fine, you just might get spammed (because it's a public forum).
fyi, your website is also down... wit.com doesn't resolve for me
Bold of you to assume that an email domain needs a web server listening on port 80 for http packets..
The latest V3 strikes me as a really practical go-to among open-weights models. Lots of tasks don't need the reasoning tokens, and not having to wait for them is nice. (If something does need it you can always switch.) If you're not running it yourself a couple providers have it with full context, 80tps, and a promise not to use your data.
9004 home server is awesome!
Impressive. I need to look more into this. I'm doing my best to limit my LLM usage to what I can run locally.
What's your prompt processing speed? That's more important in this situation than output TPS. If you have to wait minutes to start getting an answer, that makes it much worse than a cloud-hosted version.
Prompt eval time varies a lot with context but it feels real-time for short prompts - approx 20 tokens per second but I haven't done much benchmarking of this. When there is a lot of re-prompting in a long back and forth it is still quite fast - I do use KV cache which I assume helps and also quantize the KV cache to Q8 if I am running contexts above 16k. However, if I want it to summarize a document of say 15,000 words it does take a long time - here I walk away and come back in about 20 minutes and it will be complete.
If he is doing multiturn conversations, he can reuse the kv cache from the last turn and skip the prompt processing on the history that would make time to first token too slow, by only doing prompt processing on his actual prompt for the current turn. This turns a quadratic amount of tokens to process into a linear number. I am not sure if this is what he is doing, but that is what I would do if I had his hardware.
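A toy sketch of the difference (the turn lengths are made up for illustration):

```python
# Sketch: count how many prompt tokens get (re)processed over a multi-turn
# chat, with and without reusing the KV cache from previous turns.
def tokens_processed(turn_lengths, reuse_kv_cache):
    total = 0
    history = 0
    for turn in turn_lengths:
        if reuse_kv_cache:
            # Only the new turn needs prompt processing; history is cached.
            total += turn
        else:
            # The whole conversation so far is reprocessed every turn.
            total += history + turn
        history += turn
    return total

turns = [500] * 10  # ten turns of ~500 prompt tokens each

print(tokens_processed(turns, reuse_kv_cache=False))  # 27500: quadratic growth
print(tokens_processed(turns, reuse_kv_cache=True))   # 5000: linear growth
```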
I assume KV caching makes this a non issue, but I'm also curious.
If you're just chatting with it starting with "Hi", that's correct. The conversation remains in the KV cache as it grows gradually.
But if you're posting code, writing drafts, or even small snippets of articles, etc in there it becomes a huge problem.
Usually, when people think about the prompt tokens for a chat model, the initial system prompt is the vast majority of the tokens and it's the same regardless for many usage modes. You might have a slightly different system prompt for code than you have for English or for chatting, but that is 3 prompts which you can permanently put in some sort of persistent KV cache. After that, only your specific request in that mode is uncached.
I use a dual-socket 18-core (so 36 total) xeon with 768GB of DDR4, and get about 1.5-2 tokens/sec with a 4-bit quantized version of the full deepseek models. It really is wild to be able to run a model like that at home.
Dumb question: would something like this have a graphics card too? I assume not
Yeah, it was just a giant HP workstation - I currently have 3 graphics cards in it (but only 40GB total of VRAM, so not very useful for deepseek models).
impressive, but that's 1/5 to 1/10 of the throughput that you'd get with a hosted provider, with 1/4 to 1/8 the supported context
It might be 5 to 10 times slower than a hosted provider but that doesn't really matter when the output is still faster than a person can read. Context wise, for troubleshooting I have never needed over 16k and for the rare occasion when I need to summarise a very large document I can change up the model to something smaller and get a huge context. I have never needed more than 32k though.
Dude he's running locally, and I think this setup is the best bang for the buck if you wanna run locally, we're not comparing to data centers, you gotta keep it in perspective. That's very impressive results for running local. Thanks for the numbers you saved me a chatgpt search :)
Title says: locally it's expensive
Other person says: I had to spend 4000$ and it's still slow
Not to mention that $4000 is in fact expensive. If anything the OP really makes the point of the articles title.
CPU-only is really terrible bang for your buck, and I wish people would stop pushing these impractical builds on people genuinely curious in local AI.
The KV cache won't soften the blow the first time they paste a code sample into a chat and end up waiting 10 minutes with absolutely no interactivity before they even get first token.
You'll get an infinitely more useful build out of a single 3090 and sticking to stuff like Gemma 27B than you will out of trying to run Deepseek off a CPU-only build. Even a GH200 struggles to run Deepseek at realistic speeds with bs=1, and there's an entire H100 attached to CPU there: there just isn't a magic way to get "affordable fast effective" AI out of a CPU offloaded model right now.
The quality on Gemma 27B is nowhere near good enough for my needs. None of the smaller models are.
And that's fine, but the average person asking is already willing to give up some raw intelligence going local, and would not expect the kind of abysmal performance you're likely getting after describing it as "fast".
I setup Deepseek bs=1 on a $41,000 GH200 and got double digit prompt processing speeds (~50 tk/s): you're definitely getting worse performance than the GH200 was, and that's already unacceptable for most users.
They'd be much better served spending less money than you had to spend and getting an actually interactive experience, instead of having to send off prompts and wait several minutes to get an actual reply the moment the query involves any actual context.
So, in your opinion, hardware wise, as a general purpose tinkering/learning self lab hardware, how would you grade the decked out framework desktop for 2.7k?
I thought GPUs with a lot of extremely fast memory was required for inference. Are you saying that we can accomplish inference with just a large amount of system memory that is non-unified and no GPU? How is that possible?
Basically it comes down to the memory bandwidth of server CPUs being decent. A bit of an oversimplification here, but... the model and context have to be pulled through RAM (or VRAM) every time a new token is generated. CPUs designed for servers with lots of cores have decent bandwidth - up to around 460GB/s with the EPYC 9004 series, which can use its 12 memory channels simultaneously. So, in theory, they can pull roughly 460GB through the system every second. GPUs are faster, but you also have to fit the entire model and context into VRAM, so for larger models they are extremely expensive: a decent consumer GPU only has 24GB of VRAM and costs silly money if you need 20 of them. Whereas you get a lot of RDIMM RAM for a couple thousand bucks, so you can run bigger models, and 460GB/s gives output faster than most people can read.
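As a back-of-the-envelope check of the setup described earlier in the thread (the active parameter count, quant size, and bandwidth figures are all approximate):

```python
# Back-of-the-envelope: memory bandwidth bounds tokens/second for
# batch-size-1 decoding. Numbers are approximate, taken from this thread.
total_params  = 671e9    # DeepSeek V3 total parameters
active_params = 37e9     # parameters actually touched per token (MoE)
quant_size_gb = 270      # on-disk size of the dynamic quant
bandwidth_gbs = 460      # theoretical 12-channel DDR5-4800 bandwidth

bytes_per_param = quant_size_gb * 1e9 / total_params   # ~0.4 bytes (~3.2 bits)
bytes_per_token = active_params * bytes_per_param      # ~15 GB read per token

max_tps = bandwidth_gbs * 1e9 / bytes_per_token
print(round(max_tps, 1))  # ~31 tokens/s upper bound; 9-10 real-world fits
```

Real systems get well under the theoretical bound, so the reported 9-10 tokens/second is consistent with being purely bandwidth-limited.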
I’m confused as to why you think a GPU is necessary? It’s just linear algebra.
most likely he was referring to the fact that you need plenty of GPU-fast memory to keep the model in, and GPU cards have it.
There is nothing magical about GPU memory though. It’s just faster. But people have been doing CPU inference since the first llama code came out.
Do you have hard numbers on the idle/average/max power draw? I assumed that server machines are built as if they are going to red-lined constantly so put less effort into low-utilization optimizations.
No hard numbers I'm afraid in that I don't monitor the power draw. But the machine uses a standard ATX power supply: a Corsair RM750e 750W PSU and the default TDP of the CPU is 280W - I have my TDP set at 300W. It is basically built like a desktop - ATX form factor, fans spin down at idle etc.
Approximation is still better than I was expecting. You said supermicro and I was assuming a pizza box with dual power supplies sucking down 1kw at idle. That it can run with a large, but not unreasonable PSU says enough.
Can we run Deepseek using Ollama or something similar for code generation like Github copilot on a 40 core CPU with about 256GB RAM say 200 GB usable for the model?
Just curious what your use cases are? What type of texts are you producing?
Thank you.
I've always wondered this as well, and never seem to get an answer. Why would someone want to do this when they can get a better result either renting in the cloud, or just using a subscription?
Obviously I see the value in having something local from a control and privacy perspective, but it's surely always a net loss in terms of quality and capability of output, right?
Coding, my own proprietary code hence my desire for local hosting, a decent amount of legacy code. General troubleshooting of anything and everything from running Linux servers to fixing my car. Summarizing and translation of large documents occasionally. Also, image generation and other automations but obviously not LLMs for this.
Terrific, thank you.
If you don't mind another question, how do you adapt the LLM to your codebase? Keep the whole thing in context? Fine tune on your own code? Fine tune on lots of code in whatever language you're using (e.g. Python, Rust)? Just rely on the original model training?
Thank you very much!
CPUs are quietly becoming very well-balanced machines for BS 1 inference. The latest Intel Xeons should be at ~20 TPS.
A base Mac Mini is ~20 :)
Oh yeah, I did that math not assuming any quantization. I think if you can get a 3-4 bit quant working + int8 math, ~80 might be achievable.
This is an interesting blogpost. While the general conclusion ("We need batching") is true, inference of mixture of experts (MoE) models is actually a bit more nuanced.
The main reason we want big batches is that LLM inference is not limited by compute, but by loading every single weight out of VRAM. Just compare the number of TFLOPS of an H100 with its memory bandwidth: there's basically room for ~300 FLOPs per byte loaded. So that's why we want big batches: we can perform a lot of operations per parameter/weight that we load from memory. This kind of analysis is often referred to as the "roofline model".
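A back-of-the-envelope version of that comparison (the datasheet numbers are approximate):

```python
# Roofline sketch: an H100 SXM does roughly 989 TFLOPS of dense BF16 compute
# against ~3.35 TB/s of HBM3 bandwidth (approximate datasheet numbers).
compute_flops = 989e12
bandwidth_bps = 3.35e12

flops_per_byte = compute_flops / bandwidth_bps
print(round(flops_per_byte))  # 295: the ~300 FLOPs-per-byte break-even point

# At batch size 1, each weight byte loaded supports only ~2 FLOPs (one
# multiply-accumulate per weight per token), so decoding is hopelessly
# bandwidth-bound; batching B requests raises that to roughly 2*B.
```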
As models become bigger, this does not scale anymore because the model weights will not fit into GPU memory anymore and you need to distribute them across GPUs or across nodes. Even with NVLink and Infiniband, these communications are slower than loading from VRAM. NVlink is still fine for tensor parallelism, but across nodes this is quite slow.
So what MoE allows is expert parallelism, where different nodes keep different experts in memory and don't need to communicate as much between nodes. This only works if there are enough nodes to keep all experts in VRAM and have enough overhead for other stuff (KV cache, other weights, etc). So naturally the possible batch size becomes quite large. And of course you want to maximize this to make sure all GPUs are actually working.
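A toy sketch of the dispatch (the expert and node counts, and the gating function, are made up for illustration; a real router is learned):

```python
# Sketch of expert-parallel dispatch: each node owns a subset of experts,
# and tokens in a batch are routed to the node holding their expert.
from collections import defaultdict

NUM_EXPERTS = 8
NUM_NODES = 4
experts_per_node = NUM_EXPERTS // NUM_NODES  # experts 0-1 on node 0, etc.

def gate(token_id):
    # Toy stand-in for the learned router: pick one expert per token.
    return token_id % NUM_EXPERTS

def dispatch(token_ids):
    per_node = defaultdict(list)
    for t in token_ids:
        node = gate(t) // experts_per_node
        per_node[node].append(t)
    return dict(per_node)

batch = list(range(16))
print(dispatch(batch))
# Each node receives only the tokens for its own experts; just these small
# token activations cross the network, never the expert weights themselves.
```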
You could load different "experts" in a round-robin way on a single node and only aggregate "batches" opportunistically, when you just have multiple requests in-flight that all happen to rely on the same "expert". The difference being that instead of "batches", you would only really have queues. Of course this would come with a sizeable increase in latency, but that's acceptable for many applications (such as for "deep research" workflows)
This is very much like Erlang's actor model. The same compute can be run in parallel, or managed via queues. With Erlang's strong support for FFI and process control, I wonder if it's being used as a dispatcher for these sorts of workloads.
> As models become bigger, this does not scale anymore because the model weights will not fit into GPU memory anymore and you need to distribute them across GPUs or across nodes. Even with NVLink and Infiniband, these communications are slower than loading from VRAM. NVlink is still fine for tensor parallelism, but across nodes this is quite slow.
Inference works by computing a layer and then sending a very small vector to the next layer as input. When a model does not fit in a single GPU, you just divide it into layers and send the vector over a fabric to the GPU holding the next layer. The transfer happens so quickly that there is a negligible amount of idle time before the next layer can be computed. The fastest inference on the planet, at Cerebras, uses this technique to do 2,500 tokens/sec on Llama 4 Maverick.
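Rough numbers show why shipping the activation vector is so cheap relative to moving weights (figures loosely based on a DeepSeek-V3-scale model: hidden size 7168, 61 layers, ~37B active parameters, ~1 byte per weight):

```python
# Sketch: per-hop activation traffic vs. per-layer weight size.
hidden_size = 7168                   # DeepSeek V3's model dimension
bytes_per_value = 2                  # bf16 activations

activation_bytes = hidden_size * bytes_per_value   # per token, per hop
layer_weight_bytes = 37e9 / 61 * 1.0               # active weights per layer
                                                   # at ~1 byte per parameter

print(activation_bytes)                            # 14336 bytes (~14 KB)
print(round(layer_weight_bytes / activation_bytes))  # weights ~40,000x larger
```

So a ~14 KB transfer per token per hop is negligible next to re-reading hundreds of megabytes of layer weights, which is why pipelined layer-to-layer handoff can work at all.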
Groq and Cerebras both take a big chip approach to architecture and, at least in the case of Groq, they only make economic sense under high batch loads.
There is nothing big about Groq's chips. Their individual chips have only 230 MB RAM. Unlike Cerebras, which can load multiple layers into a single chip, Groq must divide a layer across many chips.
Distributing inference per layer, instead of splitting each layer across gpus, is indeed another approach, called pipeline parallelism. However, per batch there is less compute (only 1 gpu at a time), so inference is slower. In addition, the orchestration of starting the next batch on gpu #0 while gpu #1 starts is quite tricky. For this reason, tensor parallelism as I described is way more common in LLM inference.
could such a network with all its nodes and weights be deployed to an analog circuit and be superfast?
Please go into more detail about this proposal, this piqued my interest in a really strange way.
The idea is to replicate the weights of the network in the electronics. Somehow like our brains work? This way an analog input signal could lead to a neural network processed output signal without the digital emulation on an gpu. As this is very much simplified, the question is if this could work for modern llms?
Suddenly "temperature" parameter starts making sense
(If you ever tried fine-tuning an analog circuit, you'll know how finicky the process is due to the environment, including temperature.)
And this is the investment case for AMD, models fit entirely in a single chassis, and side benefit: less tariffed network equipment to interconnect compute. Map/reduce instead of clustered compute.
Edit: when downvoting, please offer some insight why you disagree
How is that a unique advantage for AMD?
AMD is consistently stacking more HBM.
H100 80GB HBM3
H200 141GB HBM3e
B200 192GB HBM3e
MI300x 192GB HBM3
MI325x 256GB HBM3e
MI355x 288GB HBM3e
This means that you can fit larger and larger models into a single node, without having to go out over the network. The memory bandwidth on AMD is also quite good.
It really does not matter how much memory AMD has if the drivers and firmware are unstable. To give one example from last year:
https://www.tomshardware.com/pc-components/gpus/amds-lisa-su...
They are currently developing their own drivers for AMD hardware because of the headaches that they had with ROCm.
"driver" is such a generic word. tinygrad works on mi300x. If you want to use it, you can. Negates your point.
Additionally, ROCm is a giant collection of a whole bunch of libraries. Certainly there are issues, as with any large collection of software, but the critical thing is whether or not AMD is responsive towards getting things fixed.
In the past, it was a huge issue, AMD would routinely ignore developers and bugs would never get fixed. But, after that SA article, Lisa lit a fire under Anush's butt and he's taking ownership. It is a major shift in the entire culture at the company. They are extremely responsive and getting things fixed. You can literally tweet your GH issue to him and someone will respond.
What is true a year ago isn't today. If you're paying attention like I am, and experiencing it first hand, things are changing, fast.
I have been hearing this about AMD/ATI drivers for decades. Every year, someone says that it is fixed, only for new evidence to come out that they are not. I have no reason to believe it is fixed given the history.
Here is evidence to the contrary: If ROCm actually was in good shape, tinygrad would use it instead of developing their own driver.
You're conflating two different things.
ROCm isn't part of AMD's drivers; it's a software library that helps you support legacy compute APIs and stuff in the BLAS/GEMM/LAPACK end of things.
The part of ROCm you're interested in is HIP; HIP is the part that does legacy CUDA emulation. HIP will never be complete because Nvidia keeps adding new things, documents things wrong, and also the "cool" stuff people do on Nvidia cards aren't CUDA and it is out of scope for HIP to emulate PTX (since that is strongly tied to how historical Nvidia architectures worked, and would be entirely inappropriate for AMD architectures).
The whole thing with Tinygrad's "driver" isn't a driver at all; it's the infrastructure to handle card-to-card ccNUMA on PCI-E-based systems, which AMD does not support: if you want that, you buy into the big boy systems that have GPUs that communicate using Infinity Fabric (which is, itself, the HyperTransport protocol over PCI-E PHY instead of over HyperTransport PHY; PCI over PCI-E has no ability to handle ccNUMA meaningfully).
Extremely few customers, AMD's or not, want to share VRAM directly over PCI-E across GPUs since most PCI-E GPU customers are single GPU. Customers that have massive multi-GPU deployments have bought into the ecosystem of their preferred vendor (ie, Nvidia's Mellanox-powered fabrics, or AMD's wall-to-wall Infinity Fabric).
That said, AMD does want to support it if they can, and Tinygrad isn't interested in waiting for an engineer at AMD to add it, so they're pushing ahead and adding it themselves.
Also, part of Tinygrad's problem is they want it available from ROCm/HIP instead of a standards compliant modern API. ROCm/HIP still has not been ported to the modern shader compiler that the AMD driver uses (ie, the one you use for OpenGL, Vulkan, and Direct family APIs), since it originally came from an unrelated engineering team that isn't part of the driver team.
The big push in AMD currently is to unify efforts so that ROCm/HIP is massively simplified and all the redundant parts are axed, so it is purely a SPIR-V code generator or similar. This would probably help projects like Tinygrad someday, but not today.
> ROCm isn't part of AMD drivers, its a software library that helps you support legacy compute APIs and stuff in the BLAS/GEMM/LAPACK end of things.
AMD says otherwise:
> AMD ROCm™ is an open software stack including drivers, development tools, and APIs that enable GPU programming from low-level kernel to end-user applications.
https://www.amd.com/en/products/software/rocm.html
The issues involving AMD hardware not only applied to the drivers, but to the firmware below the drivers:
https://www.tomshardware.com/pc-components/gpus/amds-lisa-su...
Tinygrad’s software looks like a userland driver:
https://github.com/tinygrad/tinygrad/blob/master/tinygrad/ru...
It loads various firmware blobs, manages part of the initialization process, manages memory, writes to registers, etcetera. These are all things a driver does.