Nvidia greenboost: transparently extend GPU VRAM using system RAM/NVMe

gitlab.com

450 points by mmastrac 4 days ago

0xbadcafebee - 21 hours ago

You can already do this with some GPU drivers:

  GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amdttm.pages_limit=5242880 ttm.pages_limit=5242880"
One downside is your kernel isn't going to reserve that memory away from userland. You will still see all the memory at system level as "free". As the GPU driver starts using it, other apps/the OS will try to use the "free" memory, not knowing how much of it is in use (it may show up as "cache", or not at all). Then the OOM killer starts firing or programs start crashing, and at some point the OS tips over or the GPU driver crashes. You can add loads of swap as a compromise and it works okay, if a bit slow.
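
For scale, `pages_limit` counts 4 KiB pages, so the value in that GRUB line works out as follows (my arithmetic, not from any driver documentation):

```python
# pages_limit is counted in 4 KiB pages on x86-64
pages_limit = 5242880            # value from the GRUB line above
page_size = 4096                 # bytes per page
gib = pages_limit * page_size / 2**30
print(gib)                       # 20.0 -> the driver may borrow up to 20 GiB of system RAM
```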

In any case, loading a gigantic model just to use system RAM is absurdly slow (due to mem bandwidth), like 1-5 t/s, so it's not practical. It'd take a whole day to process one 86k token request. Just pay a cloud provider $0.01 to do it in 10 seconds.
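
The "whole day" figure checks out at the low end of that throughput range:

```python
tokens = 86_000                  # size of the request quoted above
for rate in (1, 5):              # the 1-5 t/s range for RAM-bound decoding
    hours = tokens / rate / 3600
    print(f"{rate} t/s -> {hours:.1f} h")
# 1 t/s -> 23.9 h (a full day); 5 t/s -> 4.8 h
```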

rnrn - 4 hours ago

Why is there a new kernel driver here at all? It appears that all it does is allocate system RAM ("DDR4") and export it as a dmabuf for import into CUDA as mapped external memory. Then a userspace shim hijacks APIs to use that if GPU memory is full. CUDA already supports allocating mapped system memory, so AFAICT this could be implemented in the userspace shim with no new kernel driver.
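
For illustration, the spill-on-full policy such a shim implements can be modeled as a toy two-tier allocator (entirely hypothetical code, not the project's API; the real shim interposes on CUDA allocation calls):

```python
class TieredAllocator:
    """Toy sketch: prefer device memory, spill to a host pool when full."""
    def __init__(self, vram_bytes, host_bytes):
        self.free = {"vram": vram_bytes, "host": host_bytes}

    def alloc(self, nbytes):
        for tier in ("vram", "host"):        # try the fast tier first
            if self.free[tier] >= nbytes:
                self.free[tier] -= nbytes
                return tier
        raise MemoryError("both pools exhausted")

gpu = TieredAllocator(vram_bytes=8 << 30, host_bytes=32 << 30)
print(gpu.alloc(6 << 30))   # vram  (fits in the 8 GiB device pool)
print(gpu.alloc(4 << 30))   # host  (only 2 GiB of VRAM left, spills to RAM)
```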

As other commenters have mentioned, redirecting allocations to managed memory would also enable similar oversubscription.

And the hijack approach only makes sense for giving apps this behavior with no changes; the same could be done with minor app changes (e.g. PyTorch has a pluggable allocator interface). App changes also enable intentionally placing specific allocations.

My impression is that this is vibe-coded from beginning to end, starting from a design that only makes sense if you are hallucinating.

daneel_w - 21 hours ago

Related, a couple of years ago: https://old.reddit.com/r/Amd/comments/15t0lsm/i_turned_a_95_...

"I turned a $95 AMD APU into a 16GB VRAM GPU and it can run stable diffusion!"

nl - 20 hours ago

This is really interesting engineering, but I agree with the other commenters that the benchmarking makes it hard to understand what each factor contributes.

The ExLlamaV3 EXL3 2bpw (8 GB, full VRAM) row is an order of magnitude faster than the baseline - but the baseline seems to be the 32GB model running with the KV cache shared to system memory only (I think?)

But if an 8 GB model gives sufficient quality then it seems like that would have worked without the shared memory thing?

I think the useful apples-to-apples benchmark is currently the Ollama + GreenBoost shim (baseline) (2-5 tps) vs ExLlamaV3 + GreenBoost cache (8–20 tps) comparison.
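
Taking those two rows at face value, the implied speedup range is wide (my arithmetic on the numbers quoted above):

```python
baseline = (2, 5)            # Ollama + GreenBoost shim, t/s
exl3 = (8, 20)               # ExLlamaV3 + GreenBoost KV cache, t/s
lo = exl3[0] / baseline[1]   # worst case: 8 / 5  = 1.6x
hi = exl3[1] / baseline[0]   # best case: 20 / 2 = 10.0x
print(lo, hi)
```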

It would be really useful to see this compared with the existing llama.cpp CPU/memory offload. There is a note at the start ("Offload layers to CPU — works, but drops token/s by 5–10× because CPU RAM has no CUDA coherence") - but it is unclear whether that 5-10x token-speed drop is relative to running a model completely on the GPU or to the greenboost approach.

I think it is vs GPU, in which case it seems likely the performance is similar to what greenboost is giving but probably much more stable.

aruametello - 7 hours ago

Post-traumatic "nvidia TurboCache" disorder triggered.

https://en.wikipedia.org/wiki/TurboCache

(Not the same thing 1:1, but worth the joke anyway)

yjtpesesu2 - 21 hours ago

How does this differ from anything llama.cpp offers, regarding offloading layers? The repo consistently refers to "DDR4". Is there a reason DDR5 won't work with this?

Havoc - 20 hours ago

> The best strategy is to shrink the model until it fits — either with EXL3 quantization or ModelOpt PTQ — and use GreenBoost's DDR4 pool for KV cache only.

Does this make sense? I'd have thought the KV cache is guaranteed to be used 100% of the time, while, say, in a MoE the same can't be said of the weights.

Though I suppose if you're shooting for huge context then having that allocation go into RAM makes sense, especially when it's allocated but not used yet.
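
For a sense of scale, here is a back-of-envelope KV-cache size for a hypothetical 32-layer model with 8 KV heads (all shapes assumed, not taken from the repo):

```python
layers, kv_heads, head_dim = 32, 8, 128   # assumed model shape (GQA)
ctx = 131_072                             # target context length, tokens
bytes_per_elem = 2                        # fp16
kv_bytes = 2 * layers * kv_heads * head_dim * ctx * bytes_per_elem  # K and V
print(kv_bytes / 2**30)   # 16.0 GiB: at long context the cache alone dwarfs many GPUs
```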

ma2kx - 20 hours ago

The physical bottleneck to system memory remains. Therefore, I assume that better results are achieved by manually adjusting which layers are offloaded.

I would prefer to use system memory to cache different models, focusing on things like embedding, rerankers, and TTS. This is sufficient to run a more complex RAG locally, for example, via Mem0, and then use a larger LLM via the cloud.

ninjagoo - 8 hours ago

This is awesome! Normally, offloading layers to the CPU RAM means that the compute for those layers occurs on the CPU instead of the GPU, generally speaking. The CPU is orders of magnitude slower than the GPU.

With this approach the compute occurs on the GPU, with the tradeoff that layers in RAM have to be moved back-and-forth through PCI-DMA. It seems to me that this should offer a speedup vs compute split between GPU and CPU. The amount of speedup will depend on how many layers would have been on CPU compute, minus the reduction due to moving those layers between RAM and the GPU.

What's slower? Compute on the CPU or moving data from RAM to GPU through PCI-DMA?
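
Both options are bandwidth-bound, so a back-of-envelope comparison (all bandwidth figures assumed, not measured) suggests streaming weights over PCIe is not automatically faster than letting the CPU read them from RAM:

```python
offloaded = 8e9    # bytes of weights resident in system RAM (assumed)
pcie = 25e9        # PCIe 4.0 x16 effective bandwidth, B/s (assumed)
ddr = 50e9         # dual-channel DDR4 read bandwidth, B/s (assumed)
t_dma = offloaded / pcie   # GPU compute, weights streamed over DMA each token
t_cpu = offloaded / ddr    # CPU compute, reading weights straight from RAM
print(t_dma, t_cpu)        # 0.32 s vs 0.16 s per token pass
```

On these assumed numbers the CPU path wins for weights that must be touched every token, which may be why the project steers users toward offloading only the KV cache.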

152334H - 11 hours ago

Nobody mentioning how this project is vibecoded slop?

  > The code is really bad with completely uneeded parts. The LLM (Qwen 2.5 7B) has hardcoded the i9 14700KF topology, and has variables related to it never used... It's even funnier that the show hardware function always prints the same string. There are even random pip log files. Why did this slop got coverage here?
https://www.phoronix.com/forums/forum/linux-graphics-x-org-d...

yjftsjthsd-h - a day ago

Previously: https://news.ycombinator.com/item?id=47384557

(Still cool, still would benefit from better benchmarks)

armada651 - 18 hours ago

Doesn't Windows already do this by default? I can already run models bigger than my GPU VRAM and it will start using up to 50% of my system RAM as "shared memory". This is on a Desktop PC without a shared memory architecture.

wewewedxfgdf - 8 hours ago

Why don't they just put RAM slots on the card so you can augment the fast RAM?

dwroberts - 10 hours ago

The title here needs changing: this is for nvidia cards, but it is not an official project and has nothing to do with them.

(Feels especially deceptive when there is another top story right now with the headline "nvidia nemoclaw", which is an official project.)

Insanity - 19 hours ago

Extend your VRAM using RAM, then extend your RAM using Swap.

paultendo - 21 hours ago

Could be a very useful way to do some overnight tasks using spare RAM. Possibly things like LLM-based categorisation, labelling, data cleansing. That's what comes to mind for me anyway.

yalogin - 5 hours ago

Is there a use case for this today? Feels more like nvidia is priming the software hoping system designers will find ways to use it.

bguberfain - 6 hours ago

"A watchdog kernel thread monitors RAM and NVMe pressure and signals userspace before things get dangerous." - what kind of danger can this type of solution pose?
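
The danger is presumably the scenario 0xbadcafebee describes upthread: borrowed RAM still looks "free" to the rest of the system, until the OOM killer starts shooting. A userspace sketch of the watchdog idea (thresholds hypothetical; the real driver presumably watches its own pools) could poll /proc/meminfo:

```python
def mem_pressure(meminfo_text, warn_ratio=0.10):
    """Return (available_ratio, warning) parsed from /proc/meminfo-style text."""
    kb = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if rest.strip():
            kb[key] = int(rest.split()[0])   # values are in kB
    ratio = kb["MemAvailable"] / kb["MemTotal"]
    return ratio, ratio < warn_ratio

sample = "MemTotal:       65536000 kB\nMemAvailable:    3276800 kB\n"
print(mem_pressure(sample))   # (0.05, True) -> below 10% available, signal userspace
```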

dr_kretyn - 7 hours ago

Is there a similar initiative for AMD?

bhewes - 21 hours ago

This has been fun: we can task our nemotron-3-super model to run overnight when our desktops are idle. 4070s and 96 GB of RAM work fine. Slow, but it does its job.

angry_octet - 7 hours ago

I have a system with an ungodly amount of Optane memory and I'm hoping this will work.

sabareesh - 21 hours ago

I wish it provided a benchmark comparing direct RAM offload vs CPU offload vs full VRAM.

felipe_aramburu - 18 hours ago

How does this relate to cuCascade? https://github.com/nvidia/cucascade

Berazu - 11 hours ago

I wish there was a way to extend RAM/NVMe with GPU VRAM. :(

tandr - 4 days ago

Some simpler benchmark table would be great. May I suggest Ollama on base machine, Ollama with T1, Ollama with T1+T2 etc. on midsize and big models to compare token/sec?

bandrami - 10 hours ago

Qu'ils mangent de la brioche ("Let them eat cake")

pabs3 - 4 days ago

Would be great to get this into mainline Linux.

brador - 11 hours ago

Could this work on steam deck?

NooneAtAll3 - 15 hours ago

Nvidia failed to provide GPUs with an actually meaningful amount of VRAM,

and instead of improving the actual product, it decided to "solve the problem in software".

I expect this greenboost to crash and burn, honestly...

holoduke - a day ago

This is extremely slow and not useful in my opinion.