How to Think About GPUs

jax-ml.github.io

314 points by alphabetting 2 days ago


hackrmn - 8 hours ago

I find the piece, much like a lot of other documentation, "imprecise". Like most such efforts, it likely caters to a group of people expected to benefit from having GPUs explained to them, but it fumbles its terms, e.g. (the first image with burned-in text):

> The "Warp Scheduler" is a SIMD vector unit like the TPU VPU with 32 lanes, called "CUDA Cores"

It's not clear from the above what a "CUDA core" (singular) _is_ -- this is the archetypal "let me explain things to you" error most people make, usually in good faith -- if I don't know the material and I am out to understand it, then you have gotten me to read all of it without making clear the very objects of your explanation.

And so, because of these kinds of "compounding errors", the people the piece was likely targeted at are none the wiser, while those who already have a good grasp of the concepts being explained, like what a CUDA core actually is, already know most of what the piece is trying to say anyway.

My advice to everyone who starts out with a back-of-the-envelope cheatsheet and then decides to publish it "for the good of mankind", e.g. on GitHub: please be surgically precise with your terms -- the terms are your trading cards, then come the verbs, etc. I mean, this is all writing 101, but it's a rare thing, evidently. Don't mix and match terms, don't conflate them (the reader will do it for you many times over for free if you're sloppy), and be diligent with analogies.

Evidently, the piece may have been written to help those already familiar with TPU terminology -- it mentions "MXU" but there's no telling what that is.

I understand this is a tall order, but the piece is long, and all the effort that went into it could have been complemented with minimal extra hypertext, e.g. annotations for abbreviations like "MXU".

I can always ask $AI to do the equivalent for me, which is a tragedy according to some.

tormeh - 12 hours ago

I find it very hard to justify investing time into learning something that's neither open source nor has multiple interchangeable vendors. Being good at using Nvidia chips sounds a lot like being an ABAP consultant or similar to me. I realize there's a lot of money to be made in the field right now, but IIUC historically this kind of thing has not been a great move.

nickysielicki - 15 hours ago

The calculation under "Quiz 2: GPU nodes" is incorrect, to the best of my knowledge. There aren’t enough ports for each GPU and/or for each switch (less the crossbar connections) to fully realize the 450 GB/s that’s theoretically possible, which is why 3.2 TB/s of internode bandwidth is what’s offered on all of the major cloud providers and the reference systems. If it were 3.6 TB/s, this would produce internode bottlenecks in any distributed ring workload.
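
For concreteness, here is a back-of-the-envelope sketch of the two figures being compared (the values are just the ones assumed in the article's quiz and quoted above, not measurements):

    # Sketch of the node-level numbers discussed above (assumed values taken from
    # the article's quiz and this comment, not measured on real hardware).
    gpus_per_node = 8
    per_gpu_bw_gb_s = 450                                # theoretical per-GPU GB/s
    theoretical_gb_s = gpus_per_node * per_gpu_bw_gb_s   # 3600 GB/s, i.e. 3.6 TB/s
    offered_gb_s = 3200                                  # the 3.2 TB/s actually offered
    print(f"theoretical: {theoretical_gb_s / 1000:.1f} TB/s")
    print(f"offered:     {offered_gb_s / 1000:.1f} TB/s")
    print(f"shortfall:   {(theoretical_gb_s - offered_gb_s) / 1000:.1f} TB/s")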

Shamelessly: I’m open to work if anyone is hiring.

gchadwick - 3 hours ago

This whole series is fantastic! It does an excellent job of explaining the theoretical limits of running modern AI workloads, and it explains the architecture and techniques (in particular, methods of parallelism) you can use.

Yes, it's all TPU-focused (other than this most recent part), but a lot of what it discusses are general principles you can apply elsewhere (or it's easy enough to see how you could generalise them).

pbrumm - 4 hours ago

If you have optimized your math-heavy code, it is already in a typed language, and you still need it to be faster, then you start thinking about the GPU options.

In my experience you can get roughly an 8x speed improvement.

Turning a 4-second web response into half a second can be game-changing. But it is a lot easier to use a websocket and put up a spinner, or cache the result in the background.
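
For what it's worth, here is a minimal JAX sketch of what "thinking about the GPU options" can look like (illustrative only: the function is a stand-in for real math-heavy code, and the ~8x figure above is entirely workload-dependent). Run the same script against a CPU-only and a GPU install of JAX and compare the timings:

    import time
    import jax
    import jax.numpy as jnp

    @jax.jit
    def heavy_step(x):
        # Stand-in for "math-heavy code": a few large matmuls plus a nonlinearity.
        for _ in range(8):
            x = jnp.tanh(x @ x.T @ x)
        return x

    x = jnp.ones((2048, 2048), dtype=jnp.float32)
    heavy_step(x).block_until_ready()   # warm-up: compile outside the timed region

    t0 = time.perf_counter()
    heavy_step(x).block_until_ready()
    print(f"{jax.default_backend()}: {time.perf_counter() - t0:.3f}s")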

Running a GPU in the cloud is expensive.

ngcc_hk - 15 minutes ago

This is part 12… The title seems to hint at how one should think about GPUs today, e.g. why LLMs came about. Instead it is about comparing them with TPUs? And then I noticed the "part 12"… I'm not sure what to expect when jumping into the middle of a whole series… I may well stop and move on.

physicsguy - 13 hours ago

It’s interesting that nvshmem has taken off in ML because the MPI equivalents were never that satisfactory in the simulation world.

Mind you, I did all the long-range force stuff, which is difficult to work with over multiple nodes at the best of times.

radarsat1 - 6 hours ago

Why haven't Nvidia developed a TPU yet?

gregorygoc - 13 hours ago

It’s mind-boggling that this resource has not been provided by NVIDIA itself. It has reached the point where third parties reverse-engineer and summarize NV hardware until it becomes an actually useful mental model.

What are the actual incentives at NVIDIA? If it’s all about marketing, they’re doing great, but I have some doubts about the engineering culture.

einpoklum - 7 hours ago

We should remember that these structural diagrams are _not_ necessarily what NVIDIA actually has as hardware. They carefully avoid guaranteeing that any of the entities or blocks you see in the diagrams actually _exist_. It is still just a mental model NVIDIA offers for us to think about their GPUs, and more specifically the SMs, rather than a simplified circuit layout.

For example, we don't know how many actual functional units an SM has; we don't know if the "tensor core" even _exists_ as a piece of hardware, or whether there's just some kind of orchestration of other functional units; and IIRC we don't know what exactly happens at the sub-warp level w.r.t. issuing and such.

aanet - 17 hours ago

Fantastic resource! Thanks for posting it here.

akshaydatazip - 13 hours ago

Thanks for the really thorough research on that. Just what I wanted for my morning coffee.

tucnak - 12 hours ago

This post is a great illustration of why TPUs lend themselves more nicely to homogeneous computing: yes, there are systolic-array limitations (not good for sparsity), but all things considered, bandwidth doesn't change as your cluster grows ever larger. It's a shame Google is not interested in selling this hardware: if TPUs were available, it would open the door to compute-in-network capabilities far beyond what's currently possible, e.g. by combining non-homogeneous topologies involving various FPGA solutions, such as the Alveo V80 exposing 4x800G NICs.

Also: it's a shame Google doesn't talk about how they use TPUs outside of LLMs.

porridgeraisin - 2 days ago

A short addition: pre-Volta Nvidia GPUs were SIMD, like TPUs are, and not SIMT, which post-Volta Nvidia GPUs are.


tomhow - 13 hours ago

Discussion of original series:

How to scale your model: A systems view of LLMs on TPUs - https://news.ycombinator.com/item?id=42936910 - Feb 2025 (30 comments)


business_liveit - 5 hours ago

So, why hasn't Nvidia developed a TPU yet?