Cray versus Raspberry Pi
aardvark.co.nz
146 points by flyingkiwi44 5 days ago
My former boss (Steve Parker, RIP) shared a story of Turner Whitted making predictions about how much compute would be needed to achieve real-time ray tracing, some time around when his seminal paper was published (~1980). As the story goes, Turner went through some calculations and came to the conclusion that it'd take 1 Cray per pixel. Because of the space each Cray takes up, they'd be too far apart to link to a monitor and get the results in real time, so instead you'd probably have to put the array of Crays in the desert, each one attached to an RGB light, and fly over it in an airplane to see the image.
Another comparison, equally astonishing as the RPi one: modern GPUs have exceeded Whitted's prediction. Turner's paper used 640x480 images. At that resolution, extrapolating from the Cray-1's 160 MFLOPS, 1 Cray per pixel would be about 49 TFLOPS. A 4080 GPU has just shy of 50 TFLOPS peak performance, so it has surpassed what Turner thought we'd need.
Think about that: not just faster than a Cray for a lot less money, but one cheap consumer device is faster than 300,000 Crays(!). Faster than a whole Cray per pixel. We really have come a long, long way.
The 5090 has over 300 TFLOPS of ray-tracing perf, and the Tensor cores are now in the petaflops range (with lower-precision math), so we're now exceeding the compute needed for 1 Cray per pixel at 1080p. 1 GPU faster than 2M Crays. Mind-blowing.
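A quick sanity check of that arithmetic, as a Python sketch (assuming the commonly quoted 160 MFLOPS peak for the Cray-1):

    CRAY_1_FLOPS = 160e6  # commonly quoted Cray-1 peak performance

    # One Cray per pixel at Whitted's 640x480 resolution
    pixels_vga = 640 * 480                     # 307,200 pixels
    print(pixels_vga * CRAY_1_FLOPS / 1e12)    # ~49.2 TFLOPS, roughly a 4080

    # One Cray per pixel at 1080p
    pixels_1080p = 1920 * 1080                 # 2,073,600 pixels
    print(pixels_1080p * CRAY_1_FLOPS / 1e12)  # ~331.8 TFLOPS, roughly a 5090's RT figure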
> 1 Cray per pixel would be 49 Tera flops. A 4080 GPU has just shy of 50 Tflops peak performance
Interesting; I wonder how it compares in terms of transistors. How many transistors combined did one Cray have in its compute and cache chips?
The Wikipedia article says the Cray-1 had about 200k gates. I assume that would mean something slightly north of 2x that number in transistors? https://en.wikipedia.org/wiki/Cray-1#Description
200k * 300k Cray-1s would be 60B gates, whereas the 4080 actually has 46B transistors. Seems like we’re totally in the right ballpark.
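The same ballpark in code (taking Wikipedia's 200k gate count and the ~307k "Crays" at face value):

    gates_per_cray = 200_000             # Wikipedia's figure for the Cray-1
    crays = 640 * 480                    # one Cray per pixel, ~307k

    print(gates_per_cray * crays / 1e9)  # ~61 billion gates, vs ~46B transistors in a 4080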
But the Cray had a general purpose CPU while the GPUs have specialized hardware. Not exactly apples to apples.
The main part of the Cray was a compute-offload engine that asynchronously executed job lists submitted by front-end general-purpose computers running OSes like Unix.
It was actually pretty close to the model of a GPU.
Back in 2010, someone built a working model of a Cray-1.[1] Not only is it instruction-compatible (implemented on an FPGA), it's built into a 1/10-scale case that looks like a Cray-1.
The Cray-1 is really a very simple machine, with a small instruction set. It just has 64 of everything. It was built from discrete components, almost the last CPU built that way.
[1] https://www.cpushack.com/2010/09/15/homebrew-cray-1a-1976-vs...
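A toy illustration of that "64 of everything" vector model, with NumPy standing in for the hardware (the V0-V7 register naming follows Cray convention; everything else here is illustrative, not Cray assembler):

    import numpy as np

    # Eight vector registers (V0-V7), each holding 64 words; one vector
    # instruction operates on up to 64 elements at a time.
    V = [np.zeros(64, dtype=np.float64) for _ in range(8)]

    V[0][:] = np.arange(64)  # load 64 elements
    V[1][:] = 2.0            # broadcast a scalar

    V[2] = V[0] * V[1]       # one instruction's worth of work: 64 multiplies
    print(V[2][:4])          # [0. 2. 4. 6.]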
In 2013 I'd just built a new top-spec PC. I looked up its performance and then back-calculated using the TOP500 list, and I believe it would have been the most powerful supercomputer in the world in about 1993. If you back-calculated further, I think around 1980 it became more powerful than every computer on the planet combined.
And you can 3D print a Cray YMP case for your Raspberry Pi: https://www.thingiverse.com/thing:6947303
Yes, but can you sit on your Raspberry Pi like this? https://volumeone.org/uploads/image/article/005/898/5898/hea...
The Pi has a sub-$100 accelerator card that takes it to 30 TFLOPS. So you can add three more orders of magnitude of performance for roughly a doubling of the price.
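Rough numbers behind the "three orders of magnitude" claim (the ~30 GFLOPS figure for the Pi 5's CPU is an assumption, in the ballpark of published benchmarks):

    pi5_cpu_flops = 30e9       # assumed Pi 5 CPU ballpark
    accelerator_flops = 30e12  # the sub-$100 accelerator card

    print(accelerator_flops / pi5_cpu_flops)  # 1000x, i.e. three orders of magnitude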
> but then again if you'd showed me an RPi5 back in 1977 I would have said "nah, impossible" so who knows?
I was reading lots of sci-fi in 1977, so I may have tried to talk to the Pi like Scotty trying to talk to the mouse in Star Trek IV. And since you can run an LLM and text-to-speech on an RPi5, it might have answered.
You should have been watching lots of SciFi, too. (-:
I have a Raspberry Pi in a translucent "modular case" from the PiHut.
* https://thepihut.com/products/modular-raspberry-pi-4-case-cl...
It is very close to the same size and appearance as the "key" for Orac in Blake's 7.
I have so far resisted the temptation to slap it on top of a Really Useful Box and play the buzzing noise.
* https://youtube.com/watch?v=XOd1WkUcRzY
Obviously not even Avon figured out that the main box of Orac was a distraction, a fancy base station to hold the power supply, WiFi antenna, GPS receiver, and some Christmas tree lights, and all of the computational power was really in the activation key.
The amusing thing is that that is not the only 1970s SciFi telly prop that could become almost real today. It shouldn't be hard -- all of the components exist -- to make an actual Space: 1999 commlock; not just a good impression of one, but a functioning one that could do teleconferencing over a LAN, IR control for doors and tellies and stuff, and remote computer access.
Not quite in time for 1999, alas. (-:
No need for an RPi 5. Back in 1982, a dual- or quad-CPU X-MP could have run a small LLM, say, with 200–300K weights, without trouble. The Crays were, ironically, very well suited for neural networks; we just didn't know it yet. Such an LLM could have handled grammar and code autocompletion, basic linting, or documentation queries and summarization. By the late 80s, a Y-MP might even have been enough to support a small conversational agent.
A modest PDP-11/34 cluster with AP-120 vector coprocessors might even have served as a cheaper pathfinder in the late 70s for labs and companies who couldn't afford a Cray 1 and its infrastructure.
But we lacked both the data and the concepts. Massive, curated datasets (and backpropagation!) weren’t even a thing until the late 80s or 90s. And even then, they ran on far less powerful hardware than the Crays. Ideas and concepts were the limiting factor, not the hardware.
I think a quad-CPU X-MP is probably the first computer that could have run (not trained!) a reasonably impressive LLM, if you could magically transport one back in time. It supported a 4GB (512 MWord) SRAM-based "Solid State Drive" with a transfer bandwidth of 2 GB/s, and about 800 MFLOPS of CPU performance on something like a big matmul. You could probably run a 7B-parameter model with 4-bit quantization on it with careful programming, and get a token every couple of seconds.
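A rough token-rate estimate under those assumptions (3.5 GB of 4-bit weights streamed from the SSD at 2 GB/s; a naive ~2 FLOPs per parameter per token at 800 MFLOPS):

    params = 7e9
    weight_bytes = params * 0.5  # 4-bit quantization -> ~3.5 GB
    ssd_bytes_per_s = 2e9        # X-MP SSD transfer bandwidth
    cpu_flops = 800e6            # quad X-MP on a big matmul

    print(weight_bytes / ssd_bytes_per_s)  # ~1.75 s/token if streaming weights is the bottleneck
    print(2 * params / cpu_flops)          # ~17.5 s/token if raw compute is, hence the
                                           # "careful programming" caveat above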
> a small LLM, say, with 200–300K weights
A "small Large Language Model", you say? So a "Language Model"? ;-)
> Such an LLM could have handled grammar and code autocompletion, basic linting, or documentation queries and summarization.
No, not even close. You're off by 3 orders of magnitude if you want even the most basic text understanding, 4 OOM if you want anything slightly more complex (like code autocompletion), and 5–6 OOM for good speech recognition and generation. Hardware was very much a limiting factor.
I would have thought the same, but EXO Labs showed otherwise by getting a 300K-parameter LLM to run on a Pentium II with only 128 MB of RAM at about 50 tokens per second. The X-MP was in the same ballpark, with the added benefit of native vector processing (not just some extension bolted onto a scalar CPU), which performs very well on matmul.
https://www.tomshardware.com/tech-industry/artificial-intell...
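The arithmetic behind why that's plausible (using the usual ~2 FLOPs per parameter per generated token for a dense forward pass):

    params = 300e3
    tokens_per_s = 50

    mflops_needed = 2 * params * tokens_per_s / 1e6
    print(mflops_needed)  # ~30 MFLOPS sustained: feasible on a Pentium II,
                          # and comfortable for an X-MP's vector units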
John Carmack has also hinted at this: we might have had AI decades earlier. Obviously not large GPT-4-class models, but useful language reasoning at a small scale was possible. The hardware wasn't that far off; the software and incentives were.
> EXO Labs showed otherwise by getting a 300K-parameter LLM to run on a Pentium II with only 128 MB of RAM at about 50 tokens per second
50 tokens/s is completely useless if the tokens themselves are useless. Just look at the "story" generated by the model presented in your link: each individual sentence is somewhat grammatically correct, but they have next to nothing to do with each other; they make absolutely no sense. Take this, for example:
"I lost my broken broke in my cold rock. It is okay, you can't."
Good luck tuning this for turn-based conversations, let alone for solving any practical task. This model is so restricted that you couldn't even benchmark its performance, because it wouldn't be able to follow the simplest of instructions.
You're missing the point. No one is claiming that a 300K-param model on a Pentium II matches GPT-4. The point is that it works: it parses input, generates plausible syntax, and does so using algorithms and compute budgets that were entirely feasible decades ago. The claim is that we could have explored and deployed narrow AI use cases decades earlier, had the conceptual focus been there.
Even at that small scale, you can already do useful things like basic code or text autocompletion, and with a few million parameters on a machine like a Cray Y-MP, you could reasonably attempt tasks like summarizing structured or technical documentation. It's constrained in scope, granted, but it's a solid proof of concept.
The fact that a functioning language model runs at all on a Pentium II, with resources not far off from a 1982 Cray X-MP, is the whole point: we weren’t held back by hardware, we were held back by ideas.
> we weren’t held back by hardware
Llama 3 8B took 1.3M GPU-hours to train on an H100-80GB.
Of course, it didn't take 1.3M wall-clock hours (~150 years); many machines with 80GB each were used.
Let's do some napkin math: 150 machines, with a total of 12TB of VRAM, running for a year.
So, what would be needed to train a 300K-parameter model that runs in 128MB of RAM? Definitely more, much more than 128MB.
Llama 3 runs on 16GB of VRAM. Let's imagine that's our Pentium II of today. You need at least 750 times the memory needed to run it in order to train it (12TB vs. 16GB). So you would have needed ~100GB of RAM back then, running for a full year, to get that 300K model.
How many computers with 100GB+ RAM do you think existed in 1997?
Also, I only did RAM. You also need raw processing power and massive amounts of training data.
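That napkin math, worked through (the 750x comes from comparing the fleet's 12TB of training VRAM to the 16GB needed for inference):

    gpu_hours = 1.3e6                  # H100-hours to train Llama 3 8B
    hours_per_year = 24 * 365          # 8,760

    print(gpu_hours / hours_per_year)  # ~148, i.e. "150 machines for a year"

    train_vram_gb = 150 * 80           # 12,000 GB = 12TB across the fleet
    infer_vram_gb = 16                 # enough to run the finished model
    ratio = train_vram_gb / infer_vram_gb

    print(ratio)          # 750x train-to-run memory ratio
    print(ratio * 0.128)  # applied to a 128MB machine: ~96GB, i.e. ~100GB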
You’re basically arguing that because A380s need millions of liters of fuel and a 4km runway, the Wright Flyer was impossible in 1903. That logic just doesn’t hold. Different goals, different scales, different assumptions. The 300K model shows that even in the 80s, it was both possible and sufficient for narrow but genuinely useful tasks.
We simply weren’t looking, blinded by symbolic programming and expert systems. This could have been a wake-up call, steering AI research in a completely different direction and accelerating progress by decades. That’s the whole point.
"I mean, today we can do jet engines in garage shops. Why would they needed a catapult system? They could have used this simple jet engine. Look, here is the proof, there's a YouTuber that did a small tiny jet engine in his garage. They were held back by ideas, not aerodynamics and tooling precision."
See how silly it is?
Now, focus on the simple question: how would you train the 300K model in 1997? To run it, someone has to train it first.