RISC-V Is Sloooow
marcin.juszkiewicz.com.pl
239 points by todsacerdoti 13 hours ago
Don't blame the ISA - blame the silicon implementations AND the software with no architecture-specific optimisations.
RISC-V will get there, eventually.
I remember that ARM started as a speed demon with conscious power consumption, then was surpassed by x86s and PPCs on desktops and moved to embedded, where it shone by being very frugal with power, only to now be leaving the embedded space with implementations optimised for speed more than power.
In some cases RISC-V ISA spec is definitely the one to blame:
1) https://github.com/llvm/llvm-project/issues/150263
2) https://github.com/llvm/llvm-project/issues/141488
Another example is the hard-coded 4 KiB page size, which effectively kneecaps the ISA when compared against ARM.
All of those things are solved with modern extensions. It's like comparing pre-MMX x86 code with modern x86. Misaligned loads and stores are Zicclsm, bit manipulation is Zb[abcs], atomic memory operations are made mandatory in Ziccamoa.
All of these extensions are mandatory in the RVA22 and RVA23 profiles and so will be implemented on any up-to-date RISC-V core. It's definitely worth setting your compiler target appropriately before making comparisons.
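One way to check what the compiler was actually told to target is to look at its predefined macros. A minimal sketch, assuming a GCC or Clang toolchain that follows the RISC-V C API macro naming (`__riscv_zba`, `__riscv_zbb`, `__riscv_zbs`); `bitmanip_level` is a made-up helper name, not part of any library:

```c
#include <stddef.h>

/* Describes the bit-manipulation support the compiler was told to assume.
 * The __riscv_* macros are set by the compiler from -march; on non-RISC-V
 * targets none of them is defined. This only reflects compile-time flags,
 * not what the CPU running the binary really implements. */
static const char *bitmanip_level(void) {
#if defined(__riscv) && defined(__riscv_zba) && defined(__riscv_zbb) && defined(__riscv_zbs)
    return "riscv: Zba+Zbb+Zbs enabled";
#elif defined(__riscv)
    return "riscv: no full bitmanip (Zba/Zbb/Zbs not all enabled)";
#else
    return "not a RISC-V target";
#endif
}
```

Built with something like `-march=rv64gc_zba_zbb_zbs`, or a profile string such as `-march=rva23u64` where the toolchain accepts it, the first branch should be taken; a plain `-march=rv64gc` build should take the second.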
Ubuntu being RVA23 is looking smarter and smarter.
The RISC-V ecosystem being handicapped by backwards compatibility does not make sense at this point.
Every new RISC-V board is going to be RVA23 capable. Now is the time to draw a line in the sand.
I’d be kind of depressed if every new RISC-V board was not RVA23 capable.
But RISC-V is a _new_ ISA. Why did we start out with the wrong design that now needs a bunch of extensions? RISC-V should have taken the learnings from x86 and ARM but instead they seem to be committing the same mistakes.
I was a bit shocked by the headline, given how poorly ARM and x86 compare to RISC-V in speed, cost, and efficiency ... in the MCU space where I near-exclusively live and where RISC-V has near-exclusively lived up until quite recently. RISC-V has been great for RTOS systems and Espressif in particular has pushed MCUs up to a new level where it's become viable to run a designed-from-scratch web server (you better believe we're using vector graphics) on a $5 board that sits on your thumb, but using RISC-V in SBCs and beyond as the primary CPU is a very different ballgame.
It is a reduced instruction set computing ISA, of course. It shouldn't really have instructions for every edge case.
I only use it for microcontrollers and it's really nice there. But yeah, I can imagine it doesn't perform well on bigger stuff. The idea of RISC was to put the intelligence in the compiler though, not the silicon.
As the evolution of x86/x64 and ARM has shown, going all in on pure RISC doesn't pay off, because there is only so much compilers can do in an AOT deployment scenario.
Intentionally. Back then, people were saying that everything could be solved by raw power.
It was kind of an experiment from start. Some ideas turned out to be good, so we keep them. Some ideas turned out not to be good, so we fix them with extensions.
The problem with hardware experiments is that the people who own the hardware are stuck with the experiments.
If your hardware is new, you get the nicest extensions though. You just don’t use the bad parts in your code.
Sure, if you are developing software for the computer you own, instead of supporting everyone.
You're correct but I guess my thoughts are if we're going to wind up with a mess of extensions, why not just use x86-64?
First, x86-64 also has “extensions” such as avx, avx2, and avx512. Not all “x86-64” CPUs support the same ones. And you get things like svm on AMD and vmx on Intel. Remember 3DNow?
X86-64 also has “profiles” which tell you what extensions should be available. There is x86-64v1 and x86-64v4 with v2 and v3 in the middle.
RVA23 offers a very similar feature-set to x86-64v4.
You do not end up with a mess of extensions. You get RVA23. Yes, RVA23 represents a set of mandatory extensions. The important thing is that two RVA23 compliant chips will implement the same ones.
But the most important point is that you cannot “just use x86-64”. Only Intel and AMD can do that. Anybody can build a RISC-V chip. You do not need permission.
1. Yes, but most of the code would run on anything older than 2007. 20 years of stable ISA.
2. Also, fundamentally all modern CPUs are still 64-bit versions of the 80386. MMU, protection, low-level details are all the same.
Because the ISA is not legally encumbered the way other ISAs are, and there are use cases where the minimal profile is fine for embedded work, weighed against the cost of implementing the extensions.
> why not just use x86-64?
Uh, because you can't? It's not open in any meaningful sense.
The original amd64 came out in 2003. Any patents on the original instruction set have long expired, and even more so for 32-bit x86.
It's not about patents. Believe what you want, but there is a reason nobody else is doing x86 or ARM chips unless they are allowed to by the owner.
What about page size?
RISC-V has the Svnapot extension for large page sizes https://riscv.github.io/riscv-unified-db/manual/html/isa/isa...
It's 4k on x86 as well. Doesn't seem to hurt so bad -- at least, not enough to explain the risc-v performance gap.
Hmm? x86 has supported much larger “huge” page sizes for ages.
Yes, and Linux, at least historically, has not used them without explicit program opt-in. Often the advice is to disable transparent huge pages for performance reasons. Not sure about other operating systems.
See, for example, https://www.pingcap.com/blog/transparent-huge-pages-why-we-d...
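For reference, the explicit opt-in on Linux usually goes through `madvise` with `MADV_HUGEPAGE`. A rough sketch, assuming Linux with glibc; `map_with_thp_hint` is a hypothetical helper name, and whether the kernel actually backs the region with huge pages depends on the system's THP settings:

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* exposes MAP_ANONYMOUS and madvise on glibc */
#endif
#include <stddef.h>
#include <sys/mman.h>

/* Maps `len` bytes of anonymous memory and asks the kernel to consider
 * transparent huge pages for it. Returns NULL if the mapping fails. */
static void *map_with_thp_hint(size_t len) {
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
#ifdef MADV_HUGEPAGE
    /* Advisory only: the return value is ignored on purpose, because the
     * mapping stays usable with ordinary 4 KiB pages if the hint is refused. */
    (void)madvise(p, len, MADV_HUGEPAGE);
#endif
    return p;
}
```

The caller releases the region with `munmap(p, len)`. Nothing here guarantees huge pages; it only expresses the opt-in being discussed.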
Huh, no? The usual advice is to enable THPs for performance, you only disable them in specific scenarios.
>Misaligned loads and stores are Zicclsm
Nope. See https://github.com/llvm/llvm-project/issues/110454 which was linked in the first issue. The spec authors have managed to make a mess even here.
Now they want to introduce yet another (sic!) extension Oilsm... It maaaaaay become part of RVA30, so in the best case scenario it will be decades before we will be able to rely on it widely (especially considering that RVA23 is likely to become heavily entrenched as "the default").
IMO the spec authors should've mandated that the base load/store instructions work only with aligned pointers and introduced misaligned instructions in a separate early extension. (After all, passing a misaligned pointer where your code does not expect it is a correctness issue.) But I would've been fine as well if they had mandated that misaligned pointers should always be accepted. Instead we have to deal with the terrible middle ground.
>atomic memory operations are made mandatory in Ziccamoa
In other words, forget about potential performance advantages of load-link/store-conditional instructions. `compare_exchange` and `compare_exchange_weak` will always compile into the same instructions.
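To make the weak/strong distinction concrete, here is a minimal C11 sketch of the kind of retry loop where the weak form is the natural fit. It is only an illustration: `atomic_fetch_max_u32` is a made-up helper, and what each call lowers to depends on the target and compiler.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Atomically replaces *v with max(*v, candidate) and returns the value
 * that was observed before the update (or the current value if no update
 * was needed). */
static uint32_t atomic_fetch_max_u32(_Atomic uint32_t *v, uint32_t candidate) {
    uint32_t seen = atomic_load_explicit(v, memory_order_relaxed);
    /* The weak form may fail spuriously, which is harmless inside a loop.
     * On an LL/SC machine it can map to a single reservation attempt,
     * while the strong form needs its own inner retry loop. */
    while (seen < candidate &&
           !atomic_compare_exchange_weak_explicit(v, &seen, candidate,
                                                  memory_order_acq_rel,
                                                  memory_order_relaxed)) {
        /* On failure, `seen` now holds the current value; re-check it. */
    }
    return seen;
}
```

When the hardware only offers a single compare-and-swap style primitive, both forms tend to produce the same code, which is the lost opportunity described above.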
And I guess you are fine with the page size part. I know there are huge-page-like proposals, but they do not resolve the fundamental issue.
I have other minor performance-related nits, such as the `seed` CSR being allowed to produce poor-quality entropy, which means that we have to bring a whole CSPRNG if we want to generate a cryptographic key or nonce on a low-powered microcontroller.
By no means I consider myself a RISC-V expert, if anything my familiarity with the ISA as a systems language programmer is quite shallow, but the number of accumulated disappointments even from such shallow familiarity has cooled my enthusiasm for RISC-V quite significantly.
I think having separate unaligned load/store instructions would be a much worse design, not least because they use a lot of the opcode space. I don't understand why you don't just have an option to not generate misaligned loads for people that happen to be running on CPUs where it's really slow. You don't need to wait for a profile for that.
As for `seed`, if you're running on a microcontroller you can just look up the data sheet to see if its seed entropy is sufficient. By the time you get to CPUs where portable code is important a CSPRNG is probably fine.
I agree about page size though. Svnapot seems overly complicated and gives only a fraction of the advantages of actually bigger pages.
The option to generate or not generate misaligned loads/stores does exist (-mno-strict-align / -mstrict-align). But of course that's a compile-time option, and of course the preferred state would be to have use of them on by default, but RVA23 doesn't sufficiently guarantee/encourage them not being unreasonably-slow, leaving native misaligned loads/stores still effectively-unusable (and off by default on clang/gcc on -march=rva23u64).
aka, Zicclsm / RVA23 are entirely-useless as far as actually getting to make use of native misaligned loads/stores goes.
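For context, this is the usual portable way to express a possibly-misaligned access in C, and it is the code whose lowering those flags change. A small sketch; the helper names are illustrative:

```c
#include <stdint.h>
#include <string.h>

/* Reads a 32-bit value from any address without assuming alignment.
 * memcpy with a constant size is well defined for any pointer, and
 * compilers normally inline it instead of calling the library. */
static uint32_t load_u32_unaligned(const void *p) {
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Writes a 32-bit value to any address without assuming alignment. */
static void store_u32_unaligned(void *p, uint32_t v) {
    memcpy(p, &v, sizeof v);
}
```

With `-mno-strict-align` a compiler can typically turn each helper into a single word load or store; with `-mstrict-align` it has to fall back to byte accesses plus shifts, which is the cost being argued about.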
> RVA23 doesn't guarantee them not being unreasonably-slow
Right, but it doesn't guarantee that anything isn't unreasonably slow, does it? I am free to make an RVA23 compliant CPU with a div instruction that takes 10k cycles. Does that mean LLVM won't output div? At some point you're left with either -mcpu=<specific cpu> or falling back to reasonable assumptions about the actual hardware landscape.
Do ARM or x86 make any guarantees about the performance of misaligned loads/stores? I couldn't find anything.
I don't think x86/ARM particularly guarantee fastness, but at least they effectively encourage making use of them via their contributions to compilers that do. They also don't really need to given that they mostly control who can make hardware anyway.
Indeed one can make any instruction take basically-forever, but I think it's a fairly reasonable expectation that all supported hardware instructions/behaviors (at least non-deprecated ones) are not slower than a software implementation (on worst-case inputs), else having said instruction is strictly-redundant.
And if any significant general-purpose hardware actually did a 10k-cycle div around the time the respective compiler defaults were decided, I think there's a good chance that software would have defaulted to calling division through a function such that an implementation can be picked depending on the running hardware. (let's ignore whether 10k-cycle-division and general-purpose-hardware would ever go together... but misaligned-mem-ops+general-purpose-hardware definitely do)
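The "call it through a function and pick an implementation for the running hardware" idea can be sketched roughly like this. Everything here is hypothetical: the `hw_div_is_fast` flag stands in for whatever detection mechanism a real runtime would use, and the helper names are made up.

```c
#include <stdbool.h>
#include <stdint.h>

/* Uses the hardware divide instruction. Divisor must be nonzero. */
static uint32_t udiv32_hw(uint32_t n, uint32_t d) {
    return n / d;
}

/* Shift-subtract (restoring) division in software. Divisor must be
 * nonzero. The remainder is kept in 64 bits so the shift cannot overflow. */
static uint32_t udiv32_soft(uint32_t n, uint32_t d) {
    uint32_t q = 0;
    uint64_t r = 0;
    for (int i = 31; i >= 0; --i) {
        r = (r << 1) | ((n >> i) & 1u);   /* bring down the next bit */
        if (r >= d) {
            r -= d;
            q |= UINT32_C(1) << i;
        }
    }
    return q;
}

typedef uint32_t (*udiv32_fn)(uint32_t, uint32_t);

/* Picks an implementation once, e.g. at startup. How `hw_div_is_fast`
 * gets determined is left open; it is an assumption of this sketch. */
static udiv32_fn select_udiv32(bool hw_div_is_fast) {
    return hw_div_is_fast ? udiv32_hw : udiv32_soft;
}
```

The price is an indirect call on every use, which is why nobody wants to pay it for something as common as a division or a misaligned load unless the hardware forces the issue.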
RISC-V is not particularly good at using opcode space, unfortunately.
I don't think it's too bad. The compressed extension was arguably a mistake (and shouldn't be in RVA23 IMO), but apart from that there aren't any major blunders. You're probably thinking about how JAL(R) basically always uses x1/x5 (or whatever it is), but I don't think that's a huge deal.
About 1/3 of the opcode space is used currently so there's a decent amount of space left.
Unaligned load/store is a horrible feature to implement.
Page size can be easily extended down the line without breaking changes.
Regarding misaligned reads, IIRC only x86 hides non-aligned memory access. It's still slower than aligned reads. Other processors just fault, so it would make sense to do the same on riscv.
The problem is decades of software being written on a chip that from the outside appears not to care.
Yes, unaligned loads/stores are a niche feature that has huge implications in processor design - loads across cache-lines with different residency, pages that fault etc.
This is the classic conundrum of legacy system redesign - if customers keep demanding every feature of the old system be present, and work the exact same then the new system will take on the baggage it was designed to get rid of.
The new implementation will be slow and buggy by this standard and nobody will use it.
ARM Cortex-A cores also allow unaligned access (MCU cores don't though, and older ARM is weird). There's perhaps a hint if the two most popular CPU architectures have ended up in the forgiving approach to unaligned access, rather than the penalising approach of raising an interrupt.
On modern CPUs, outside RISC, it was not something you had to care about across the 8-, 16-, and 32-bit generations.
PDP-11, m68k – to name a few, did not allow misaligned access to anything that was not a byte.
Neither of those is RISC or modern.
As for the 68000, I don't remember; I only used it during demoscene coding parties, when my friends let me touch their Amigas.
I have only seen PDP-11 Assembly snippets in UNIX related books, wasn't aware of its alignment requirements.
Also the bit manipulation extension wasn't part of the core. So things like bit rotation are slow for no good reason, if you want portable code. Why? Who knows.
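For anyone who hasn't looked at it, this is the portable rotate idiom in question. A small sketch in C; `rotl32` is just an illustrative name:

```c
#include <stdint.h>

/* Rotates x left by n bits. Masking both shift counts keeps the
 * expression well defined for every n, including 0 and 32. */
static uint32_t rotl32(uint32_t x, unsigned n) {
    n &= 31u;
    return (x << n) | (x >> ((32u - n) & 31u));
}
```

Compilers generally recognise this pattern. When targeting a core with Zbb it can become a single rotate instruction; on the base ISA it stays as two shifts and an OR, plus the work to compute the second shift amount.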
> Also the bit manipulation extension wasn't part of the core.
This is mainly because the core is primarily a teaching ISA. One of the best parts about RISC-V is that you can teach a freshman-level architecture class or a senior-level chip-building project with an ISA that is actually used. Anything powerful enough to run Linux (other than a distribution built manually from source) will support a profile that bundles all the instructions commonly needed to be fast.
Bit manipulation instructions are part and parcel of any curriculum that teaches CPU architecture. They are the basic building blocks for many more complex instructions.
https://five-embeddev.com/riscv-bitmanip/1.0.0/bitmanip.html
I can see quite a few items on that list that imnsho should have been included in the core and for the life of me I can't see the rationale behind leaving them out. Even the most basic 8 bit CPU had various shifts and rolls baked in.
This is the reason behind the profiles like RVA23 which include bitmanip, vector and a large number of other extensions. Real chips coming very soon will all be RVA23.
Neat. I can't wait to get my hands on a devboard.
The earliest I know of coming is the SpaceMit K3, which Sipeed will have dev boards for.
32-bit barrel shifters consume significant area and RISC-V was developed to support resource constrained low cost embedded hardware in a minimal ISA implementation.
It was the case even 15 years ago when Cortex M0/M3 really started to get traction, that the processor area of ARM cores was small enough to not make a difference in practice.
The 32-bit ARM architecture included a barrel shifter as part of its basic design, as in every instruction had a shift field.
If a CPU built in 1985 with a grand total of 26 000 transistors could afford it, I am pretty sure that anything built in this century could afford it too.
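A tiny example of what that shift field buys you in practice, written in C so it can be compiled for either architecture. The function name is made up, and the instruction sequences mentioned afterwards are what compilers typically emit, not a guarantee:

```c
#include <stdint.h>

/* Computes the address of element `index` in an array of 4-byte words
 * starting at `base`: a shift by 2 followed by an add. */
static uintptr_t word_address(uintptr_t base, uintptr_t index) {
    return base + (index << 2);
}
```

On 32-bit ARM the shift folds into the add as a shifted operand (`add r0, r0, r1, lsl #2`). On base RISC-V it takes an `slli` followed by an `add`; the Zba extension adds `sh2add` to cover exactly this case.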
26k is a lot of transistors for an embedded MCU.
You'd be excluding many small CPUs which exist within other chips running very specialized code.
As profiles mandate these instructions anyway, there's no good reason to complicate the most basic RISC-V possible.
RISC-V is the ISA for everything, from the smallest such CPUs to supercomputers.
What MCUs are you thinking of?
To the best of my knowledge (and Google-fu), 26K really isn't a lot of transistors for an embedded MCU - at least not a fully-featured 32-bit one comparable to a minimal RISC-V core. An ARM Cortex M0, which is pretty much the smallest thing out there, is around 10K gates => around 40K transistors. This is also around the same size as a minimal RISC-V core AFAICT.
The ARM core has a shifter, though.