C++26 Shipped a SIMD Library Nobody Asked For

lucisqr.substack.com

107 points by signa11 2 days ago

I have written a lot of SIMD for both x86 and ARM over many years and many microarchitectures. Every abstraction, including autovectorization, is universally pretty poor outside of narrow cases because they don’t (and mostly can’t) capture what is possible with intrinsics and their rather extreme variation across microarchitectures. If I want good results, I have to write intrinsics. No library can optimally generate non-trivial SIMD code. Neither can the compiler. Portability just amplifies this gap.

I think a legitimate criticism is that it is unclear who std::simd is for. People that don’t use SIMD today are unlikely to use std::simd tomorrow. At the same time, this does nothing for people that use SIMD for serious work. Who is expected to use this?

The intrinsics are not difficult but you do have to learn how the hardware works. This is true even if you are using a library. A good software engineer should have a rough understanding of this regardless.

mgaunard - 5 hours ago

For me the main issue is that if you're serious about SIMD, you need to use a state-of-the-art library and can't rely on some standard library whose quality is variable, unreliable, and which is by design always behind.
- jandrewrogers - 4 hours ago
  
  For some algorithms you have to compromise the data layout for compatibility across the widest number of microarchitectures by nerfing the performance on advanced SIMD microarchitectures working on the same data structures. There really isn’t a way to square that circle. You can make it portable or you can make it optimal, and the performance gap across those two implementations can be vast.
  In the 15-20 years I’ve been doing it, I’ve seen zero evidence that there is a solution to this tradeoff. And people that are using SIMD are people that care about state-of-the-art performance, so portability takes a distant back seat.
  - mgaunard - 3 hours ago
    
    For Boost.SIMD (which is what became Eve), a large part of what we did to tackle those problems was building an overload dispatching system so that we could easily inject increasingly specialized implementations depending on the types and instruction set available, in such a way that operations could combine efficiently.
    That, however, performed quite poorly at compile-time, and was not really ODR-safe (forceinline was used as a workaround). At least one of the forks moved to using a dedicated meta-language and a custom compiler to generate the code instead. There are better ways to do that in modern C++ now.
    We also focused on higher-level constructs trying to capture the intent rather than trying to abstract away too low-level features; some of the features were explicitly provided as kernels or algorithms instead of plain vector operations.
  - mattip - 4 hours ago
    
    NumPy has a whole dispatch mechanism to deal with the tradeoffs. The main problem is code bloat: how many microarchitectures are you going to support with dispatch at runtime?
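To make the code-bloat point concrete, here is a minimal sketch of runtime dispatch. It is not NumPy's mechanism; it assumes GCC or Clang on x86-64 and uses their `__builtin_cpu_supports` builtin and `target` function attribute. The names `sum_baseline`, `sum_avx2`, `pick_sum` and `sum_u32` are made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Baseline copy: built for the default target of the translation unit.
static std::uint64_t sum_baseline(const std::uint32_t* p, std::size_t n) {
    std::uint64_t s = 0;
    for (std::size_t i = 0; i < n; ++i) s += p[i];
    return s;
}

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
// Second copy of the same loop; the target attribute lets the compiler
// use AVX2 instructions if it decides to vectorize this function.
__attribute__((target("avx2")))
static std::uint64_t sum_avx2(const std::uint32_t* p, std::size_t n) {
    std::uint64_t s = 0;
    for (std::size_t i = 0; i < n; ++i) s += p[i];
    return s;
}
#endif

using SumFn = std::uint64_t (*)(const std::uint32_t*, std::size_t);

// Queries the CPU and picks the most specialized copy that it can run.
static SumFn pick_sum() {
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return sum_avx2;
#endif
    return sum_baseline;
}

// Public entry point: the choice is made once, on the first call.
std::uint64_t sum_u32(const std::uint32_t* p, std::size_t n) {
    static const SumFn impl = pick_sum();
    return impl(p, n);
}
```

Every additional target (AVX-512, SVE, RVV, ...) would add another compiled copy of each dispatched kernel, which is where the binary size goes.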
  - camel-cdr - 3 hours ago
    
    The data layout can often be done dynamically based on your target architecture.
- portly - 42 minutes ago
  
  Don't let the best be the enemy of the good. I got amazing performance from swapping for-loops for some simple SIMD patterns. Moreover, by doing this, I noticed that the codebase started to become better shaped for performance as well. By writing SIMD patterns, you get into the mindset of tight, hot loops.
mpyne - 4 hours ago

> I think a legitimate criticism is that it is unclear who std::simd is for.
I think it's for people like me, who recognize that a lot of performance is left on the table for some datasets when you don't take advantage of SIMD, but who are not interested in becoming experts on intrinsics for a multitude of processor combinations.
Having a way to be able to say "flag bytes in this buffer matching one of these five characters, choose the appropriate stride for the actual CPU" and then "OR those flags together and do a popcount" (as I needed to do writing my own wc(1) as an exercise), and have that at least come close to optimal performance with intrinsics would be great.
Just like I'd rather use a ranged-for than to hand count an index vs. a size.
> People that don’t use SIMD today are unlikely to use std::simd tomorrow.
I mean, why not? That's exactly my use case. I don't use SIMD today as it's a PITA to do properly despite advancements in glibc and binutils to make it easier to load in CPU-specific codes. And it's a PITA to differentiate the utility of hundreds of different vpaddcfoolol instructions. But it is legitimately important for improving performance for many workloads, so I don't want to miss it where it will help.
And even gaining 60, 70% of the "optimal" SIMD still puts you much closer to the highest performance than the alternative.
In the end I did end up having to write some direct SIMD intrinsics, I forget what issue I'd run into starting off with std::simd, but std::simd was what had made that problem seem approachable for the first time.
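For readers who want to see the shape of the "flag bytes, OR the flags, popcount" pattern, here is a rough sketch. It assumes `std::experimental::simd` from the Parallelism TS v2 as shipped in libstdc++ (GCC 11 and later) rather than the C++26 `std::simd` spelling. The function name `count_whitespace` and the particular five characters are assumptions; the comment above does not list them.

```cpp
#include <cstddef>
#include <experimental/simd>

namespace stdx = std::experimental;

// Counts bytes equal to one of five whitespace characters.
std::size_t count_whitespace(const unsigned char* buf, std::size_t n) {
    using V = stdx::native_simd<unsigned char>;  // width fixed by the build target
    const V space(static_cast<unsigned char>(' '));
    const V newline(static_cast<unsigned char>('\n'));
    const V tab(static_cast<unsigned char>('\t'));
    const V cr(static_cast<unsigned char>('\r'));
    const V formfeed(static_cast<unsigned char>('\f'));

    std::size_t count = 0;
    std::size_t i = 0;
    for (; i + V::size() <= n; i += V::size()) {
        const V v(buf + i, stdx::element_aligned);
        // One mask per character, OR-ed together, then a popcount.
        const auto hit = (v == space) | (v == newline) | (v == tab) |
                         (v == cr) | (v == formfeed);
        count += static_cast<std::size_t>(stdx::popcount(hit));
    }
    // Scalar tail for the bytes that do not fill a whole vector.
    for (; i < n; ++i) {
        const unsigned char c = buf[i];
        count += (c == ' ' || c == '\n' || c == '\t' || c == '\r' || c == '\f');
    }
    return count;
}
```

Note that `native_simd` picks its width from the compile-time target flags, not from the CPU the binary ends up running on, so "choose the appropriate stride for the actual CPU" would still need some form of dispatch on top.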
- jandrewrogers - 4 hours ago
  
  You raise some good points. I think a lot about how to make SIMD more accessible, and spend an inordinate amount of time experimenting with abstractions, because I’ve experienced its many inadequacies.
  The design of the intrinsics libraries does itself no favors and there are many inconsistencies. Basic things could be made more accessible but are somewhat limited by a requirement for C compatibility. This is something a C++ standard can actually address: it can be C++ native, which can hide many things. Hell, I have my own libraries that clean this up by thinly wrapping the existing intrinsics, improving their conciseness and expressiveness for common use cases. It significantly improves the ergonomics.
  An argument I would make though is that the lowest common denominator cases that are actually portable are almost exactly the cases that auto-vectorization should be able to address. Auto-vectorization may not be good enough to consistently address all of those cases today, but you can see a future where std::simd is essentially vestigial: auto-vectorization subsumes what it can do, and it can’t be leveled up to express more than what auto-vectorization can see due to limitations imposed by portability requirements.
  The other argument is that SIMD is the wrong level of abstraction for a library. Depending on the microarchitecture, the optimal code using SIMD may be an entirely different data structure and algorithm, so you are swapping out SIMD details at a very abstract macro level, not at the level of abstraction that intrinsics and auto-vectorization provide. You miss a lot of optimization if you don’t work a couple levels up.
  SIMD abstraction and optimization is deeply challenging within programming languages designed around scalar ALU operators. We can’t even fully abstract the expressiveness of modern scalar ALUs across microarchitectures because programming languages don’t define a concept that maps to the capabilities of some modern ALUs.
  That said, I love that silicon has become so much more expressive.
  - camel-cdr - 2 hours ago
    
    IMO what's needed is ISPC-like guided autovec with a lot of hinting support to control codegen (e.g. a hint for generating an unrolled version only, or both an unrolled and a non-unrolled version).
    Basically something like #pragma omp simd, but actually designed for the SIMD model, not the parallel one, that errors when vectorization isn't possible.
    Ideally it would support things like reductions, scans, references to elements from other iterations (e.g. out[i] = in[i-1]+in[i+1]), full gather/scatter, early break, conditional execution control (masking, or also a fast path when there are no active elements), latency vs. throughput sensitivity (don't unroll, or unroll to the max without spilling), data-dependent termination (fault-only-first load or page-aligned for things like strlen), ...
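As a point of reference for what exists today, here is a small sketch of two of the listed cases (a neighbour-access loop and a reduction) written with the current `#pragma omp simd`. The function names `stencil` and `sum_all` are made up. GCC and Clang honour the pragma with `-fopenmp-simd` (or `-fopenmp`) and ignore it otherwise, and neither reports an error if the loop ends up scalar, which is the gap the comment describes.

```cpp
#include <cstddef>
#include <vector>

// Neighbour access across iterations: out[i] = in[i-1] + in[i+1].
// `in` and `out` must be distinct vectors so iterations stay independent.
void stencil(const std::vector<float>& in, std::vector<float>& out) {
    const std::size_t n = in.size();
    if (n < 3 || out.size() != n) return;
    const float* src = in.data();
    float* dst = out.data();
    #pragma omp simd
    for (std::size_t i = 1; i < n - 1; ++i) {
        dst[i] = src[i - 1] + src[i + 1];
    }
}

// Reduction: the clause tells the compiler it may keep per-lane partial sums.
float sum_all(const std::vector<float>& v) {
    const float* p = v.data();
    const std::size_t n = v.size();
    float total = 0.0f;
    #pragma omp simd reduction(+ : total)
    for (std::size_t i = 0; i < n; ++i) {
        total += p[i];
    }
    return total;
}
```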
- duped - 3 hours ago
  
  > it's a PITA to differentiate the utility of hundreds of different vpaddcfoolol instructions
  This is one complaint I toss back at Intel and AMD.
  If an instruction/intrinsic is worse than another set of intrinsics in the P90/P95/P99 use case where it's going to be used, then it shouldn't exist. Stop wasting the die space and instruction decode on it, not to mention the developer time wasted finding out that your dot product instruction is useless.
  There are a lot of smart people that have worked on compilers, optimized subroutines for LAPACK/BLAS, and designed the decoders and hardware. A lot of that effort is wasted because no one knows how to program these weird little machines. A little manual on "here's how to program SIMD, starting from linear algebra basics" would be worth more to Intel than all the money they've wasted trying to improve autovectorization passes in ICC and now, LLVM.
cortesoft - 4 hours ago

Is this a technical impossibility, or has it just not been done yet? Could a library support generating intrinsics for a large set of architectures?
- jandrewrogers - 3 hours ago
  
  The full scope of what SIMD is used for is much larger than parallelizing evaluation of numeric types and algorithms.
  For example, it is used for parallel evaluation of complex constraints on unrelated types simultaneously while packed into a single vector. Think a WHERE clause on an arbitrary SQL schema evaluated in full parallel in a handful of clock cycles. SIMD turns out to be brilliant for this but it looks nothing like auto-vectorization.
  None of the SIMD libraries like Google Highway cover this case.
  - camel-cdr - 2 hours ago
    
    I don't quite get how something like Highway doesn't cover this, while intrinsics do.
    Can you explain the use case more concretely?
    
    jandrewrogers - an hour ago
    
    Almost literally what I stated. Consider a row in a Postgres table or similar. Convert the entire WHERE clause across all columns in that table into a very short sequence of SIMD instructions against the same memory. All of the columns, regardless of type, are evaluated simultaneously using SIMD. For many complex constraints you can match rows in single digit clock cycles even across many unrelated types. This is much faster than using secondary indexes in many cases.
    It isn’t hypothetical, I’ve shipped systems that worked this way. You can match search patterns across a random dozen columns across a schema of hundreds of columns at essentially full memory bandwidth.
    
    camel-cdr - an hour ago
    
    OK, I thought it couldn't be that, because that should be doable with std::simd or a SIMD abstraction. Well, unless you JIT it, in which case intrinsics wouldn't help either.
    > You can match search patterns across a random dozen columns across a schema of hundreds of columns at essentially full memory bandwidth
    Do I understand correctly that this would only work if you have multiple of the same comparisons (e.g. equality checks on same-sized data) in the WHERE clause and the relevant columns are within one multiple of the SIMD width of each other?
    
    jandrewrogers - 39 minutes ago
    
    Every column has its own independent constraint: equality, order, range intersection, bit sets, etc that is evaluated concurrently in single operations. Independent per column in parallel. It does require handling the representation of columns to enable it but that isn’t onerous in practice.
    It isn’t intuitive but it is one of those things that is obvious in hindsight once you see how it works. The gap is that people struggle to understand how to make this something SIMD native, especially in high-performance systems.
    
    camel-cdr - 13 minutes ago
    
    Ah, so you're just doing SoA or AoSoA layout? It sounded like you were doing something more special than the standard SIMD use case.
    This does easily work with SIMD abstractions and even length-agnostic vector ISAs, unless you're doing AoSoA and your storage format has to match your memory format and it has to be the same on all machines. In that case you probably want to do something like 4K blocks anyway, which lets you make it agnostic for all vector lengths anybody reasonably cares about for this type of application.
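For readers unfamiliar with the SoA reading of this, here is a scalar sketch of a column-wise filter that evaluates an independent predicate per column and emits one result bit per row. It only illustrates the layout being asked about; it is not the system described earlier in the thread, and the `Table` struct, its columns, the predicate and `match_block` are all invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical table stored column-wise (SoA): one array per column.
struct Table {
    std::vector<std::int32_t> age;
    std::vector<std::uint8_t> status;
    std::vector<float> balance;
};

// Evaluates "age BETWEEN 18 AND 65 AND status = 2 AND balance > 0" over
// one block of up to 64 rows and returns one result bit per row.
std::uint64_t match_block(const Table& t, std::size_t first, std::size_t count) {
    assert(count <= 64 && first + count <= t.age.size());
    std::uint64_t bits = 0;
    for (std::size_t i = 0; i < count; ++i) {
        const std::size_t r = first + i;
        // Each column has its own independent predicate; they are combined
        // with bitwise AND so the loop body stays branch-free.
        const bool hit = (t.age[r] >= 18) & (t.age[r] <= 65) &
                         (t.status[r] == 2) & (t.balance[r] > 0.0f);
        bits |= static_cast<std::uint64_t>(hit) << i;
    }
    return bits;
}
```

Because each column is contiguous, a SIMD library or a hand-written kernel could evaluate every comparison on a whole vector of rows at once and AND the resulting masks together.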
- loeg - 4 hours ago
  
  Google Highway gets mentioned in the article.
- mattip - 4 hours ago
  
  There is google’s highway, that provides an abstraction layer. It is used by NumPy.
synergy20 - 3 hours ago

what about Google highway project?
paulddraper - 4 hours ago

> I think a legitimate criticism is that it is unclear who std::simd is for
It's for people that don't use SIMD today.
SIMD is hard, or at least nuanced and platform-dependent. To say that std::simd doesn't lower the learning curve is intellectually dishonest.
---
Despite the title, the primary criticism of the article is that the compilers' auto-vectorizers have become better than the currently shipped stdlib version.
- jandrewrogers - 3 hours ago
  
  My criticism could mostly be summarized similarly. The scope of what a portable std::simd can do is almost exactly the scope that you would expect auto-vectorization to subsume over time. SIMD, to the extent it is covered by std::simd, is the part of SIMD that should be pretty simple to learn.
  There isn’t an obvious path to elevate it above what auto-vectorization should theoretically be capable of in a portable way. This leads to a potential long-term outcome where std::simd is essentially a no-op because scalar code is automagically converted into the equivalent and it is incapable of supporting more sophisticated SIMD code.
mgaunard - 4 hours ago

I made the first proposal to the C++ standard committee to introduce SIMD in 2011, before Matthias Kretz got involved with his own version (which is what became std::simd). This was based on what eventually became Eve (mentioned in the article).

Back then, it was rejected, for the same arguments that people are making today, such as not mapping to SVE well, having a separate way to express control flow etc.

There was a real alternative being considered at the time: integrating ISPC-like semantics natively in the language. Then that died out (I'm not sure why), and SIMD became trendy, so the committee was more open to doing something to show that they were keeping up with the times.

camel-cdr - 2 hours ago

> There was a real alternative being considered at the time: integrating ISPC-like semantics natively in the language.
I think this is the best solution for truly portable SIMD. Sure, it doesn't cover everything, but it makes autovec explicit, guaranteed and more powerful.
One of the biggest problems with "portable" SIMD libraries is that when they're used for simple things, autovec is often better, as it has access to the direct ISA semantics and can much more easily do things like unrolling.
rbanffy - 4 hours ago

To me it’s clear adding the ability to express intent to parallelise is the Right Thing. This is the only way the compiler can actually know what you want it to do.
AlotOfReading - 4 hours ago

Trying to abstract over SVE with a SIMD library is a bit of a fool's errand. The intended programming model is just too different from traditional ISAs, and there are algorithms that are nearly impossible to write efficiently for it. All the ones I've seen wrap it up as a bastardized fixed length ISA, and even ARM's own guidance basically recommends that approach.
Frankly, the length agnostic stuff is a mistake that I hope hardware designers will eventually see the light on, like delay slots.
- camel-cdr - 2 hours ago
  
  > Trying to abstract over SVE with a SIMD library is a bit of a fool's errand
  It really isn't. You just make the default SIMD-width agnostic and anything less portable opt-in.
  You can still specialize for a specific width on scalable vector ISAs.
  > The intended programming model is just too different from traditional ISAs, and there are algorithms that are nearly impossible to write efficiently for it.
  Such as?
  > All the ones I've seen wrap it up as a bastardized fixed length ISA, and even ARM's own guidance basically recommends that approach.
  Google Highway doesn't. And while Arm is stuck with 128-bit SVE, because they also have to implement NEON as fast as possible to be competitive, RVV already has a large diversity of hardware with different vector lengths available: 128, 256, 512, 1024.
  - AlotOfReading - an hour ago
    
    > Such as?
    I have a database that has big columns that get functions applied to them to compute the result set. This is a perfect case for length agnostic instructions, except it ends up horribly memory bound. A nice optimization is to only compute those lanes containing rows that might actually be in the result set by keeping track of a sparse record that depends on the lane size. But the cnt instructions are optional, and this also inhibits compiler optimizations in that lookup.
    
    camel-cdr - an hour ago
    
    CNT and CNTP don't seem to be optional for SVE, from what I found (unless you mean HISTCNT).
    It seems to me like you want to use CNTP on a bitset that tells you which rows are relevant, skipping them if CNT is 0? Is that what you were describing?
- stephbook - 2 hours ago
  
  I'm no C++ dev, but as an outsider, it sure reads like the whole "int is variable length" mistake again.
  - pjmlp - 2 hours ago
    
    That abstraction is occasionally usable in low level systems code, that is why Go, Rust, D and C# support it as well.
    Also note that this is C, not C++.
  - IshKebab - an hour ago
    
    In a way it's worse because at least with int you're not really expecting to run the same binary on architectures with different int lengths, and also for several decades there have only been two realistic options (32 or 64), which makes it easy to deal with.
    With RVV (and SVE I assume) there is a wider range of realistic options - at least 128, 256 and 512. The RVV spec allows up to 65536! Also it's totally reasonable to want a single binary to work with all of them so then you're into compiling parts of your code multiple times with runtime dispatch which is a right pain.
    I haven't looked into how Highway does it but I don't really know how you write length-agnostic code in high level languages. It's easy in assembly, but it sucks if you have to do it in assembly.
    
    camel-cdr - 6 minutes ago
    
    Here is a highway example: https://gcc.godbolt.org/z/7sdPr61W6
    There is a bit of boilerplate to get dynamic dispatch working, but apart from that it's quite simple to use.
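For anyone who does not want to open the link, the general shape of a length-agnostic loop can be modelled in plain scalar C++. This is not Highway's API and not what the linked example contains; `add_arrays` and its `hw_lanes` parameter are made up, with `hw_lanes` standing in for the lane count that a scalable ISA reports at run time.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Scalar model of a length-agnostic (strip-mined) loop. Nothing in the
// loop assumes a particular value of `hw_lanes`.
void add_arrays(const float* a, const float* b, float* out,
                std::size_t n, std::size_t hw_lanes) {
    assert(hw_lanes > 0);
    for (std::size_t i = 0; i < n;) {
        // In the spirit of RVV's vsetvl: process min(remaining, hardware
        // lanes) elements, so the last partial chunk needs no tail loop.
        const std::size_t vl = std::min(n - i, hw_lanes);
        for (std::size_t lane = 0; lane < vl; ++lane) {
            out[i + lane] = a[i + lane] + b[i + lane];
        }
        i += vl;
    }
}
```

On real hardware the inner loop would be a single vector load, add and store under a length or predicate, and the same binary would run unchanged whether the machine reports 4 lanes or 64.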

magicalhippo - 2 days ago

The linked[1] "six reasons to use std::simd" was just what I needed after a long week. Hilarious!

[1]: https://github.com/NoNaeAbC/std_simd

boring-human - 41 minutes ago

It should have been "eight reasons to use std::simd". Inefficient.
mgaunard - 5 hours ago

Isn't that just a QoI issue? There's a reason why the libstdc++ folks labelled their implementation as experimental.
AlotOfReading - 5 hours ago

That certainly convinced me. When I was doing my taxes recently and had to watch those forced loading animations, I kept asking myself "why can't my compiler do this?" Thanks to std::simd, now it can!

ozgrakkurt - 23 minutes ago

Just write inline asm for x86 and aarch64 (if you care about that) and don't care about the rest. Is it even useful to do SIMD on other processors?

Having the compiler optimize even the code around the SIMD code, based on the semantics of arithmetic or other things, sounds silly after writing some of this kind of code.

camel-cdr - 20 minutes ago

So you "just" write 4 assembly implementations?

countWSS - 3 hours ago

GCC already solved it: https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html
> The operations behave like C++ valarrays. Addition is defined as the addition of the corresponding elements of the operands. For example, in the code below, each of the 4 elements in a is added to the corresponding 4 elements in b and the resulting vector is stored in c.
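The code that the quoted passage refers to did not survive the paste. A minimal equivalent, using the `vector_size` attribute that GCC and Clang both accept, looks roughly like this; the wrapper function `add4` is added here only to make the snippet self-contained.

```cpp
// GCC/Clang vector extension: four 32-bit ints in one 16-byte vector type.
typedef int v4si __attribute__((vector_size(16)));

// Element-wise addition: each of the 4 elements of a is added to the
// corresponding element of b.
v4si add4(v4si a, v4si b) {
    return a + b;
}
```

Calling `add4` with `{1, 2, 3, 4}` and `{10, 20, 30, 40}` yields a vector whose elements can be read back with ordinary subscripting, e.g. `c[0]`.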

wahern - an hour ago

Those type attributes are also used for the x86 intrinsics API, and they override default C behaviors like promotions and presumptions around aliasing (ironically they make type punning easier, though maybe it was just the few use cases I explored, and this isn't an area where I have a lot of experience). C23 also gained the _BitInt type, which discards all the old promotion rules, which should help autovectorization.
I think ISPC is still the proper way to go. But these days everybody wants One Language to Rule Them All along with standard libraries for doing everything out-of-the-box. And while in principle ISPC's approach could be stitched into C or C++ in a fairly clean manner (perhaps with well-defined and enforced segregation of constructs to minimize complexity), it's just not gonna happen: C++ is too enamored with constructing libraries through deeply complex templated types (hammer, nail, yada yada), and C is just too conservative (though if GCC or clang went the distance with a full implementation, there's a good chance the C committee would adopt it).
Someone - 2 hours ago

And these are also available in clang. https://clang.llvm.org/docs/LanguageExtensions.html#vectors-...:
“Vectors and Extended Vectors
Supports the GCC, OpenCL, AltiVec, NEON, SVE and RVV vector extensions”
groundzeros2015 - 2 hours ago

Thanks!

jcranmer - 2 hours ago

If you thought std::simd was a library nobody asked for, just wait until you hear about <linalg>. I feel like half the people looking forward to that think they're just going to get standard C++ bindings to LAPACK, when instead they're probably going to get an unoptimized, slapdash implementation of LAPACK written by people who aren't good at BLAS.

As for SIMD itself, designing a good SIMD library is difficult because there are several different SIMD approaches and some of them work poorly for certain use cases. For example, you can take an HPC-ish approach of "vectorize this loop" (à la #pragma omp simd) and have the compiler take care of a fairly mechanical transformation. Or you can take an opposite approach of treating a 128-bit SIMD vector as a fundamental data type in your language. Which approach is better depends on your use case.

pwdisswordfishq - 39 minutes ago

Just wait until you hear about std::hive.
The work of one obsessive author, who never gave a good explanation for why the thing needed to be in the standard library instead of an external one. The committee was apathetic about the proposal and kept bringing up various trivial issues, in a clear attempt to stall him, but he refused to take the hint. So eventually they relented. Outside coverage I have seen so far seems to be to the tune of "WTF is this weird thing?" and quickly glosses over it.
I wonder if it's going to end up like the export keyword.

zombot - an hour ago

The article's point in a nutshell:

> The problem is that std::simd in 2026 is the 2012 solution arriving after the world moved on. The committee spent a decade polishing a library-based approach while compilers solved the easy cases automatically and ISPC solved the hard cases with language-level support.

I find it interesting that the C++ committee would make that kind of mistake. Shouldn't they know better?

plasticeagle - 2 hours ago

Nobody should read that AI slop article. Nobody.

Maybe there's an interesting story in there, it's certainly possible. But the "author" could not be bothered to write it, and so why should we waste our time reading it?

pjmlp - 2 hours ago

I love how people praise Claude for doing their work, every day on HN, while at the same time complaining about AI in articles.
- cenamus - an hour ago
  
  Who says these are the same people?
  - pjmlp - an hour ago
    
    Statistics.
- bakugo - an hour ago
  
  Glad to see the classic goomba fallacy in action even here on HN.
  - jjmarr - an hour ago
    
    I praise Claude and hate AI articles because I could've asked Claude to dumb down the debate if I wanted.
    Articles should be high information density and summarizable with Claude.
usrnm - 2 hours ago

I read it and found it interesting
chris_wot - 2 hours ago

... because it makes some decent points?

raverbashing - 27 minutes ago

sigh

C++ sits on that weird abstraction level where it wants to be a higher level language but it keeps grinding its gears on stuff like pointer sizes, pointer arithmetic or vector sizes, and at the same time wants to keep being C compatible and needs that interface with the lower level world.

Now compare with how numpy does things: you care about the data size but not the implementation.

Still, I didn't expect less (of a crap fest) from the C++ committee as presented here

fithisux - an hour ago

Why is just writing inline assembly not enough?

You optimize for a specific target.

The problem is that you cannot be cross-platform. Sure.

But that is why software is incremental.

I write for my HW, not yours. You can write for yours.

Make folders with implementations

x86_v1 x86_v2 arm64 riscv64 ... ... ...

and include

ori_b - 2 hours ago

Slop.