SIMD Perlin Noise: Beating the Compiler with SSE (2014)

scallywag.software

62 points by homarp 6 days ago


rincebrain - 3 days ago

Did you profile the results with different compilers?

The last time I tried doing this kind of microoptimization for fun, I ended up bundling actual assembly files, because the generated assembly for intrinsics was so variable in performance across compilers it was the only way to get consistent results on many platforms.

addaon - 3 days ago

Memories. As a personal project back in... 2003?... I decided to do something similar: implement 4D Perlin noise in AltiVec assembly. The only problem was that I had a G3 iBook, so I would write one instruction of assembly, then write a C function to interpret that assembly, building an interpreter for a very selective subset of PPC w/ AltiVec that ran (slooooowly) on the G3. As I recall I got it down to ~200 instructions, and it worked perfectly the first time I ran it on a G4, which was pretty rewarding. Took me more than half a day, though. On an unrelated note, I got an internship with Apple's performance team that summer.

jesse__ - 3 days ago

Author here, AMA :)

dragontamer - 3 days ago

SSE?

It made sense in 2014, when AVX was not yet widely deployed. Today you could double the throughput with AVX and probably get a 4x improvement with AVX-512.
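To put that in code terms, the win comes purely from lane width. A minimal sketch (mine, not from the article) of the same scale-by-a-constant loop at SSE and AVX widths, assuming 32-byte-aligned buffers and a length that's a multiple of 8, with the second function built with -mavx:

    #include <immintrin.h>
    #include <cstddef>

    // SSE: 4 float lanes per iteration.
    void scale_sse(float* v, float s, std::size_t n) {
        __m128 vs = _mm_set1_ps(s);
        for (std::size_t i = 0; i < n; i += 4)
            _mm_store_ps(v + i, _mm_mul_ps(_mm_load_ps(v + i), vs));
    }

    // AVX: 8 float lanes per iteration -- same loop, double the work per step.
    void scale_avx(float* v, float s, std::size_t n) {
        __m256 vs = _mm256_set1_ps(s);
        for (std::size_t i = 0; i < n; i += 8)
            _mm256_store_ps(v + i, _mm256_mul_ps(_mm256_load_ps(v + i), vs));
    }

AVX-512 is the same pattern again with __m512 and 16 lanes, which is where the "probably 4x" comes from (modulo downclocking).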

GPU SIMD is also popular, but Perlin noise on its own seems too 'small' a problem to be worth a round trip over PCIe. If you had a GPU shader that needed Perlin noise as an input, though, I'd expect the same methodology to map easily onto the GPU.

It is worth revisiting how the different techniques worked out over the last decade. Strangely enough, CUDA code from 2014 would likely still work today (Perlin noise doesn't need newer GPU features like 4x4 FP16 matrix instructions or ray tracing cores).

OpenCL is IMO the wrong path, though, even if it's what many would have picked for GPU work in the 2014 era.

twoodfin - 3 days ago

The year on this article should be (2024).

llm_nerd - 3 days ago

HN loves SIMD, and there's a "how I hand-crafted a SIMD optimization" post doing numbers on here regularly. They're fun posts, and they absolutely speak to the fact that writing code that optimizing compilers can robustly and comprehensively turn into good SIMD code is something of a black art.

Which is why you, generally, shouldn't be doing either. You shouldn't rely upon the compiler to figure out your intentions, and you shouldn't be writing SIMD instructions directly unless you're writing a SIMD library or an optimizing compiler.

Instead you should reach for one of the many available libraries that not only force you into structuring your data and calls appropriately for SIMD goodness, but are also massively more portable and powerful.

Google's Highway, for instance, lets you write against its abstracted SIMD functions and handles the optimization whether your target is SSE2-4, AVX, AVX2, AVX-512, AVX10, ARM NEON or SVE (at any conceivable vector size), WASM's weird SIMD functions, RISC-V's RVV, and several more. When new widths and new options come out, the library adds the support, and you might not have to change your code at all.
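For flavor, here's roughly what that looks like -- a minimal static-dispatch sketch under names of my own choosing (a multiply-add kernel, not anything from the article):

    #include <cstddef>
    #include <hwy/highway.h>

    namespace hn = hwy::HWY_NAMESPACE;

    // out[i] = a[i] * b[i] + c, at whatever lane width the build target offers.
    void MulAddBuffer(const float* a, const float* b, float c,
                      float* out, std::size_t n) {
        const hn::ScalableTag<float> d;   // "all the float lanes this target has"
        const std::size_t N = hn::Lanes(d);
        const auto vc = hn::Set(d, c);
        std::size_t i = 0;
        for (; i + N <= n; i += N) {
            const auto va = hn::LoadU(d, a + i);
            const auto vb = hn::LoadU(d, b + i);
            hn::StoreU(hn::MulAdd(va, vb, vc), d, out + i);
        }
        for (; i < n; ++i)                // scalar tail
            out[i] = a[i] * b[i] + c;
    }

The same source compiles down to SSE4, AVX2, AVX-512, NEON, or RVV depending on the target, and Highway's dynamic-dispatch mode can pick the best available one at runtime.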

There are loads of libraries like this (xsimd, EVE, SIMDe, etc). They all force you into thinking about structuring your code in a manner that is SIMDable -- instead of hoping the optimizing compiler will figure it out on its own -- and they cover a vast trove of SIMD targets without hand-writing for each one.
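The structural shift these libraries push you toward is mostly about data layout. A hypothetical before/after (my example, not from the article):

    #include <vector>

    // AoS: fields interleave in memory (x0 y0 z0 x1 y1 z1 ...), so a vector
    // load grabs a mix of fields and you shuffle before doing any math.
    struct PointAoS { float x, y, z; };

    // SoA: each field is contiguous (x0 x1 x2 ...), so a vector load grabs
    // N consecutive x values -- the layout every SIMD target actually wants.
    struct PointsSoA {
        std::vector<float> x, y, z;
    };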

I was going to quickly rewrite the example in Highway just to demonstrate, but the Perlin stuff seems to be missing or significantly restructured.

"But that is obvious and I'm mad that you commented this" - no, it isn't obvious whatsoever, and this "I hand-rolled some SSE now my app is super awesome look at the microbenchmark results on a very narrow, specific machine" content appears on here regularly, betraying a pretty big influence of beginners who don't know that it's almost certainly the wrong approach.