SIMD Perlin Noise: Beating the Compiler with SSE (2014)

scallywag.software

62 points by homarp 6 days ago


rincebrain - 3 days ago

Did you profile the results with different compilers?

The last time I tried doing this kind of microoptimization for fun, I ended up bundling actual assembly files, because the generated assembly for intrinsics was so variable in performance across compilers it was the only way to get consistent results on many platforms.

addaon - 3 days ago

Memories. As a personal project back in... 2003?... I decided to do something similar: implement 4D Perlin noise in AltiVec assembly. The only problem was that I had a G3 iBook, so I would write one instruction of assembly, then write a C function to interpret that assembly, building an interpreter for a very selective subset of PPC w/ AltiVec that ran (slooooowly) on the G3. As I recall I got it down to ~200 instructions, and it worked perfectly the first time I ran it on a G4, which was pretty rewarding. Took me more than half a day, though. On an unrelated note, I got an internship with Apple's performance team that summer.

jesse__ - 3 days ago

Author here, AMA :)

dragontamer - 3 days ago

SSE?

It made sense in 2014, when AVX was not yet widely deployed. Today you could double the throughput with AVX and probably get a 4x improvement with AVX-512.
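To put that in code terms, the win comes purely from lane width. A minimal sketch (mine, not from the article) of the same scale-by-a-constant loop at SSE and AVX widths, assuming 32-byte-aligned buffers and a length that's a multiple of 8, with the second function built with -mavx:

    #include <immintrin.h>
    #include <cstddef>

    // SSE: 4 float lanes per iteration.
    void scale_sse(float* v, float s, std::size_t n) {
        __m128 vs = _mm_set1_ps(s);
        for (std::size_t i = 0; i < n; i += 4)
            _mm_store_ps(v + i, _mm_mul_ps(_mm_load_ps(v + i), vs));
    }

    // AVX: 8 float lanes per iteration -- same loop, double the work per step.
    void scale_avx(float* v, float s, std::size_t n) {
        __m256 vs = _mm256_set1_ps(s);
        for (std::size_t i = 0; i < n; i += 8)
            _mm256_store_ps(v + i, _mm256_mul_ps(_mm256_load_ps(v + i), vs));
    }

AVX-512 is the same pattern again with __m512 and 16 lanes, which is where the "probably 4x" comes from (modulo downclocking).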

GPU SIMD is also popular, but Perlin noise on its own seems too 'small' a problem to be worth a round trip over PCIe. If you had a GPU shader that needed Perlin noise as an input, though, I'd expect the same methodology to map easily onto the GPU.

It is worth revisiting how the different techniques worked out over the last decade. Strangely enough, CUDA code from 2014 would likely still work today (Perlin noise doesn't need newer GPU features like 4x4 FP16 matrix instructions or ray tracing cores).

OpenCL is IMO the wrong path, though, even if it's what many would have picked for GPU work in the 2014 era.

twoodfin - 3 days ago

The year on this article should be (2024).

llm_nerd - 3 days ago

HN loves SIMD, and there's a "how I hand-crafted a SIMD optimization" post doing numbers on here regularly. They're fun posts, and they absolutely speak to the fact that writing code that optimizing compilers can robustly and comprehensively turn into good SIMD code is something of a black art.

Which is why you, generally, shouldn't be doing either. You shouldn't rely upon the compiler to figure out your intentions, and you shouldn't be writing SIMD instructions directly unless you're writing a SIMD library or an optimizing compiler.

Instead you should reach for one of the many available libraries that not only force you into structuring your data and calls appropriately for SIMD goodness, but are also massively more portable and powerful.

Google's Highway, for instance, lets you write against its abstracted SIMD functions and handles the optimization whether your target is SSE2-4, AVX, AVX2, AVX-512, AVX10, ARM NEON or SVE (at any conceivable vector size), WASM's weird SIMD functions, RISC-V's RVV, and several more. When new widths and new options come out, the library adds the support, and you might not have to change your code at all.
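For flavor, here's roughly what that looks like -- a minimal static-dispatch sketch under names of my own choosing (a multiply-add kernel, not anything from the article):

    #include <cstddef>
    #include <hwy/highway.h>

    namespace hn = hwy::HWY_NAMESPACE;

    // out[i] = a[i] * b[i] + c, at whatever lane width the build target offers.
    void MulAddBuffer(const float* a, const float* b, float c,
                      float* out, std::size_t n) {
        const hn::ScalableTag<float> d;   // "all the float lanes this target has"
        const std::size_t N = hn::Lanes(d);
        const auto vc = hn::Set(d, c);
        std::size_t i = 0;
        for (; i + N <= n; i += N) {
            const auto va = hn::LoadU(d, a + i);
            const auto vb = hn::LoadU(d, b + i);
            hn::StoreU(hn::MulAdd(va, vb, vc), d, out + i);
        }
        for (; i < n; ++i)                // scalar tail
            out[i] = a[i] * b[i] + c;
    }

The same source compiles down to SSE4, AVX2, AVX-512, NEON, or RVV depending on the target, and Highway's dynamic-dispatch mode can pick the best available one at runtime.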

There are loads of libraries like this (xsimd, EVE, SIMDe, etc). They all force you into thinking about structuring your code in a manner that is SIMDable -- instead of hoping the optimizing compiler will figure it out on its own -- and they cover a vast trove of SIMD targets without hand-writing for each one.
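The structural shift these libraries push you toward is mostly about data layout. A hypothetical before/after (my example, not from the article):

    #include <vector>

    // AoS: fields interleave in memory (x0 y0 z0 x1 y1 z1 ...), so a vector
    // load grabs a mix of fields and you shuffle before doing any math.
    struct PointAoS { float x, y, z; };

    // SoA: each field is contiguous (x0 x1 x2 ...), so a vector load grabs
    // N consecutive x values -- the layout every SIMD target actually wants.
    struct PointsSoA {
        std::vector<float> x, y, z;
    };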

I was going to quickly rewrite the example in Highway just to demonstrate, but the Perlin stuff seems to be missing or significantly restructured.

"But that is obvious and I'm mad that you commented this" - no, it isn't obvious whatsoever, and this "I hand-rolled some SSE now my app is super awesome look at the microbenchmark results on a very narrow, specific machine" content appears on here regularly, betraying a pretty big influence of beginners who don't know that it's almost certainly the wrong approach.