Faster asin() was hiding in plain sight

16bpp.net

260 points by def-pri-pub 5 days ago


jason_s - 5 days ago

While I'm glad to see the OP got a good minimax solution at the end, it seems like the article missed clarifying one of the key points: error waveforms over a specified interval are critical, and if you don't see the characteristic minimax-like wiggle, you're wasting an easy opportunity for improvement.

Taylor series in general are a poor choice, and Pade approximants of Taylor series are equally poor. If you're going to use Pade approximants, they should be of the original function.

I prefer Chebyshev approximation: https://www.embeddedrelated.com/showarticle/152.php which is often close enough to the more complicated Remez algorithm.
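To make the "wiggle" point concrete, here's a quick sketch (my own code, not the article's; a Chebyshev interpolant stands in for a true minimax fit, and the interval and degree are arbitrary):

```python
import numpy as np

# Compare a truncated Taylor polynomial of asin with a same-degree
# Chebyshev interpolant on [-0.75, 0.75]. The Chebyshev fit is
# near-minimax: its error wiggles with roughly equal peaks instead of
# piling up at the ends of the interval.
a, deg = 0.75, 9

def taylor(x):  # degree-9 Taylor polynomial of asin about 0
    return x + x**3/6 + 3*x**5/40 + 15*x**7/336 + 105*x**9/3456

# Interpolate asin at Chebyshev nodes of [-a, a].
nodes = a * np.cos(np.pi * (np.arange(deg + 1) + 0.5) / (deg + 1))
coef = np.polynomial.chebyshev.chebfit(nodes, np.arcsin(nodes), deg)

x = np.linspace(-a, a, 10001)
err_taylor = np.max(np.abs(np.arcsin(x) - taylor(x)))
err_cheb = np.max(np.abs(np.arcsin(x) - np.polynomial.chebyshev.chebval(x, coef)))
print(f"max |err| Taylor: {err_taylor:.1e}  Chebyshev: {err_cheb:.1e}")
```

Plot the two error curves and the difference is obvious: the Taylor error is tiny near 0 and explodes toward the endpoints, while the Chebyshev error oscillates evenly across the whole interval.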

xt00 - 4 days ago

To be accurate, this is originally from Cecil Hastings, "Approximations for Digital Computers" (Princeton, 1955), pages 159-163; there are actually multiple versions of the approximation with different constants. So the original work was done with the goal of being performant on computers of the 1950s. The famous Abramowitz and Stegun then put it in as formula 4.4.45, with permission, and the NVIDIA Cg library wrote some code based on that formula, likely with some optimizations.

exmadscientist - 4 days ago

This line:

> This amazing snippet of code was languishing in the docs of dead software, which in turn the original formula was scrawled away in a math textbook from the 60s.

was kind of telling for me. I have some background in this sort of work (and long ago concluded that there was pretty much nothing you can do to improve on existing code, unless either you have some new specific hardware or domain constraint, or you're just looking for something quick-n-dirty for whatever reason, or are willing to invest research-paper levels of time and effort) and to think that someone would call Abramowitz and Stegun "a math textbook from the 60s" is kind of funny. It's got a similar level of importance to its field as Knuth's Art of Computer Programming or stuff like that. It's not an obscure text. Yeah, you might forget what all is in it if you don't use it often, but you'd go "oh, of course that would be in there, wouldn't it...."

LegionMammal978 - 5 days ago

In general, I find that minimax approximation is an underappreciated tool, especially the quite simple Remez algorithm to generate an optimal polynomial approximation [0]. With some modifications, you can adapt it to optimize for either absolute or relative error within an interval, or even come up with rational-function approximations. (Though unfortunately, many presentations of the algorithm use overly-simple forms of sample point selection that can break down on nontrivial input curves, especially if they contain small oscillations.)

[0] https://en.wikipedia.org/wiki/Remez_algorithm
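To illustrate, here's a bare-bones sketch of the exchange (my own code, absolute error only, and with exactly the naive sign-run extrema selection the parenthetical warns about; real implementations guard this step far more carefully):

```python
import numpy as np

# Minimal Remez exchange for a degree-deg minimax polynomial of f on
# [lo, hi]: solve the levelled-error system on a reference of deg+2
# points, move the reference to the extrema of the error curve, repeat.
def remez(f, lo, hi, deg, iters=10):
    m = deg + 2
    # initial reference: Chebyshev-extrema-like nodes
    x = (lo + hi) / 2 + (hi - lo) / 2 * np.cos(np.pi * np.arange(m) / (m - 1))
    x.sort()
    grid = np.linspace(lo, hi, 20001)
    c = None
    for _ in range(iters):
        # solve sum_j c_j * x_i**j + (-1)**i * E = f(x_i) for c_0..c_deg, E
        A = np.hstack([np.vander(x, deg + 1, increasing=True),
                       ((-1.0) ** np.arange(m))[:, None]])
        c = np.linalg.solve(A, f(x))[:-1]
        err = f(grid) - np.polynomial.polynomial.polyval(grid, c)
        # new reference: one extremum of |err| per sign run (the naive
        # selection; it can break down on curves with small oscillations)
        runs = np.split(np.arange(len(grid)),
                        np.where(np.diff(np.sign(err)))[0] + 1)
        peaks = [r[np.argmax(np.abs(err[r]))] for r in runs]
        if len(peaks) != m:
            break  # degenerate reference; keep the last good solution
        x = grid[peaks]
    return c

c = remez(np.arcsin, -0.5, 0.5, 7)
g = np.linspace(-0.5, 0.5, 10001)
max_err = np.max(np.abs(np.arcsin(g) - np.polynomial.polynomial.polyval(g, c)))
print(f"degree-7 fit of asin on [-0.5, 0.5]: max error ~ {max_err:.1e}")
```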

AlotOfReading - 5 days ago

I'm pretty sure it's not faster, but it was fun to write:

    #include <math.h>
    #include <stdint.h>
    #include <string.h>

    float fast_asin(float x) {  /* renamed: don't collide with libm's asin */
      float x2 = 1.0f - fabsf(x), inv;
      uint32_t i;
      memcpy(&i, &x2, sizeof i);      /* bit-cast float -> u32 */
      i = 0x5f3759df - (i >> 1);      /* fast inverse-sqrt magic constant */
      memcpy(&inv, &i, sizeof inv);   /* bit-cast back */
      /* pi/2 - pi/2*x2*rsqrt(x2) ~ pi/2*(1 - sqrt(1 - |x|)) */
      return copysignf(1.57079633f - 1.57079633f * (x2 * inv), x);
    }
Courtesy of evil floating point bithacks.

cmovq - 4 days ago

> After all of the above work and that talk in mind, I decided to ask an LLM.

Impressive that an LLM managed to produce the answer from a 7 year old stack overflow answer all on its own! [1] This would have been the first search result for “fast asin” before this article was published.

[1]: https://stackoverflow.com/a/26030435

scottlamb - 5 days ago

Isn't the faster approach SIMD [edit: or GPU]? A 1.05x to 1.90x speedup is great. A 16x speedup is better!

They could be orthogonal improvements, but if I were prioritizing, I'd go for SIMD first.

I searched for asin in Intel's intrinsics guide. They have an AVX-512 intrinsic `_mm512_asin_ps`, but it says "sequence" rather than single-instruction. Presumably the actual sequence they use is in some header file somewhere, but I don't know off-hand where to look, so I don't know how it compares to a SIMDified version of `fast_asin_cg`.

https://www.intel.com/content/www/us/en/docs/intrinsics-guid...

orangepanda - 5 days ago

> Nobody likes throwing away work they've done

I like throwing away work I've done. Frees up my mental capacity for other work to throw away.

WithinReason - 4 days ago

> In any graphics application trigonometric functions are frequently used.

Counterpoint from the man himself, "avoiding trigonometry":

https://iquilezles.org/articles/noacos/

sixo - 4 days ago

It appears that the real lesson here was to lean quite a bit more on theory than a programmer's usual roll-your-own heuristic would suggest.

A fantastic amount of collective human thought has been dedicated to function approximations in the last century; Taylor methods are over 200 years old and unlikely to come close to state-of-the-art.

tzs - 3 days ago

In that Padé approximant I think you can save a couple multiplications.

As written it does this:

  n = 1 - 367/714 * x**2
  d = 1 - 81/119 * x**2 + 183/4760 * x**4
  return x * (n/d)
That's got 7 multiplies (I'm counting a divide as a multiply) and 3 additions. (I'm assuming the optimizer computes x**2 only once, computes x**4 by squaring x**2, and that all the constants are calculated at compile time.)

Replace n/d with 1/(d/n) and then replace d/n with q + r/n where q is the quotient of polynomial d divided by polynomial n and r is the remainder.

This is the result:

  n = 1 - 367/714 * x**2
  q = 1587627/1346890 - 549/7340 * x**2
  r = -240737/1346890
  return x / (q + r/n)
That's got 5 multiplies and 3 additions.

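The rearrangement can be checked exactly with rational arithmetic (my own snippet, using the constants above):

```python
from fractions import Fraction as F

# With t = x**2: dividing the denominator polynomial d(t) by the
# numerator n(t) gives quotient q(t) and a constant remainder r, so
# d = q*n + r and therefore n/d = 1/(q + r/n). Checking at a few
# rational points (enough for these low degrees):
for t in (F(0), F(1, 3), F(3, 7), F(9, 10)):
    n = 1 - F(367, 714) * t
    d = 1 - F(81, 119) * t + F(183, 4760) * t**2
    q = F(1587627, 1346890) - F(549, 7340) * t
    r = F(-240737, 1346890)
    assert d == q * n + r              # exact polynomial-division identity
    assert n / d == 1 / (q + r / n)    # the two forms agree exactly
print("identities hold")
```
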
kazinator - 3 days ago

The glibc implementation already tests for several ranges and has hacks for each of them, including Taylor series:

https://github.com/lattera/glibc/blob/master/sysdeps/ieee754...

The smallest range is |x| < 1.49011611938477e-8. In this case the routine just returns x, after calling some underflow-checking routine.

So right there, if we neglect this detail in our own wrapper, we may be able to get a speedup, at the cost of sending very small values to the Taylor series.

The next smallest range tested is |x| < 0.125, and after that |x| < 0.5 and 0.75.

The authors are certainly not missing any brilliant trick hiding in plain sight; they are doing a more assiduous job.
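The smallest range is easy to see in action (my own demo, using the host libm via Python; 1.49e-8 is roughly 2**-26):

```python
import math

# Below ~2**-26 the cubic term x**3/6 of asin's series is smaller than
# half an ulp of x in double precision, so the correctly rounded result
# is x itself -- the routine can just return its argument.
x = 1e-8                      # below the 1.49e-8 threshold
assert math.asin(x) == x      # rounds to x exactly

x = 1e-3                      # well above the threshold
assert math.asin(x) != x      # the x**3/6 term is now visible
print("small-angle shortcut confirmed")
```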

erichocean - 5 days ago

Ideal HN content, thanks!

djmips - 4 days ago

To the blog poster:

Robin Green is an excellent resource

Faster Math Functions:

https://basesandframes.wordpress.com/wp-content/uploads/2016...

https://basesandframes.wordpress.com/wp-content/uploads/2016...

Even faster math functions GDC 2020:

https://www.gdcvault.com/play/1027337/Math-in-Game-Developme...

adampunk - 5 days ago

We love to leave faster functions languishing in library code. The basis for Q3A’s fast inverse square root had been sitting in fdlibm since 1986, on the net since 1993: https://www.netlib.org/fdlibm/e_sqrt.c

Skeime - 4 days ago

Wouldn't it also be much better to evaluate the Taylor polynomials using Horner's method instead? (Maybe C++ can do this automatically, but given that there might be rounding differences, it probably won't.)
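For illustration (my own sketch, using the degree-7 Taylor polynomial of asin; since only odd powers appear, the Horner form nests in t = x**2):

```python
import math

# Naive evaluation rebuilds each power of x; Horner's method uses one
# multiply-add per coefficient. The two can differ in the last ulps,
# which is why a compiler won't reassociate without -ffast-math.
def asin_taylor_naive(x):
    return x + x**3 / 6 + 3 * x**5 / 40 + 15 * x**7 / 336

def asin_taylor_horner(x):
    t = x * x
    return x * (1 + t * (1/6 + t * (3/40 + t * (15/336))))

x = 0.3
print(asin_taylor_naive(x), asin_taylor_horner(x), math.asin(x))
```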

tyleo - 3 days ago

I had to do an atan() on a slow embedded device once for an autonomous robot competition.

Fastest impl I came up with was rounding and a big switch statement.

glitchc - 5 days ago

The 4% improvement doesn't seem like it's worth the effort.

On a general note, instructions like division and square root are roughly equal to trig functions in cycle count on modern CPUs. So replacing one with the other will not confer much benefit, as evidenced by the results. They're all typically implemented using LUTs, and it's hard to beat the performance of an optimized LUT, which is basically a multiplexer connected to some dedicated memory cells in hardware.

coloneljelly - 4 days ago

In DSP math, it is common to use Chebyshev polynomial approximation. You can get incredibly precise results within your required bounds.

andyjohnson0 - 4 days ago

Interesting article. A few years back I implemented a bunch of maths primitives, including trig functions, using Taylor series etc., to see how it was done. An interesting challenge, even at the elementary level I was working at.

So this article got me wondering how much accuracy is needed before computing a series beats pre-computed lookup tables and interpolation. Anyone got any relevant experience to share?

How much accuracy does ray tracing require?
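Not an answer on the crossover point, but here's a rough sketch of the table side (my own code; the table size and test interval are arbitrary):

```python
import math

# A 1024-entry table of asin over [-1, 1] with linear interpolation.
# Accuracy is good in the interior but degrades near +/-1, where
# asin's slope blows up -- which is why real implementations switch
# to a different formula (e.g. via sqrt) near the endpoints.
N = 1024
table = [math.asin(-1 + 2 * i / N) for i in range(N + 1)]

def asin_lut(x):
    u = (x + 1) * N / 2            # position in table units
    i = min(int(u), N - 1)         # left table entry
    f = u - i                      # fractional part for the lerp
    return table[i] * (1 - f) + table[i + 1] * f

worst = max(abs(asin_lut(k / 1000) - math.asin(k / 1000))
            for k in range(-900, 901))   # interior only: |x| <= 0.9
print(f"max |error| on [-0.9, 0.9]: {worst:.2e}")
```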

peterabbitcook - 4 days ago

I am curious: did you check how much your benchmarks moved (time and errors), if at all, if you told the compiler to use --use_fast_math or -ffast-math?

There’s generally not a faster version of inverse trig functions to inline, but it might optimize some other stuff out.

Unrelated to that, I’ve seen implementations (ie julia/base/special/trig) that use a “rational approximation” to asin, did you go down that road at any point?

- 4 days ago
[deleted]
debo_ - 4 days ago

https://bash-org-archive.com/?427792

Ono-Sendai - 4 days ago

Here's my fast acos, which I think can be converted to an asin: https://forwardscattering.org/post/66

empiricus - 4 days ago

Does anyone know of resources for the algorithms used in the HW implementations of math functions? I mean the algorithms inside the CPUs and GPUs: how they trade off transistor count, power consumption, and cycles, and which algorithms allow this.

veltas - 4 days ago

Just a point that the constexpr/const use in that C++ code makes no difference to output, and is just noise really.

- 5 days ago
[deleted]
drsopp - 5 days ago

Did some quick calculations, and at this precision, it seems a table lookup might be able to fit in the L1 cache, depending on the CPU model.

varispeed - 4 days ago

If you are interested in such "tricks", you should check out the classic Hacker's Delight by Henry Warren.

stephc_int13 - 5 days ago

My favorite tool for experimenting with math approximations is lolremez. And you can easily ask your LLM to do it for you.

ok123456 - 4 days ago

Chebyshev approximation for asin is sum(4*T_n(x) / (pi*n*n), n); the even terms are 0. (The factor 4 is what makes the endpoint asin(1) = pi/2 come out right.)
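Summing that series numerically (my own snippet; the coefficient 4/(pi*n*n) on the odd terms is pinned down by asin(1) = (4/pi) * pi**2/8 = pi/2):

```python
import math

# Partial sum of the Chebyshev series of asin, with T_n(x) generated
# by the recurrence T_{n+1} = 2x*T_n - T_{n-1}. Convergence is only
# O(1/n) at the endpoints, so this is a sanity check, not a fast asin.
def asin_cheb(x, terms=2000):
    t_prev, t_cur = 1.0, x            # T_0, T_1
    total = 4 / math.pi * x           # n = 1 term
    for n in range(2, terms):
        t_prev, t_cur = t_cur, 2 * x * t_cur - t_prev
        if n % 2 == 1:                # even-n coefficients vanish
            total += 4 / (math.pi * n * n) * t_cur
    return total

x = 0.3
print(asin_cheb(x), math.asin(x))
```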

- 5 days ago
[deleted]
patchnull - 4 days ago

[flagged]

patchnull - 5 days ago

[flagged]