Discussion
erichocean: Ideal HN content, thanks!
drsopp: Did some quick calculations, and at this precision, it seems a table lookup might be able to fit in the L1 cache depending on the CPU model.
Pannoniae: Microbenchmarks. A LUT will win many of them but you pessimise the rest of the code. So unless a significant (read: 20+%) portion of your code goes into the LUT, there isn't that much point to bother. For almost any pure calculation without I/O, it's better to do the arithmetic than to do memory access.
LegionMammal978: In general, I find that minimax approximation is an underappreciated tool, especially the quite simple Remez algorithm to generate an optimal polynomial approximation [0]. With some modifications, you can optimize for either absolute or relative error within an interval, or even come up with rational-function approximations. (Though unfortunately, many presentations of the algorithm use overly-simple forms of sample point selection that can break down on nontrivial input curves, especially if they contain small oscillations.)

[0] https://en.wikipedia.org/wiki/Remez_algorithm
AlotOfReading: I'm pretty sure it's not faster, but it was fun to write:

    float asin(float x) {
        float x2 = 1.0f - fabsf(x);
        uint32_t i;
        memcpy(&i, &x2, sizeof i);     /* bit-reinterpret the float */
        i = 0x5f3759df - (i >> 1);     /* fast inverse square root magic */
        float inv;
        memcpy(&inv, &i, sizeof inv);  /* inv ~ 1/sqrt(x2), so x2*inv ~ sqrt(x2) */
        return copysignf(1.5707963f - 1.5707963f * (x2 * inv), x);
    }

Courtesy of evil floating point bithacks.
moffkalast:
> float asinine(float x) {

FTFY :P
stephc_int13: My favorite tool to experiment with math approximation is lolremez. And you can easily ask your LLM to do it for you.
patchnull: The huge gap between the Intel (1.5x) and M4 (1.02x) speedups is the most interesting result here. Apple almost certainly uses similar polynomial approximations inside their libm already, tuned for the M-series pipeline. glibc on x86 tends to be more conservative about precision, leaving more performance on the table. The Cg version is Abramowitz and Stegun formula 4.4.45, which has been a staple in shader math for decades. Funny how knowledge gets siloed: game devs and GPU folks have known about this class of approximation forever, but it rarely crosses into general systems programming.
stephencanon: These sorts of approximations (and more sophisticated methods) are fairly widely used in systems programming, as seen by the fact that Apple's asin is only a couple percent slower and sub-ulp accurate (https://members.loria.fr/PZimmermann/papers/accuracy.pdf). I would expect to get similar performance on non-Apple x86 using Intel's math library, which does not seem to have been measured, and significantly better performance while preserving accuracy using a vectorized library call.

The approximation reported here is slightly faster but only accurate to about 2.7e11 ulp. That's totally appropriate for the graphics use in question here, but no one would ever use it for a system library; less than half the bits are good.

Also worth noting that it's possible to go faster without further loss of accuracy: the approximation uses a correctly rounded square root, which is much more accurate than the rest of the approximation deserves. An approximate square root will deliver the same overall accuracy and much better vectorized performance.
Pannoniae: Yeah, the only big problem with approx. sqrt is that it's not consistent across systems, for example Intel and AMD implement RSQRT differently... Fine for graphics, but if you need consistency, that messes things up.
jcalvinowens: Surely the loss in precision of a 32KB LUT for double precision asin() would be unacceptable?
Jyaif: By smartly (= not just linearly) interpolating between values you can get excellent results with LUTs much smaller than 32KB.
jason_s: Not sure I would call Remez "simple"... it's all relative; I prefer Chebyshev approximation, which is simpler than Remez.
stephencanon: Ideally either one is just a library call to generate the coefficients. Remez can get into trouble near the endpoints of the interval for asin and require a little bit of manual intervention, however.
herf: They teach a lot of Taylor/Maclaurin series in math classes (and sometimes CORDIC, an older method for computing trig functions), but these are not used much in actual FPUs and libraries. Maybe we should update the curricula so people know better ways.
jcalvinowens: I'm very skeptical you wouldn't get perceptible visual artifacts if you rounded the trig functions to 4096 linear approximations. But I'd be happy to be proven wrong :)
adampunk: // what the fuck
ok123456: The Chebyshev expansion of asin is asin(x) = sum over odd n of 4*T_n(x) / (pi*n^2); the even terms are 0.
drsopp: I experimented a bit with the code, trying tables with various data types. There is enough noise from the Monte Carlo sampling that using data types smaller than double or float makes no difference. Even dropping interpolation worked fine, and brought the speed on par with the best in the article, but not faster.