Fast calculation of square root and its multiplicative inverse in float

Knowing the "0x5F3759DF" algorithm (the "fast inverse square root"), it is easy to imagine a great number of variants of it for calculating the square root and its multiplicative inverse in single-precision floating point, each with its own speed and precision.

Sometimes they do not pay off, for example when inline intrinsics give access to machine instructions, but at other times no such feature is available and these algorithms are useful. It may even be that better ways of calculating these values are known.

What are the best-known algorithms for sqrt and invsqrt in terms of accuracy and performance?

1 answer

Introduction

Algorithms derived from 0x5F3759DF begin by using so-called "magic numbers" in integer arithmetic on the float's bit pattern to form a code that approximates the solution; the accuracy can then be refined with floating-point operations, normally applying the Newton method up to three times (most often only once or twice).

In addition, for better performance, special cases (arguments close to zero, negative arguments, NaN, etc.) are not handled; the algorithms focus on the common cases and leave those treatments to the application as needed.
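For reference, here is a minimal sketch of the best-known member of that family (the classic formulation with one Newton step; the function name and punning style are mine, chosen to match the snippets below):

#include <stdint.h>
typedef uint32_t uint ;

/* Classic 0x5F3759DF inverse square root: a magic-number guess on
   the bit pattern, refined by one Newton iteration. */
float ClassicInvSqrt( float x )
{
    uint code  = 0x5F3759DF - ( *(uint*)&x >> 1 ) ;  /* initial guess */
    float root = *(float*)&code ;
    return root*( 1.5f - 0.5f*x*root*root ) ;        /* one Newton step */
}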

Algorithms without refined approximations

The fastest way to calculate sqrt and invsqrt performs only an addition/subtraction and a bit shift on the binary code of the argument reinterpreted as an integer (I prefer unsigned), which makes it very fast but not very precise.

The following algorithm calculates the sqrt of a radicand (mathematically, the radicand is the argument of a square root, cube root, or root of any index) with relative error below 3.475% under normal conditions. When the radicand is zero, the result is 7.933e-20, which is the absolute error in that case. For convenience, I will assume that executing this algorithm takes one sqrtTime.

uint code=( 0x3F769E5C + *(uint*)&radicand )>>1 ;  /* add magic constant, shift to halve the exponent */
return *(float*)&code ;                            /* reinterpret the adjusted bits as a float */

The next one calculates the invsqrt of a radicand with relative error below 3.422%. When the radicand is infinite, the result is 5.239e-20, which is the absolute error in that case. It likewise costs about one sqrtTime.

uint code=( 0xBE6EC85F - *(uint*)&radicand )>>1 ;  /* subtract to negate the exponent, shift to halve it */
return *(float*)&code ;
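To make the two fragments self-contained, they can be wrapped as below (a minimal sketch; the wrapper names and the sampling loop are mine, and the error bounds quoted above are the answer's, not established by this loop alone):

#include <math.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t uint ;

float FastSqrt( float radicand )     /* first fragment */
{
    uint code = ( 0x3F769E5C + *(uint*)&radicand )>>1 ;
    return *(float*)&code ;
}

float FastInvSqrt( float radicand )  /* second fragment */
{
    uint code = ( 0xBE6EC85F - *(uint*)&radicand )>>1 ;
    return *(float*)&code ;
}

int main( void )
{
    float worst = 0 ;
    for( float r = 1e-3f ; r < 1e3f ; r *= 1.001f )  /* sample a range of radicands */
    {
        float err = fabsf( FastSqrt( r )/sqrtf( r ) - 1 ) ;
        if( err > worst ) worst = err ;
    }
    printf( "worst sqrt relative error sampled: %.3f%%\n", 100*worst ) ;
    return 0 ;
}

Note that type punning through pointer casts formally violates strict aliasing; copying between the float and the uint with memcpy is the well-defined alternative.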

Algorithms with one iteration of the Newton method

In practice, relative errors above 1% are inconvenient. Because of this, at least one iteration of the Newton method, adapted to the specific function, is applied to these values.

In the case of the square root, this consists of calculating NewApproach=( OldApproach + Radicand/OldApproach )/2; in the case of the inverse, NewApproach = OldApproach*( 3 - OldApproach*OldApproach*Radicand )/2.

Besides, since √r1 = Sqrt and r2^(-1/2) = Sqrt together imply r1 = 1/r2, the formula NewApproach=( OldApproach + 1/( Radicand*OldApproach ) )/2 can also be applied to the inverse; it is a slight modification of the Newton formula used to refine the square root.
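All three updates drop out of the Newton iteration x_{n+1} = x_n - f(x_n)/f'(x_n) with a suitable f (my summary of the standard derivation):

% For sqrt, take f(x) = x^2 - r:
x_{n+1} = x_n - \frac{x_n^2 - r}{2x_n} = \frac{1}{2}\left(x_n + \frac{r}{x_n}\right)

% For invsqrt, take f(x) = x^{-2} - r:
x_{n+1} = x_n - \frac{x_n^{-2} - r}{-2x_n^{-3}} = \frac{x_n}{2}\left(3 - r\,x_n^2\right)

% And since x_n \approx r^{-1/2} implies 1/(r\,x_n) \approx r^{-1/2},
% the sqrt update can be rewritten for the inverse:
x_{n+1} = \frac{1}{2}\left(x_n + \frac{1}{r\,x_n}\right)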

Thus we have the following sqrt algorithm, with relative error below 6.011e-4 and, according to my tests, a cost of at least 2.8*sqrtTime.

uint code=( 0x3F76CF5E + *(uint*)&radicand )>>1 ;  /* magic constant retuned for the refinement below */
float root = *(float*)&code ;
return 0.5f*( root+radicand/root ) ;               /* one Newton iteration for sqrt */

Next, an invsqrt algorithm with relative error below 1.752e-3 and a cost of at least 2.5*sqrtTime.

uint code=( 0xBE6EB50D - *(uint*)&radicand )>>1 ;
float root = *(float*)&code ;
return root*( 1.5f + root*root*( radicand*-0.5f ) ) ;  /* Newton step: root*( 3 - radicand*root*root )/2 */

And finally an invsqrt algorithm with relative error below 5.895e-4 and a cost of at least 3.3*sqrtTime.

uint code=( 0xBE6EB50D - *(uint*)&radicand )>>1 ;
float root = *(float*)&code ;
return 0.5f*( root+1/( radicand*root ) ) ;  /* averaged step: ( root + 1/(radicand*root) )/2 */

Algorithms with two iterations of the Newton method

Normally more than two iterations are considered to have a poor cost/benefit ratio, so let us stop there. With two iterations it is sometimes possible to simplify operations, and invsqrt may apply one of its two formulas twice or use each once. Using one of each, which comes first makes little difference to performance, but accuracy is greater when the more accurate formula is applied first.

Next we have a sqrt algorithm with relative error below 1.805e-7 which, according to my tests, costs at least 3.9*sqrtTime.

uint code = ( 0x3F76CF5E + *(uint*)&radicand )>>1 ;
float root = *(float*)&code ;
root += radicand/root ;              /* root now holds twice the once-refined value */
return 0.25f*root + radicand/root ;  /* second iteration folded into the doubled root */

Next, an invsqrt algorithm with relative error below 4.598e-6 which, according to my tests, costs at least 4.3*sqrtTime. I do not know why this algorithm is slower than the previous one even though, with one iteration removed from each, it is the faster of the two; it may come from the simplifications available in the other.

uint code = ( 0xBE6EB50D - *(uint*)&radicand )>>1 ;
float root = *(float*)&code ;
root *= 1.5f + root*root*( radicand*=-0.5f ) ;  /* first step; radicand now holds -radicand/2 */
return root*( 1.5f + root*root*radicand ) ;     /* second step reuses the negated half */

Now another invsqrt algorithm. This one has relative error below 5.213e-7 and, according to my tests, costs at least 4.4*sqrtTime.

uint code = ( 0xBE6F02E3 - *(uint*)&radicand )>>1 ;
float root = *(float*)&code ;
root = root+1/( radicand*root ) ;                          /* twice the once-refined value */
return root*( 0.75f + root*root*( radicand*-0.0625f ) ) ;  /* standard step rescaled for the doubled root */

And finally the last invsqrt algorithm with iterations. It has relative error below 1.737e-7 and, according to my tests, costs at least 5.2*sqrtTime. Compared with the previous one it shows an enormous loss of performance for a tiny gain in precision, which makes it the least recommendable.

uint code = ( 0xBE6F02E3 - *(uint*)&radicand )>>1 ;
float root = *(float*)&code ;
root = root+1/( radicand*root ) ;            /* twice the once-refined value */
return 0.25f*( root+4/( radicand*root ) ) ;  /* averaged step rescaled for the doubled root */

Algorithms with tables of magic numbers

Finally, let us consider the possibility of refining the accuracy without floating-point operations, by selecting among distinct formulas when determining the binary code through tables of magic numbers stored in arrays. Tables of various sizes are possible, but tables of 1024 unsigned 32-bit integers seem appropriate.

The following two algorithms, which I developed, take a piece of the argument's binary code and use it as an index into tables of terms close to 0x3F769E5C and 0xBE6EC85F (the constants used in the first two algorithms, one for sqrt and one for invsqrt) and also into tables of factors that adjust for the derivative of the function.

Note that the factors are used in a product of the form Factor*CodePiece/2^32 involving 32-bit integers, which a good compiler easily reduces to efficient machine instructions.
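In C, that product is just a signed "multiply high"; a sketch of the idiom (the helper name is mine):

#include <stdint.h>

/* Factor*CodePiece/2^32: treat the 32-bit factor as a signed Q0.32
   fraction, widen to 64 bits, multiply, keep only the high half. */
static inline int32_t MulHigh( int32_t factor, uint32_t piece )
{
    return (int32_t)( ( (int64_t)factor*(int64_t)piece )>>32 ) ;
}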

First, the sqrt algorithm, with error below 2.008e-7 and a cost of at least 2.5*sqrtTime according to my measurements. It uses the tables SqrtCodeTerms and SqrtCodeFactors.

uint code=*(uint*)&radicand , piece=code&0x00FFFFFF , index=piece>>14 ;  /* 10-bit table index from the code piece */
code=( (uint)(( (int64_t)(int)SqrtCodeFactors[index]*piece )>>32)+SqrtCodeTerms[index]+code )>>1 ;  /* int64_t: long may be only 32 bits */
return *(float*)&code ;

And here is the invsqrt algorithm, with error below 4.097e-7 and a cost of at least 2.7*sqrtTime. I do not know the reason for the performance difference from the previous algorithm. It uses the tables InvSqrtCodeTerms and InvSqrtCodeFactors.

uint code=*(uint*)&radicand , piece=code&0x00FFFFFF , index=piece>>14 ;
code=(uint)(( (int64_t)(int)InvSqrtCodeFactors[index]*piece )>>32)+(( InvSqrtCodeTerms[index]-code )>>1) ;  /* factor correction added after the halving */
return *(float*)&code ;

Finally, these tables were obtained with Maplesoft's Maple: I programmed code that computed, for each index, the most suitable values for the algorithms and printed them. The tables are at the following links.

Sqrt: SqrtCodeTerms and SqrtCodeFactors
Invsqrt: InvSqrtCodeTerms and InvSqrtCodeFactors
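I do not have the original Maple script, so the following generator for the sqrt tables is only a hypothetical reconstruction under my reading of the algorithm: for each index, the ideal adjustment 2*target - code (so that ( adjust + code )>>1 lands on the exact bits) is sampled at both ends of the bucket, and a line Term + Factor*piece/2^32 is fitted through the two samples. The real tables were presumably optimized more carefully.

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef uint32_t uint ;

static uint  BitsOf ( float x ){ uint u ; memcpy( &u, &x, 4 ) ; return u ; }
static float FloatOf( uint u ){ float x ; memcpy( &x, &u, 4 ) ; return x ; }

int main( void )
{
    for( uint index = 0 ; index < 1024 ; ++index )
    {
        /* Radicands in [1,4) sweep every 24-bit piece exactly once:
           pieces below 0x800000 come from [2,4), the rest from [1,2). */
        uint base = ( index < 512 ) ? 0x40000000u : 0x3F000000u ;
        uint lo   = base + ( index<<14 ) ;   /* first code in this bucket */
        uint hi   = lo + 0x3FFF ;            /* last code in this bucket  */
        /* Ideal adjustments at the two endpoints. */
        int64_t a0 = 2LL*BitsOf( sqrtf( FloatOf( lo ) ) ) - lo ;
        int64_t a1 = 2LL*BitsOf( sqrtf( FloatOf( hi ) ) ) - hi ;
        /* Endpoint fit: adjust(piece) ~ term + factor*piece/2^32. */
        int64_t factor = ( a1 - a0 )*( 1LL<<32 )/0x3FFF ;
        int64_t term   = a0 - ( ( factor*(int64_t)( lo & 0x00FFFFFF ) )>>32 ) ;
        printf( "%4u : term 0x%08X , factor 0x%08X\n",
                (unsigned)index, (uint)term, (uint)(int32_t)factor ) ;
    }
    return 0 ;
}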
