Introduction
Algorithms derived from 0x5F3759DF begin by using so-called "magic numbers" to build, with integer arithmetic, the binary code of a float that approximates the solution; the accuracy can then be refined with floating-point operations, normally by applying the Newton method up to three times (most often only once or twice). In addition, in order to perform better, these algorithms no longer treat special cases (arguments close to zero, negative arguments, NaN, etc.); they focus on the more common cases and leave those treatments to the application, according to its needs.
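For reference, the famous algorithm that popularized 0x5F3759DF follows exactly this pattern. A minimal sketch in C (using uint32_t from stdint.h, and memcpy rather than a pointer cast so the type punning is well defined):

#include <stdint.h>
#include <string.h>

/* Classic fast inverse square root: a magic-number first guess
   refined by one Newton iteration. */
float classic_invsqrt( float x )
{
    uint32_t code ;
    memcpy( &code, &x, sizeof code ) ;   /* the float's bits as an integer */
    code = 0x5F3759DF - ( code>>1 ) ;    /* magic-number approximation */
    float y ;
    memcpy( &y, &code, sizeof y ) ;      /* back to float */
    return y*( 1.5f - 0.5f*x*y*y ) ;     /* one Newton iteration */
}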
Algorithms without refined approximations
The fastest way to calculate sqrt and invsqrt performs only an addition/subtraction and a bit shift on the binary code of the argument in integer form (I prefer unsigned), so it is very fast but not very precise.
The following algorithm calculates sqrt of a radicand (mathematically, the radicand is the argument of a square root, a cube root, or a root of any index) with relative error below 3.475% under normal conditions. When the radicand is zero, the result is 7.933e-20, which is also the absolute error in that case. For convenience, I am going to assume that the execution of this algorithm takes one sqrtTime.
uint code=( 0x3F769E5C + *(uint*)&radicand )>>1 ;   /* magic number plus the float's bits, halved */
return *(float*)&code ;                             /* reinterpret the integer bits as a float */
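These two-line snippets assume they appear inside a function such as float FastSqrt( float radicand ), with uint being a 32-bit unsigned integer type. Note also that the *(uint*)& cast is the traditional idiom but violates C's strict-aliasing rule; a sketch of an equivalent, well-defined wrapper (the function name is mine):

#include <stdint.h>
#include <string.h>

float FastSqrt( float radicand )
{
    uint32_t bits ;
    memcpy( &bits, &radicand, sizeof bits ) ;   /* same bit reinterpretation, no aliasing issue */
    uint32_t code = ( 0x3F769E5C + bits )>>1 ;
    float root ;
    memcpy( &root, &code, sizeof root ) ;
    return root ;
}

Optimizing compilers turn these memcpy calls into plain register moves, so nothing is lost in performance.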
The following calculates invsqrt of a radicand with a relative error of less than 3.422%. When the radicand is infinite, the result is 5.239e-20, which is also the absolute error in that case. It likewise costs about one sqrtTime.
uint code=( 0xBE6EC85F - *(uint*)&radicand )>>1 ;   /* magic number minus the float's bits, halved */
return *(float*)&code ;
Algorithms with one iteration of the Newton method
In practice, relative errors above 1% are inconvenient. Because of this, at least one iteration of the Newton method, adapted to the specific function, is applied to these first approximations.
In the case of the square root, this consists of calculating NewApproach = ( OldApproach + Radicand/OldApproach )/2; in the case of the inverse, NewApproach = OldApproach*( 3 - OldApproach*OldApproach*Radicand )/2.
Besides, since invsqrt( r ) = sqrt( 1/r ), the square-root formula can also be applied to the inverse by using 1/Radicand in place of the radicand, giving NewApproach = ( OldApproach + 1/( Radicand*OldApproach ) )/2, which is a mild adaptation of the Newton method for refining the square root.
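Both update rules follow from the standard Newton iteration NewApproach = x - f(x)/f'(x). For the square root, take f(x) = x*x - Radicand, so f'(x) = 2*x and NewApproach = x - ( x*x - Radicand )/( 2*x ) = ( x + Radicand/x )/2. For the inverse square root, take f(x) = 1/(x*x) - Radicand, so f'(x) = -2/(x*x*x) and NewApproach = x + ( 1/(x*x) - Radicand )*x*x*x/2 = x*( 3 - x*x*Radicand )/2.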
Next we have an algorithm for sqrt which has a relative error of less than 6.011e-4 and, according to my tests, costs at least 2.8*sqrtTime.
uint code=( 0x3F76CF5E + *(uint*)&radicand )>>1 ;
float root = *(float*)&code ;
return 0.5f*( root+radicand/root ) ;   /* one Newton iteration for the square root */
Now an algorithm for invsqrt which has a relative error of less than 1.752e-3 and costs at least 2.5*sqrtTime.
uint code=( 0xBE6EB50D - *(uint*)&radicand )>>1 ;
float root = *(float*)&code ;
return root*( 1.5f + root*root*( radicand*-0.5f ) ) ;   /* one Newton iteration for the inverse square root */
And finally an algorithm for invsqrt which has a relative error of less than 5.895e-4 and costs at least 3.3*sqrtTime. It refines with the adapted square-root formula instead.
uint code=( 0xBE6EB50D - *(uint*)&radicand )>>1 ;
float root = *(float*)&code ;
return 0.5f*( root+1/( radicand*root ) ) ;   /* adapted square-root Newton iteration */
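The error bounds quoted in this article can be reproduced with a brute-force sweep of all positive normal floats. A minimal sketch, assuming the approximation under test is wrapped in a function FastInvSqrt (the harness and its names are mine; the sweep covers about two billion values, so expect it to run for a while):

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

extern float FastInvSqrt( float radicand ) ;   /* approximation under test */

int main( void )
{
    float maxError = 0 ;
    for( uint32_t bits = 0x00800000 ; bits < 0x7F800000 ; ++bits )
    {                                          /* every positive finite normal float */
        float x ;
        memcpy( &x, &bits, sizeof x ) ;
        float exact = 1.0f/sqrtf( x ) ;
        float error = fabsf( FastInvSqrt( x )/exact - 1 ) ;
        if( error > maxError ) maxError = error ;
    }
    printf( "max relative error: %e\n", maxError ) ;
    return 0 ;
}

The sqrtTime multiples, on the other hand, come from timing runs and will vary from machine to machine.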
Algorithms with two iterations of Newton method
Normally, it is considered that more than two iterations have high cost/benefit, so let’s stop here with the iterations. With two of them, it is clear that sometimes it is possible to simplify operations and also that invsqrt
may use twice one of the two formulas or use once each. If you use one of each, in terms of performance little difference makes which comes first, but the accuracy is greater when you first use the most accurate.
Next we have an algorithm for sqrt with relative error less than 1.805e-7 which, according to my tests, costs at least 3.9*sqrtTime.
uint code = ( 0x3F76CF5E + *(uint*)&radicand )>>1 ;
float root = *(float*)&code ;
root += radicand/root ;               /* first iteration, result doubled */
return 0.25f*root + radicand/root ;   /* second iteration rescaled for the doubled input */
Next we have an algorithm for invsqrt with relative error less than 4.598e-6 which, according to my tests, costs at least 4.3*sqrtTime. It applies the standard invsqrt formula twice. I do not know why this algorithm is slower than the previous one even though removing one iteration from each makes it the faster of the two; it may be a consequence of the simplifications available in the other.
uint code = ( 0xBE6EB50D - *(uint*)&radicand )>>1 ;
float root = *(float*)&code ;
root *= 1.5f + root*root*( radicand*=-0.5f ) ;   /* first iteration; radicand is rescaled to -radicand/2 in place */
return root*( 1.5f + root*root*radicand ) ;      /* second iteration reuses the rescaled radicand */
Now another algorithm for invsqrt, using one iteration of each formula, the adapted one first. This one has a relative error of less than 5.213e-7 and, according to my tests, costs at least 4.4*sqrtTime.
uint code = ( 0xBE6F02E3 - *(uint*)&radicand )>>1 ;
float root = *(float*)&code ;
root = root+1/( radicand*root ) ;                           /* adapted iteration, result doubled */
return root*( 0.75f + root*root*( radicand*-0.0625f ) ) ;   /* standard iteration rescaled for the doubled input */
And finally, the last algorithm for invsqrt with iterations, which applies the adapted formula twice. This one has relative error less than 1.737e-7 and, according to my tests, costs at least 5.2*sqrtTime. Compared with the previous one it shows a large loss of performance for a tiny gain in precision, which makes this algorithm the least recommended.
uint code = ( 0xBE6F02E3 - *(uint*)&radicand )>>1 ;
float root = *(float*)&code ;
root = root+1/( radicand*root ) ;             /* first adapted iteration, result doubled */
return 0.25f*( root+4/( radicand*root ) ) ;   /* second adapted iteration rescaled for the doubled input */
Algorithms with tables of magic numbers
Finally, let us consider the possibility of refining the accuracy without float operations, by instead selecting among distinct formulas when determining the binary code, via a table of magic numbers stored in an array. Tables of various sizes can be made, but I consider it appropriate to use tables of 1024 unsigned 32-bit integers.
The two algorithms below, which I developed, take a piece of the radicand's binary code and use it as an index both into a table of terms close to 0x3F769E5C and 0xBE6EC85F (the constants used in the first two algorithms, one for sqrt and one for invsqrt) and into a table of factors that adjust for the derivative of the function. Note that the factors are used in a product of the form Factor*CodePiece/2^32 involving 32-bit whole numbers, which good compilers easily reduce to favorable machine instructions.
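In C, that product is the usual widening-multiply idiom that keeps only the high 32 bits of a 64-bit product; a sketch of the helper it reduces to (the name is mine):

#include <stdint.h>

/* Factor*CodePiece/2^32: widening 32x32 multiply keeping the high word.
   On most targets this compiles to a single multiply-high instruction. */
static inline uint32_t MulHigh( int32_t factor, uint32_t piece )
{
    return (uint32_t)( ( (int64_t)factor*(int64_t)piece )>>32 ) ;
}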
First, the algorithm for sqrt, with error below 2.008e-7 and a cost of at least 2.5*sqrtTime according to my measurements. It uses the tables SqrtCodeTerms and SqrtCodeFactors.
uint code=*(uint*)&radicand , piece=code&0x00FFFFFF , index=piece>>14 ;   /* top 10 of the low 24 bits pick the table entry */
code=( (uint)(( (int)SqrtCodeFactors[index]*(long)piece )>>32)+SqrtCodeTerms[index]+code )>>1 ;   /* per-index term plus derivative correction */
return *(float*)&code ;
And the algorithm for invsqrt, with error below 4.097e-7 and a cost of at least 2.7*sqrtTime. I do not know the reason for the difference in performance from the previous algorithm. It makes use of the tables InvSqrtCodeTerms and InvSqrtCodeFactors.
uint code=*(uint*)&radicand , piece=code&0x00FFFFFF , index=piece>>14 ;   /* same indexing as in the sqrt version */
code=(uint)(( (int)InvSqrtCodeFactors[index]*(long)piece )>>32)+(( InvSqrtCodeTerms[index]-code )>>1) ;   /* subtraction instead of addition for the inverse */
return *(float*)&code ;
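For these snippets to compile, the four tables must be visible as 1024-entry arrays of unsigned 32-bit integers; a sketch of the declarations (the values themselves come from the links below):

#include <stdint.h>

extern const uint32_t SqrtCodeTerms[1024] ;
extern const uint32_t SqrtCodeFactors[1024] ;
extern const uint32_t InvSqrtCodeTerms[1024] ;
extern const uint32_t InvSqrtCodeFactors[1024] ;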
Finally, these tables were obtained with Maple, the mathematical program from Maplesoft. In Maple I programmed code that calculated, for each index, the most suitable values for the algorithms, and those values were printed. The tables are at the following links.
Sqrt: SqrtCodeTerms and SqrtCodeFactors
Invsqrt: InvSqrtCodeTerms and InvSqrtCodeFactors