About float and double
These are types defined by the IEEE. Their representation consists of a sign, an exponent and a mantissa. Without going into the details, suppose you have 3 bits to represent the mantissa:
d0 d1 d2
1 1 0
The value of the mantissa is 11.
11? But I only see two bits set, and 11 needs three!
Yes, and the third set bit is there: d3 is implicit for normalized numbers, and it is always set under those conditions. The mantissa above is interpreted as if it were the following number:
d0 d1 d2 d3
1 1 0 1
The exponent can be any number within the allowed range. I do not intend to go into more detail here. Let's call the resulting value e for the exponent and m for the mantissa. The final value is then:
m * 2 ^ e
Since m is a number formed by the bits of the mantissa (scaled to lie between 1 and 2), we can rewrite it as follows (with i being the bit position and q the total number of bits):
m = sum over i of b_i * 2 ^ (i - q)
Then, replacing in the formula above:
sum over i of b_i * 2 ^ (e + i - q)
That is, every floating-point number represented by this scheme is a sum of powers of 2. For mathematical reasons, every (finite) sum of powers of 2 has a finite representation in base 10, but the converse is not true. For example, it is impossible to represent 0.2 as a finite sum of powers of 2; you could write it as a repeating fraction, but repeating fractions are not representable in the format mantissa * base ^ exponent, with the mantissa defined by a finite sum.
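To make the m * 2 ^ e form concrete, here is a minimal Java sketch that pulls the exponent and mantissa out of a double. It assumes a normalized, positive value and the IEEE 754 double layout (52 stored mantissa bits, exponent bias 1023); the class and method names are just for illustration.

```java
final class FloatParts {
    /** Mantissa m in [1, 2) of a normalized, positive double. */
    static double mantissa(double x) {
        long bits = Double.doubleToLongBits(x);
        long fraction = bits & ((1L << 52) - 1);       // the 52 stored mantissa bits
        return 1.0 + fraction / (double) (1L << 52);   // add the implicit leading 1
    }

    /** Unbiased exponent e of a normalized double. */
    static int exponent(double x) {
        long bits = Double.doubleToLongBits(x);
        return (int) ((bits >> 52) & 0x7FF) - 1023;    // 11 exponent bits, bias 1023
    }

    public static void main(String[] args) {
        double x = 524.75;
        double m = mantissa(x);
        int e = exponent(x);
        System.out.println(m + " * 2^" + e);
        // Rebuilding m * 2^e gives back the original value exactly.
        System.out.println(Math.scalb(m, e) == x);     // true
    }
}
```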
Since there are numbers that are not representable, they are approximated by numbers that are close enough. This introduces an error into the calculation.
For each distinct exponent value, there is a distinct error associated with the calculation.
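One way to see the approximation is to ask Java for the exact value a double really stores. The sketch below relies on the BigDecimal(double) constructor, which converts the binary value exactly; the helper name isExact is mine.

```java
import java.math.BigDecimal;

final class ExactValue {
    /** True when the double stores exactly the decimal number written in decimalText. */
    static boolean isExact(double value, String decimalText) {
        return new BigDecimal(value).compareTo(new BigDecimal(decimalText)) == 0;
    }

    public static void main(String[] args) {
        // Prints the exact binary value stored for 0.2, which is slightly off from 0.2.
        System.out.println(new BigDecimal(0.2));
        System.out.println(isExact(0.5, "0.5"));   // true: 0.5 is 2^-1
        System.out.println(isExact(0.2, "0.2"));   // false: 0.2 is not a finite sum of powers of 2
    }
}
```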
DECIMAL in SQL Server
In SQL Server, the DECIMAL type denotes fixed-point numbers. What does this mean? It means that we are working with whole numbers most of the time. The position of the decimal point is fixed, and the accuracy goes down to the least significant digit.
Its general form is:
n * 10 ^ (-s)
Where n is an integer (stored in 5, 9, 13 or 17 bytes, according to the chosen precision) and s is the scale, a non-negative number. Its accuracy goes down to 10 ^ (-s); smaller values cannot be represented, so they must be rounded or truncated.
The error associated with the calculation is always less than 10 ^ (-s), and it is often mitigated by using banker's rounding.
Multiplication and division require special treatment in this representation. The result of a division is rounded or truncated, and multiplication needs a special routine to discard the digits that fall beyond the scale.
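The n * 10 ^ (-s) idea can be sketched with a plain long holding n and a fixed scale of 4. This is not how SQL Server implements DECIMAL internally (its result scale and rounding rules are its own); it is only an illustration of why multiplication has to drop extra digits and why division has to round or truncate. All names here are hypothetical.

```java
final class FixedPoint {
    static final long FACTOR = 10_000L;   // 10^4, i.e. a fixed scale of 4 decimal places

    /** Multiplies two scale-4 values; the raw product has scale 8, so 4 digits are dropped. */
    static long multiply(long a, long b) {
        long raw = Math.multiplyExact(a, b);
        long quotient = raw / FACTOR;                 // truncates toward zero
        long remainder = Math.abs(raw % FACTOR);      // the digits being discarded
        if (remainder * 2 >= FACTOR) {
            quotient += (raw < 0) ? -1 : 1;           // round half away from zero
        }
        return quotient;
    }

    /** Divides two scale-4 values, truncating anything below 10^-4. */
    static long divide(long a, long b) {
        return Math.multiplyExact(a, FACTOR) / b;     // scale up first to keep 4 places
    }

    public static void main(String[] args) {
        long price = 5_247_500L;   // 524.7500
        long rate = 3_000L;        // 0.3000
        System.out.println(multiply(price, rate));    // 1574250, i.e. 157.4250
        System.out.println(divide(10_000L, 30_000L)); // 3333, i.e. 0.3333 (1/3 truncated)
    }
}
```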
BigDecimal in Java
If you are only interested in doing the calculations, you do not need to know much more than how to use the methods of this class.
Generally speaking, it accepts input of arbitrary size with absurdly high precision.
Under the covers, it usually holds a BigInteger and a scale. It has the same mathematical representation as DECIMAL in SQL Server:
n * 10 ^ (-s)
Here, however, n is a variable-size integer (the BigInteger mentioned earlier).
The associated error is less than 10 ^ (-s), and the value of s can be set at run time to be large enough. Banker's rounding further mitigates the error.
Note that here we have a Java class performing operations that are not directly supported by the ALU, which costs additional processing time and memory.
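As a small illustration of the n and s described above, BigDecimal exposes them through unscaledValue() and scale(), and a division lets you pick the scale and the rounding mode (HALF_EVEN is banker's rounding). The helper third is just a name I made up for this sketch.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

final class BigDecimalParts {
    /** 1/3 rounded to the given number of decimal places with banker's rounding. */
    static BigDecimal third(int scale) {
        return BigDecimal.ONE.divide(new BigDecimal("3"), scale, RoundingMode.HALF_EVEN);
    }

    public static void main(String[] args) {
        BigDecimal price = new BigDecimal("524.7500");
        System.out.println(price.unscaledValue());   // 5247500 (the integer n)
        System.out.println(price.scale());           // 4 (the scale s)

        System.out.println(third(6));                // 0.333333

        // Banker's rounding sends ties to the even neighbour.
        System.out.println(new BigDecimal("2.5").setScale(0, RoundingMode.HALF_EVEN)); // 2
        System.out.println(new BigDecimal("3.5").setScale(0, RoundingMode.HALF_EVEN)); // 4
    }
}
```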
Decimal in C#
I don’t have much to say about it for lack of experience. But from what I read, it looks a lot like the DECIMAL of SQL Server.
How to use each?
If you need precision in the calculation down to a certain scale, regardless of the magnitude of the value being calculated, you should use BigDecimal or an equivalent. In a sales system that I support, we use BigDecimal with a precision ranging from 6 to 30 decimal places (usually 30 for divisions, 6 for all other operations). The tax values we obtain have never been as accurate as they became after migrating 100% of the calculation to these specifications.
float and double are faster, more efficient and more economical than Java's BigDecimal; I can't say much about C#'s Decimal, but I believe its multiplication is much lighter. Modern processors typically have a dedicated floating-point arithmetic unit. With these types, the error incurred is proportional to the magnitude of the value (its most significant bit). This means that if a value of 1 admits an error of 2 ^ -4, then a value of 0.25 admits an error of 2 ^ -6.
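The proportionality can be checked with Math.ulp, which returns the spacing between a value and its next representable neighbour. Real float has 23 stored mantissa bits rather than the small widths used in this answer, so the absolute numbers differ, but the ratio between 1 and 0.25 is the same factor of 4. The helper name spacingRatio is only for this sketch.

```java
final class UlpDemo {
    /** Ratio between the representable-value spacing at a and at b. */
    static float spacingRatio(float a, float b) {
        return Math.ulp(a) / Math.ulp(b);
    }

    public static void main(String[] args) {
        System.out.println(Math.ulp(1.0f));              // spacing around 1
        System.out.println(Math.ulp(0.25f));             // spacing around 0.25, four times finer
        System.out.println(spacingRatio(1.0f, 0.25f));   // 4.0
    }
}
```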
Calculation of 30% tax
Let's work through a tax calculation to illustrate the error associated with each kind of data type.
Let's say we sell Persian cats. The tax on them is 30%. Knowing that I sold 72 cats at 524.7500 each, how much tax should I pay to the government?
Applying 30% means multiplying by 0.3.
Java and BigDecimal
A 30% tax is 30 with the decimal point shifted two places to the left (or 3 shifted one place to the left). Since it is an integer and no division was involved, no precision is lost. I multiply that by 72, an integer that BigDecimal can represent without losing precision. 524.7500 is equivalent to 52475 shifted two places to the left. In all, after the multiplications, we have an exact, unrounded integer value shifted four places to the left.
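In code, the reasoning above might look like the sketch below. I write the rate as 0.30 and the price as 524.75 (two decimal places each) so that the scales add up to the four places mentioned; the class and method names are mine.

```java
import java.math.BigDecimal;

final class TaxBigDecimal {
    /** 30% of 72 cats at 524.75 each, with no rounding anywhere. */
    static BigDecimal tax() {
        BigDecimal rate = new BigDecimal("0.30");     // 30 shifted two places
        BigDecimal quantity = BigDecimal.valueOf(72); // plain integer, scale 0
        BigDecimal price = new BigDecimal("524.75");  // 52475 shifted two places
        return rate.multiply(quantity).multiply(price);
    }

    public static void main(String[] args) {
        BigDecimal tax = tax();
        System.out.println(tax);                 // 11334.6000
        System.out.println(tax.unscaledValue()); // 113346000
        System.out.println(tax.scale());         // 4
    }
}
```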
Calculation with float
524.75 is represented by the following sum of powers of 2:
512 + 8 + 4 + 0.5 + 0.25
Or else:
2^9 + 2^3 + 2^2 + 2^-1 + 2^-2
We can represent it without loss if the mantissa has 11 bits.
Why 11 bits of mantissa?
I was rereading this post and wondered: "why 11? Shouldn't it be 12?" After all, we are working with digits from position 9 down to position -2. That makes 12 positions! More specifically, ordered by significance: 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2.
What I myself had forgotten is that the most significant digit of a normalized number has the implicit value 1. This means that the digit at position 9 does not need to be stored, only those from position 8 down to position -2.
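The count of 11 can be checked against a real float: take the 23 stored fraction bits and see how many are needed before the trailing zeros begin. The method name storedBitsNeeded is hypothetical, and the sketch assumes a normalized value.

```java
final class MantissaBits {
    /** Number of stored mantissa bits a normalized float actually needs (the implicit 1 is not counted). */
    static int storedBitsNeeded(float x) {
        int fraction = Float.floatToIntBits(x) & 0x7FFFFF;   // the 23 stored fraction bits
        return fraction == 0 ? 0 : 23 - Integer.numberOfTrailingZeros(fraction);
    }

    public static void main(String[] args) {
        // 524.75 = 1.00000110011 (binary) * 2^9: eleven bits after the implicit leading 1.
        System.out.println(storedBitsNeeded(524.75f));   // 11
        System.out.println(storedBitsNeeded(0.5f));      // 0: only the implicit bit is needed
    }
}
```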
We need to multiply this value by 72, which is an integer value so we trivially know that it is a sum of powers of 2.
The result is then multiplied by 0.3. 0.3 has no finite representation as a sum of powers of 2, so it is represented by a number that is close enough but not exact. With 11 bits for the mantissa, the precision of the representation of 0.3 is 2 ^ -13, which means the represented value carries an error on the order of 2 ^ -13.
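The same calculation with Java's float shows the effect. Note that a real float has 23 stored mantissa bits, not the 11 assumed above, so the error is smaller than 2 ^ -13, but it is still there. The sketch uses BigDecimal only to print the exact values that the floats hold; the names are mine.

```java
import java.math.BigDecimal;

final class TaxFloat {
    /** 72 cats at 524.75 taxed at 30%, computed entirely in float. */
    static float taxWithFloat() {
        return 524.75f * 72 * 0.3f;
    }

    public static void main(String[] args) {
        // Exact value stored for 0.3f, and how far it is from 0.3.
        BigDecimal storedRate = new BigDecimal(0.3f);
        System.out.println(storedRate);
        System.out.println(storedRate.subtract(new BigDecimal("0.3")));

        // Exact value of the float result, and how far it is from 11334.6.
        BigDecimal floatTax = new BigDecimal(taxWithFloat());
        System.out.println(floatTax);
        System.out.println(floatTax.subtract(new BigDecimal("11334.6")));
    }
}
```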
See how a simple calculation introduced an error in one representation but not in the other?
Infinite Series floating point
A YouTube channel worth following is Infinite Series. Last week they uploaded a video about how floating-point computation works, and one of the examples given is that 0.1 + 0.1 != 0.2, because 0.1 in binary is a repeating fraction and therefore cannot be represented in scientific notation (with a finite mantissa) in base 2. This is one of the main arguments against using floating-point arithmetic for anything involving money.
To emphasize: Java's BigDecimal and many other schemes use integers of arbitrary size to represent the mantissa, then place the decimal point anywhere in that number. Although finite, this scheme allows an arbitrary precision of p decimal places (hence an error of 5 * 10 ^ -(p+1) when making calculations), so we can trust the calculation of these amounts (i.e., 0.1 + 0.1 = 0.2 when p >= 1) at the chosen precision.
Computerphile floating point
The Computerphile channel ("computadófilo", in a free translation) explains that a floating-point number is just scientific notation in base 2, with limited representation. And that this is great for representing quantities as large as the size of the universe and as small as the distance between an atomic nucleus and the orbit of an electron.
In these cases, scientific notation is beneficial because it keeps the significant digits of the numbers, and the rounding errors of the calculation stay within what is expected. The rounding errors of these calculations can sometimes be smaller than the inherent error of measuring certain quantities (my own addendum; the channel does not comment on this, but it is true).
As for errors in calculations, the example the presenter gives is the rendering of 3D graphics in a game. If the rendering of a graphical element happens to be off by a hundredth or a thousandth of a pixel, the error is acceptable and the player will not notice it.
At one point in the video, the presenter gives an example of a financial calculation in floating point. Adding 0.1 and 0.2 gives a calculation error that is unacceptable for financial applications. So he suggests working with integers (in units of pennies or of a fraction of a penny) or else using the decimal type that comes with your programming language.
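A quick Java sketch of the three options mentioned in the video: plain double, a decimal type, and integers counting pennies. The method names are just for illustration.

```java
import java.math.BigDecimal;

final class MoneySum {
    /** 0.1 + 0.2 compared with 0.3 in binary floating point. */
    static boolean doubleSumIsExact() {
        return 0.1 + 0.2 == 0.3;
    }

    /** The same sum with a decimal type. */
    static boolean decimalSumIsExact() {
        BigDecimal sum = new BigDecimal("0.1").add(new BigDecimal("0.2"));
        return sum.compareTo(new BigDecimal("0.3")) == 0;
    }

    /** The same sum with integers holding pennies (10 + 20 = 30). */
    static boolean centsSumIsExact() {
        long tenCents = 10;
        long twentyCents = 20;
        return tenCents + twentyCents == 30;
    }

    public static void main(String[] args) {
        System.out.println(doubleSumIsExact());    // false
        System.out.println(decimalSumIsExact());   // true
        System.out.println(centsSumIsExact());     // true
    }
}
```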
DECIMAL is normally seen not as floating point but as fixed point. For example, in SQL Server, this is how it works.
– Jefferson Quesado
Take a look: https://answall.com/questions/11340/que-datatypes-double-float-ou-decimal-eu-should-usar-para-representmo
– Bsalvo
I think the DECIMAL type is language-dependent :/
– Jefferson Quesado
@Brunocastro could it be a dup?
– Marconi
Marconi, you asked for a language-independent answer, and the question linked by @Brunocastro is about C#, so the scopes are distinct
– Jefferson Quesado
@Jeffersonquesado yes, but it should be the same for both languages. Everywhere I look I see the same calculations.
– Marconi
In Java and C, I always try to do monetary calculations with fixed point. In Java, I use BigDecimal for that; in C, it is code that my college programming marathon team used at the time. So we have different references
– Jefferson Quesado
Isn't the difference between float and double the number of bits used? Generally float uses 32 bits in IEEE 754 format, while double uses 64 bits. That is, double has a larger range and more precision.
– Woss
@Jeffersonquesado So there will be a specific one for each language? If so, I will delete the question!
– Marconi
@Andersoncarloswoss don't you want to try an answer?
– Marconi
Decimal is language-dependent and each language has its own version. C# has BigInteger and BigRational. In C++, specific libraries are required for this, such as these
– Isac
So, I said "generally" precisely because I'm not sure whether it's like this for all languages. I need to check. And at the moment I am on my cell phone, so it is impossible to answer from it. Maybe later I'll write something, if no one has answered.
– Woss
@Marconi you don't need to remove the question. Someone with more knowledge of the subject can answer it, or flag it as off topic, or flag it as too broad.
– Jefferson Quesado
@Jeffersonquesado no worries, thanks for the tips :)
– Marconi
@Marconi interesting source -> http://www.macoratti.net/12/c_num1.htm
– Nosredna
@Nosredna thank you.
– Marconi