What is the correct way to use the float, double and decimal types?

Since college I have never understood the real difference between the types DOUBLE and FLOAT, and then I came across the type DECIMAL, which also handles real values.

About the type DECIMAL, I found the following statement:

For any calculation involving money or finance, the type Decimal should always be used. Only this type has the precision suitable to avoid critical rounding errors.

Why?

The Decimal data type is simply a floating-point type that is represented internally in base 10 instead of base 2.

In what situations should each of these types be used? Could you explain why the decimal type is ideal when money is involved? What is the difference between them?

  • 1

DECIMAL is normally seen not as floating point, but as fixed point. For example, that is how it works in SQL Server.

  • At a glance: https://answall.com/questions/11340/que-datatypes-double-float-ou-decimal-eu-should-usar-para-representmo

I think the type DECIMAL is language-dependent :/

@Brunocastro could this be a dup?

Marconi, you asked for something language-independent; the question linked by @Brunocastro is about C#, so the scopes are distinct.

@Jeffersonquesado yes, but it should be the same for both languages. Everywhere I look I see the same calculations.

In Java and C, I always try to do monetary calculations with fixed point. In Java I use BigDecimal for that; in C it is code that my college programming marathon team used at the time. So we have different references.

Isn't the difference between float and double just the number of bits used? Generally float uses 32 bits in IEEE 754 format, while double uses 64 bits. That is, double has greater range and more precision.

@Jeffersonquesado So there will be a specific one for each language? If so, I will delete the question!

@Andersoncarloswoss don't you want to try an answer?

  • 3

Decimal is language-dependent and each language has its own version. C# has BigInteger and BigRational. In C++ specific libraries are required for this, such as these

  • 2

So, I said "generally" because I'm not sure it's like this for all languages. I need to check. And at the moment I'm on my phone, so it's impossible to answer from it. Maybe I'll write something later, if no one has answered.

@Marconi there is no need to remove the question. Someone with more knowledge of the subject can answer, or mark it as off-topic, or mark it as too broad.

  • 1

@Jeffersonquesado no worries, thanks for the tips :)

  • 1

@Marconi interesting source -> http://www.macoratti.net/12/c_num1.htm

  • @Nosredna thank you.


2 answers



Non-integer numbers can be implemented in binary or decimal base. The form can be floating point or fixed point. Of course other forms are possible too.

With fixed point, the type itself indicates how many decimal places of precision there are. Fixed point is less common. In general databases work with fixed point, but they also have floating-point types. Roughly, we can say that fixed point is like char and floating point is like varchar, except that, instead of saying how many characters it has, it says how many decimal places should be considered.

There are cases where you always need to have exactly 2 digits, i.e., only cents; it should not have 1 and should not have 3. If you divide 1 by 8, it gives 0.13 (the rounding method may vary). Floating point gives 0.125. If you divide by 10, fixed point gives 0.10, while floating point gives 0.1, and it is the programmer's problem to normalize the scale if so desired.
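Sketching that contrast in Python (an illustration only, since the question is language-independent; the `decimal` module plays the role of a fixed-point money type here, and `fixed_div` is a hypothetical helper):

```python
from decimal import Decimal, ROUND_HALF_UP

# Fixed point: always keep exactly 2 decimal places, like a money column.
def fixed_div(a: str, b: int) -> Decimal:
    return (Decimal(a) / b).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

print(fixed_div("1", 8))   # 0.13 -- rounded to the fixed scale
print(fixed_div("1", 10))  # 0.10 -- the trailing zero is kept

# Binary floating point: the scale is whatever falls out of the division.
print(1 / 8)   # 0.125
print(1 / 10)  # 0.1
```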

float and double

The difference between float and double is the precision, that is, how close it can get to the real value, how many digits it can support.

These types are called binary floating point.

The float normally has 32 bits to represent the exponent and mantissa, in addition to the sign bit. It can represent many numbers, but by its binary nature it cannot represent all of them, so it represents whatever is closest to what is desired. It has 24 binary digits of precision and so is called single precision.

The double usually has 64 bits, so you get much more precision, but still not exactness, since the form of representation is also binary. It has 53 binary digits of precision and is called double precision.
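A quick way to see the precision gap between the two (an illustrative Python sketch; Python floats are doubles, and `struct` is used here only to force a value through the 32-bit format):

```python
import struct

# Round-trip 0.1 through the 32-bit format to see single precision's limit.
f32 = struct.unpack("f", struct.pack("f", 0.1))[0]
print(f32)         # 0.10000000149011612
print(f32 == 0.1)  # False: the double is closer to 1/10 than the float32
```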

There are standardized types, more rarely implemented, that are even larger: 128 bits, or quadruple precision (113 digits), and 256 bits, or octuple precision (237 digits). There is also a 16-bit half-precision type with 11 digits. All have the same problem of inexactness.

These types are governed by the IEEE 754 standard. There are technologies that do not follow it, but that is rare. There are cases where the double has 80 bits, outside the standard.

Calculations can be performed in hardware or software; obviously the first is much faster. Even in software it is very fast, since it operates in the way that is natural for computing, i.e., binary.

Many calculations need precision, but not exactness. Since that is what the computer gets along with best, use it. If it is something scientific with heavy calculations, the performance makes a lot of difference. The same can be said of computer graphics or games. There will be rounding anyway, so the lack of exactness does not harm anything.

0.2 + 0.8 is different from 1.0?!?!?!?!

But if you are going to make an equality comparison, it already gets complicated: 1 can compare as different from 0.2 + 0.8. Crazy, right? This occurs because of the way the number is represented internally. On the other hand, 1 + 1 will always be 2, since there is no problem of inexactness in the integer part.
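Which sums actually trip the comparison depends on how the individual rounding errors happen to combine, but the classic `0.1 + 0.2` case fails reliably with IEEE 754 doubles, as this Python sketch shows:

```python
# 0.1, 0.2 and 0.3 are all stored inexactly in binary, and in this sum the
# rounding errors fail to cancel:
print(0.1 + 0.2)         # 0.30000000000000004
print(0.1 + 0.2 == 0.3)  # False

# Integers within the representable range have no such problem:
print(1 + 1 == 2)        # True
```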

If the type cannot represent all the numbers a human is normally used to dealing with, rounding will be needed, and this can produce a tiny difference here or there, i.e., it can change a single penny. Then you multiply that by 1 million and 1 cent of error turns into 10 thousand reais, dollars, etc. That can't happen, right? In accounting, even 1 cent of error cannot exist; the books would no longer balance. A bank balance cannot have it. There is a story that in the past a bank programmer, seeing this, began collecting these penny differences into his own account and became a millionaire (even if it is just a tall tale, it still illustrates the problem).

There is already an answer in "What is the meaning of a double-precision variable?" with more details of how it works. Wikipedia and other links contained there and in other linked answers give more details for the curious.

decimal

The type decimal has exactness: it is about having the exact number that is intended. It indicates that the number conforms to what is expected. It is called decimal because it uses base 10, not binary like the previous ones.

Each technology implements it in a somewhat different way. It is common to store the integer and decimal parts separately as integers, or to store everything together in a single integer and determine a scale, i.e., where the point floats, how many places it should assume; in general it is an integer divided by 1, 10, 100, 1000, etc.
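The "integer plus scale" scheme described above can be sketched like this (a hypothetical toy type for illustration, not any real library's implementation; it assumes scale >= 1 and non-negative values):

```python
from dataclasses import dataclass

# Toy "integer plus scale" type: the value is coef * 10**(-scale),
# and everything is stored and computed with exact integers.
@dataclass(frozen=True)
class Fixed:
    coef: int   # all the digits as a single integer
    scale: int  # how many of those digits sit after the point

    def __add__(self, other: "Fixed") -> "Fixed":
        # Align the scales with exact integer math, then add coefficients.
        s = max(self.scale, other.scale)
        a = self.coef * 10 ** (s - self.scale)
        b = other.coef * 10 ** (s - other.scale)
        return Fixed(a + b, s)

    def __str__(self) -> str:
        digits = str(self.coef).rjust(self.scale + 1, "0")
        return digits[:-self.scale] + "." + digits[-self.scale:]

price = Fixed(52475, 2)  # means 524.75
cents = Fixed(25, 2)     # means 0.25
print(price + cents)     # 525.00 -- no binary rounding anywhere
```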

The Decimal, which I will call pure, usually has 128 bits of precision (34 digits), but this varies when it is outside the standard, which is not so unusual.

Some technologies implement a SmallDecimal with 64 bits (16 digits) and a TinyDecimal with 32 bits (7 digits).

It is also common to have a BigDecimal with more than 128 bits, in general unlimited. There are technologies that only have this decimal type. The names of the types vary in each technology.

Being decimal, the performance is not the best, but it is far from a tragedy. In general it is not a problem for working with monetary values, and the calculations involved are usually simple compared to scientific or CGI ones. The calculations are done with the processor's integer instructions, which are fast, but several steps of number normalization are needed, some rounding has to be provided, and often your code needs to do some extra arithmetic, so it ends up slower, but nothing critical.

Rounding

Treating rounding when the requirement is exactness is no easy task; each calculation may require a different policy. Think: if you divide 1 by 3, you get a repeating decimal, which cannot be represented exactly, so you need to decide where it stops. Should it keep 2 places? Then you'd produce 3 installments of 0.33? Okay, but adding those up gives 0.99. So what do you do with the missing penny? Your code has to have a policy for that too. It may be that discarding it is justifiable, or it may be that one of the installments receives the leftover penny and is worth 0.34. But which one? Do you need to generate a separate entry to keep track of having done this? Is it the first one? The last one? And if several cents are left over, is it still the first or the last? Or should you distribute them? How? All of this is the programmer's concern.
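One possible policy, purely as an illustration (not *the* rule; `split` is a hypothetical helper), is to truncate every installment and let the last one absorb the leftover cents, sketched here with Python's `decimal` module:

```python
from decimal import Decimal, ROUND_DOWN

# Illustrative policy: truncate each installment to 2 places and give
# whatever cents are left over to the last installment.
def split(total: Decimal, parts: int) -> list[Decimal]:
    base = (total / parts).quantize(Decimal("0.01"), rounding=ROUND_DOWN)
    amounts = [base] * (parts - 1)
    amounts.append(total - base * (parts - 1))  # last one absorbs the rest
    return amounts

print(split(Decimal("1.00"), 3))  # [Decimal('0.33'), Decimal('0.33'), Decimal('0.34')]
```

The installments still sum to exactly the original total, which is the property the paragraph above is worried about.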

Yeah, it’s all wrong out there

A huge amount of software does its calculations wrong, not only because it uses binary floating point when it should use decimal, but also because it does not know how to handle decimal rounding. It is very difficult to deal with, and many programs violate legislation and cause harm.

In general people use the term precise mathematics, but actually the correct term is exact mathematics.

It is not used only for money, but that is the best example of use. Wherever you want exactness, this is the type to go with. But a beginner may think that exactness is always good and abandon binary floating point, while a speed fanatic may stick only with the latter. The decision should not go through this; use whatever is most suitable for the problem.

Follow the tag to see several examples here.

This type is also governed by IEEE 754. See table taken from Wikipedia

[Table: IEEE 754 floating-point formats]

Note that the official names only indicate how many bits it has.

Other encodings

Nothing prevents you from using other encodings, but today it is not common. One that was widely used in the past was BCD. Fractional implementations are also used where the most important thing is to represent the fraction itself exactly, meaning you don't want 0.333333333333, you want 1/3.
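The fractional idea can be seen with Python's `fractions` module (an illustration only):

```python
from fractions import Fraction

# A fractional representation keeps 1/3 itself, instead of a truncated
# 0.333...: three thirds sum back to exactly 1.
third = Fraction(1, 3)
print(third)                       # 1/3
print(third + third + third == 1)  # True
```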

You said that float and double have 24 and 54 (wouldn't it be 53 here?) digits of precision, respectively. I read on a site that float has 7 and double has 15-16 digits of decimal precision. Decimal is between 28-29 digits (only with the calculation I did below for decimal I managed to reach 34, which I found strange). I calculated: log(2^53) ÷ log(10) for double and log(2^24) ÷ log(10) for float. What did I read that is incorrect, @bigown? Could practical examples be added?

Yes, it was a typo. You're talking about the decimal digits, which I didn't include.
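The conversion described in the comment above can be checked directly (log10 of 2^bits turns binary digits of mantissa into decimal digits; a quick illustrative snippet):

```python
import math

# log10(2**bits) converts binary digits of mantissa into decimal digits:
# 24 bits -> ~7.2 decimal digits (float), 53 bits -> ~15.95 (double).
print(math.log10(2 ** 24))  # ~7.22
print(math.log10(2 ** 53))  # ~15.95
```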


About float and double

These are types defined by the IEEE. Their representation is given by sign, exponent and mantissa. Without going into details, if you have 3 digits to represent the mantissa:

d0 d1 d2
 1  1  0

The value of the mantissa is 11.

11? But I only saw two bits set; 11 needs 3!

Yes, and the third bit is there: d3 is implicit for normalized numbers, and that bit is always set under these conditions. The mantissa above is interpreted as if it were the following number:

d0 d1 d2 d3
 1  1  0  1

The exponent resolves to some number within its range; I do not intend to go into more detail here. Let's assume the resulting value is e for the exponent and m for the mantissa. The final value then is:

m * 2 ^ e

Since m is a number formed by the bits of the mantissa (shifted so that it lies between 1 and 2), we can rewrite it thus (with i being the bit position and q the total number of bits):

m = Σ_i b_i * 2 ^ (i - q)

Then, replacing in the formula above:

Σ_i b_i * 2 ^ (e + i - q)

That is, every floating point number represented by this scheme is a sum of powers of 2. Due to mathematical characteristics, every (finite) sum of powers of 2 has a finite representation in base 10, but the converse is not true. For example, it is impossible to represent 0.2 as a finite sum of powers of 2; you could represent it as a repeating binary expansion, yes, but repeating expansions cannot be represented in the format mantissa * base ^ exponent with the mantissa defined by a finite sum.

Since there are numbers that are not representable, they are approximated by good enough numbers. This generates a calculation error.

For each distinct exponent value, there is a distinct error associated with the calculation.
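In Python, `Fraction` can recover exactly which "good enough" number was stored in place of 0.2 (an illustrative sketch; `Fraction(float)` converts the double as stored, bit for bit):

```python
from fractions import Fraction

# Fraction(0.2) exposes the exact power-of-2 sum chosen to approximate 0.2.
stored = Fraction(0.2)
print(stored)                          # 3602879701896397/18014398509481984
print(stored == Fraction(1, 5))        # False: only an approximation
print(float(stored - Fraction(1, 5)))  # the tiny representation error, ~1.1e-17
```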

DECIMAL in SQL Server

In SQL Server, the DECIMAL type indicates fixed-point numbers. What does this mean? It means that we are working with whole numbers most of the time. The decimal point is fixed, and its precision goes down to the least significant digit.

Its general form is:

n * 10 ^ (-s)

Where n is an integer (32, 64, 128 or 256 bits, according to the chosen precision; reference), and s is the scale, a positive number. Its precision goes down to 10 ^ (-s); smaller values cannot be represented, and therefore must be rounded or truncated.

The error associated with a calculation is always less than 10 ^ (-s), and is often mitigated using banker's rounding.

Multiplication and division require special treatment in this format. Division will have its result rounded or truncated, and multiplication needs a special routine to discard the irrelevant digits.
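A minimal sketch of that special treatment, assuming a fixed scale of 2 and round-half-up for positive values (an illustration of the idea, not SQL Server's actual routine; `fmul` and `fdiv` are hypothetical names):

```python
# Fixed-point arithmetic at scale s = 2: the value is n * 10**(-2),
# stored as the integer n (so 500 means 5.00, 25 means 0.25).
S = 100  # 10**2

def fmul(a: int, b: int) -> int:
    # The raw product has scale 4; divide by S to get back to scale 2,
    # rounding half-up (valid for positive values only in this sketch).
    return (a * b + S // 2) // S

def fdiv(a: int, b: int) -> int:
    # Pre-scale the dividend so the quotient still has scale 2, then round.
    return (a * S + b // 2) // b

print(fmul(500, 25))   # 125 -> 1.25  (5.00 * 0.25)
print(fdiv(100, 300))  # 33  -> 0.33  (1.00 / 3.00, rounded)
```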

BigDecimal in Java

If you are only interested in calculating, you do not need to know much more than how to use the methods of this class.

Generally speaking, it allows input of arbitrary size with absurdly high precision.

Under the covers, it usually contains a BigInteger and a scale. It has the same mathematical representation as the DECIMAL of SQL Server:

n * 10 ^ (-s)

Except that here n is an arbitrary-size integer (the BigInteger previously mentioned).

The associated error is less than 10 ^ (-s), and it is possible to set the value of s at runtime to be as large as necessary. Banker's rounding further mitigates the error.

Note that here we have a Java class doing operations that are not directly supported by the ALU, which consumes additional processing and memory.

Decimal in C#

I don't have much to say about it for lack of experience, but from what I have read, it looks a lot like the DECIMAL of SQL Server.

How to use each?

If you need precision in the calculation up to a certain scale, regardless of the value being calculated, you are in the case for a BigDecimal or equivalent. In a sales system that I maintain, we use BigDecimal with a precision ranging from 6 to 30 digits (usually 30 for divisions, 6 for all other operations). Our tax figures were never more accurate than after migrating 100% of the calculation to these specifications.

float and double are faster, more efficient and more economical than Java's BigDecimal; I can't say much about C#'s Decimal, but I believe its multiplication is much lighter. Typically, modern processors have a floating-point arithmetic core. Using this type of variable, the error incurred is proportional to the most significant value of the mantissa. This means that if a value of 1 admits an error of 2 ^ -4, then a value of 0.25 admits an error of 2 ^ -6.

Calculation of 30% tax

Let's use a tax-calculation example to illustrate the error associated with the calculation in both kinds of data type.

Let's say we sell Persian cats. The tax on them is 30%. Knowing that I sold 72 cats at 524.7500 each, how much tax should I pay to the government?

Applying 30% means multiplying by 0.3.

Java and BigDecimal

30% is 30 with the point shifted 2 places to the left (or 3 shifted one place). As it is an integer, and there was no division, there was no loss of precision. I multiply that by 72, an integer that I can represent without losing precision with BigDecimal. 524.7500 is equivalent to 52475 with the point moved two places to the left. In all, after the multiplications, we will have an exact, unrounded integer value with the point shifted four places to the left.
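The same chain can be reproduced with a decimal type; here is a Python sketch using the `decimal` module in place of BigDecimal (an illustration, not the Java code):

```python
from decimal import Decimal

# Every operand is exactly representable in base 10, so no rounding occurs:
# the scales of the factors simply add up.
price = Decimal("524.7500")
tax = Decimal("0.30") * 72 * price
print(tax)  # 11334.600000 -- exact
```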

Calculation with float

524.75 is represented by the following sum of powers of 2:

512 + 8 + 4 + 0.5 + 0.25

Or else:

2^9 + 2^3 + 2^2 + 2^-1 + 2^-2

We can represent without data loss if there are 11 bits to the mantissa.

Why 11 digits of mantissa?

I was rereading this post and wondered: "why 11? Shouldn't it be 12?" After all, we are working with digits from position 9 to position -2. That gives 12 places! More specifically, ordered by significance: 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2.

What I myself had forgotten is that the most significant digit of normalized (non-denormal) numbers has an implicit value of 1. This means it is not necessary to store the digit at position 9, only those from position 8 down to position -2.

We need to multiply this value by 72, which is an integer value so we trivially know that it is a sum of powers of 2.

The result is then multiplied by 0.3. 0.3 has no finite representation as a sum of powers of 2, so it will be represented by a number close enough, but not exact. If you have 11 bits for the mantissa, the precision of the number representing 0.3 is 2 ^ -13, which means the representation will carry an error on the order of 2 ^ -13.
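That error bound can be checked with exact rational arithmetic (an illustrative Python sketch; since 0.3 lies between 2^-2 and 2^-1, one ulp at 1 implicit + 11 stored bits is 2^(-2-11) = 2^-13):

```python
from fractions import Fraction

# Round 0.3 to the nearest multiple of 2**-13 and measure the error exactly.
exact = Fraction(3, 10)
approx = Fraction(round(exact * 2 ** 13), 2 ** 13)
error = abs(approx - exact)
print(float(approx))                 # 0.300048828125
print(error < Fraction(1, 2 ** 13))  # True: error within one ulp
```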

See how, in a simple calculation, it was possible to introduce an error in one representation but not in the other?

Infinite Series floating point

If there is a YouTube channel worth following, it is Infinite Series. Last week they put up a video talking about how floating-point computation works, and one of the examples given is that 0.1 + 0.1 != 0.2, because 0.1 in binary is a repeating expansion and therefore cannot be represented in scientific notation (using a finite mantissa) in base 2. This is one of the main arguments against using floating-point arithmetic to solve problems involving money.

Emphasizing here: Java's BigDecimal and many other schemes use integers of arbitrary size to represent mantissas, then place the decimal point at an arbitrary position in that number. Although finite, since this calculation scheme allows an arbitrary precision of p places (hence an error of 5 * 10 ^ -(p+1) when calculating), we are certain in the calculation of these amounts (i.e., 0.1 + 0.1 = 0.2 when p >= 1) at the chosen precision.

Computerphile floating point

The Computerphile channel comments that a floating-point number is just scientific notation in base 2, with a representation limit. And that this is great for representing quantities as large as the size of the universe and as small as the distance between the atomic nucleus and the orbit of an electron.

In these cases, scientific notation is beneficial because it can represent the numbers with the right significance, and the rounding errors of the calculation stay within what is expected. The rounding errors of these calculations can sometimes be smaller than the inherent error of measuring certain quantities (my addendum; the channel does not comment on this, but it is true).

Regarding errors in calculations, the example the presenter gives is the rendering of 3D graphics in a game. If the rendering of a graphical element is offset by a hundredth or a thousandth of a pixel, this error is acceptable and easily goes unnoticed by the player.

At one point in the video, the presenter gives an example of financial calculation with floating point. Adding 0.1 and 0.2 produces a calculation error that is unacceptable for financial applications. So he suggests working with integers (in units of cents or a fraction of a cent) or else using the decimal type provided by your programming language.
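The integer-cents suggestion can be sketched like this (Python used only for illustration; `to_cents` is a hypothetical helper name):

```python
# Keep amounts as integer cents and only format as currency at the edges
# of the program; integer addition inside the range is always exact.
def to_cents(units: int, cents: int) -> int:
    return units * 100 + cents

total = to_cents(0, 10) + to_cents(0, 20)   # 10 + 20 = 30 cents, exact
print(total == to_cents(0, 30))             # True, unlike 0.1 + 0.2 == 0.3
print(f"{total // 100}.{total % 100:02d}")  # 0.30
```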

Whoever downvoted, could you give a hint of how I can improve my answer?

What would be the difference in this practice when we talk about PHP? This thread says there is no difference... https://answall.com/questions/226826/em-php-existe-difference-entre-double-e-float?rq=1

@Hananiamizrahi regarding double, float and real, it says there is no difference. I have not analyzed PHP's implementation nor gone deep into tests to refute or confirm this fact. Anyway, if there is any difference, it is in the precision (mantissa and exponent value ranges). The decimal, on the other hand, is an arbitrary-point number; in Java it also has an arbitrary mantissa size, in C# it is fixed point with a mantissa of 32/64/128 bits. There must be some PHP class that handles numbers the Java/C# way...

... You should use such a class to do financial/monetary computations, as it does not incur the intrinsic error of floating-point arithmetic precision.
