Numerical representations - fixed and floating
There are two main ways to represent real numbers in binary:
- Fixed point;
- Floating point.
In summary, fixed-point numbers are represented by fixing the position of the binary point, which in turn fixes the size of the integer and fractional parts of the number to be stored:
- Bits to the left of this point represent values greater than or equal to 1 (1, 2, 4, 8 ...);
- Bits to the right of this point represent values less than 1 (½, ¼ ...).
Fixed-point representation can simplify the processing of real numbers, although it imposes significant limitations on the range and precision of the numbers that can be stored, since the binary point is literally fixed.
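To make the idea concrete, here is a minimal sketch in Python (the Q8.8 layout, the scale factor and the function names are illustrative choices, not something defined in the text above) of how a real number could be encoded with a fixed binary point: 8 bits for the integer part and 8 bits for the fractional part.

```python
# Illustrative fixed-point encoding: 8 integer bits + 8 fractional bits (Q8.8).
FRACTIONAL_BITS = 8
SCALE = 1 << FRACTIONAL_BITS  # 256: the fixed position of the binary point

def to_fixed(x: float) -> int:
    """Encode a real number as a 16-bit fixed-point integer (Q8.8)."""
    return round(x * SCALE)

def from_fixed(raw: int) -> float:
    """Decode a Q8.8 fixed-point integer back to a real number."""
    return raw / SCALE

value = 3.14159
encoded = to_fixed(value)
print(encoded, from_fixed(encoded))  # 804 3.140625
```

Note how 3.14159 comes back as 3.140625: with the point fixed, precision is limited to multiples of 1/256, and anything outside the 8-bit integer range cannot be stored at all.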
Floating point is the most common way of representing real numbers in binary. In short, it follows a binary version of scientific notation, so a number has three parts:
- The sign (which is represented by a bit to the left of the number);
- The significand (also referred to as the mantissa), which determines the precision of the number to be stored;
- The exponent, which determines the magnitude of the number to be stored.
Note that the total number of bits available for this representation is limited (32 bits in single precision or 64 bits in double precision). Thus, greater precision requires more bits for the mantissa, which reduces the number of bits available for the exponent and therefore shrinks the representable range.
Think of binary floating point numbers as binary scientific notation.
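As an illustration of that analogy, the sketch below (assuming Python and its standard struct module; the function name is made up for this example) unpacks a double-precision value into the three parts described above: 1 sign bit, 11 exponent bits and 52 mantissa bits.

```python
import struct

def decompose(x: float):
    """Split an IEEE 754 double into sign, biased exponent and mantissa bits."""
    bits = int.from_bytes(struct.pack(">d", x), "big")
    sign     = bits >> 63                 # 1 bit
    exponent = (bits >> 52) & 0x7FF       # 11 bits, biased by 1023
    mantissa = bits & ((1 << 52) - 1)     # 52 fraction bits
    return sign, exponent, mantissa

sign, exponent, mantissa = decompose(-6.25)
print(sign, exponent - 1023, 1 + mantissa / (1 << 52))  # 1 2 1.5625
```

For -6.25 = -1.5625 × 2², the sign bit is 1, the stored exponent is 1025 (2 plus the bias of 1023), and the significand decodes back to 1.5625, exactly the "binary scientific notation" described above.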
The IEEE 754 standard
Note that although floating point defines a way of representing real numbers, for a long time there was in fact no standard governing this kind of representation and its operations.
In view of this need for standardization, the Institute of Electrical and Electronics Engineers published the first version of the IEEE 754 standard (also known as the IEEE Standard for Floating-Point Arithmetic) in 1985. It was heavily based on the work of William Kahan.
This standard addressed several issues, such as:
- Arithmetic formats (sets of binary and decimal floating-point data);
- Standardized interchange formats (encodings that can be used to exchange floating-point values in an efficient and compact way);
- Standardized rounding rules;
- Arithmetic operations (such as trigonometric functions);
- Standardized handling of exceptional cases (such as overflow, division by zero, etc.), illustrated in the short sketch after this list.
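A brief sketch of those exceptional cases, assuming Python's usual IEEE 754 double semantics (note that Python raises ZeroDivisionError for float division by zero instead of returning infinity, so overflow is used here to produce the special values):

```python
import math

inf = 1e308 * 10   # overflow of a finite double -> +infinity
nan = inf - inf    # an undefined operation -> NaN ("not a number")

print(inf, math.isinf(inf))   # inf True
print(nan, math.isnan(nan))   # nan True
print(nan == nan)             # False: NaN compares unequal even to itself
```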
Note, then, that the standard essentially "created little that was new". Its main objective was to standardize the use of existing technologies and techniques, making topics such as those listed above easier to deal with.
Currently, most architectures and systems use this standard, which was not true a few decades ago. Imagine how difficult it was to exchange numerical data between two machines that represented real numbers in different ways.
Several versions of this specification have been published; the most recent is from 2019.
I am not going to dwell too much on how each of these rules works. For this, see the page on Wikipedia (or the specification itself).
Nor will I explain (again) how the representation postulated by IEEE 754 works, since it has already been explained in this excellent answer, here as well (although more briefly), and here.
IEEE (Institute of Electrical and Electronics Engineers) Standard 754 for Binary Floating-Point Arithmetic. See: https://en.wikipedia.org/wiki/IEEE_754 and also https://people.eecs.berkeley.edu/~wkahan/ieee754status/IEEE754.PDF.
– anonimo