## Floating Point Accuracy

### History

Floating point was introduced early in computer design to make the best use of limited memory. Different applications need to store numbers of widely varying magnitudes to different levels of accuracy. Fixing the number of integer and fractional digits would restrict the range of applications; floating-point numbers overcome this problem.

Floating point represents a number as two sequences of bits: a significand holding the digits of the number, and an exponent that determines the position of the radix point (the decimal point in base 10). A negative significand represents a negative number; a negative exponent represents a number whose magnitude is less than one.
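The split into significand and exponent can be made visible with a short sketch. It assumes that Python's `float` is an IEEE-754 double (true for CPython on common platforms); the helper name `decompose` is made up for this illustration.

```python
import struct


def decompose(x: float) -> tuple[int, int, int]:
    """Split a double into its sign bit, biased exponent and stored fraction bits."""
    # Reinterpret the 8 bytes of the double as one unsigned 64-bit integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                     # 1 bit
    exponent = (bits >> 52) & 0x7FF       # 11 bits, stored with a bias of 1023
    fraction = bits & ((1 << 52) - 1)     # 52 stored significand bits
    return sign, exponent, fraction


sign, exponent, fraction = decompose(-6.5)
# For normal numbers the significand has an implicit leading 1.
# -6.5 = -1.625 * 2**2
print(sign, exponent - 1023, 1 + fraction / 2**52)  # prints: 1 2 1.625
```

The sign is kept in its own bit, so the significand itself is stored as a magnitude.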

Computer hardware generally implements binary floating point as defined by the IEEE-754 standard. The most common formats are 32 or 64 bits long:

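Format | Total bits | Significand bits | Exponent bits | Smallest positive normal number | Largest number |
---|---|---|---|---|---|
Single precision | 32 | 23 + 1 sign | 8 | ca. 1.2 ⋅ 10^{-38} | ca. 3.4 ⋅ 10^{38} |
Double precision | 64 | 52 + 1 sign | 11 | ca. 2.2 ⋅ 10^{-308} | ca. 1.8 ⋅ 10^{308} |

Subnormal numbers extend the range towards zero, with reduced precision, down to ca. 1.4 ⋅ 10^{-45} (single precision) and ca. 4.9 ⋅ 10^{-324} (double precision).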

### Rounding errors

Floating-point numbers cannot represent all real numbers exactly because the number of digits is limited: when a value needs more digits than the format allows, it is rounded to a nearby representable value (by default in IEEE-754, the nearest one). Rounding is unavoidable because of:

- Large denominators - In any base, the larger the denominator of an (irreducible) fraction, the more digits it needs in positional notation. A sufficiently large denominator will require rounding, no matter what the base or number of available digits is.
- Repeating digits - Any (irreducible) fraction whose denominator has a prime factor that does not occur in the base requires an infinite number of digits that repeat periodically after a certain point. For example, 1/10 terminates in base 10 but repeats forever in base 2.
- Irrational numbers - Irrational numbers cannot be represented as a fraction of two integers at all, and in positional notation they require an infinite number of non-recurring digits.
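The three cases above can be illustrated with a small sketch. The function `expand` is a hypothetical helper, written for this example, that performs long division in an arbitrary base and reports where the digits start to repeat. The float checks assume Python's `float` is an IEEE-754 double.

```python
import math
from fractions import Fraction


def expand(numerator: int, denominator: int, base: int, max_digits: int = 64):
    """Long division of the fractional part of numerator/denominator in `base`.

    Returns (digits, repeat_start). repeat_start is the index at which the
    digits begin to repeat, or None if the expansion terminates (or if
    max_digits is reached before a repeat is found).
    """
    digits, seen = [], {}
    remainder = numerator % denominator
    while remainder and len(digits) < max_digits:
        if remainder in seen:              # same remainder again: digits repeat
            return digits, seen[remainder]
        seen[remainder] = len(digits)
        remainder *= base
        digits.append(remainder // denominator)
        remainder %= denominator
    return digits, None


# Repeating digits: 1/10 terminates in base 10 but repeats in base 2.
print(expand(1, 10, 10))   # ([1], None)
print(expand(1, 10, 2))    # ([0, 0, 0, 1, 1], 1), i.e. 0.0(0011) repeating

# The double stored for 0.1 is therefore only a nearby fraction.
print(Fraction(0.1) == Fraction(1, 10))   # False
print(0.1 + 0.2 == 0.3)                   # False

# Large denominators: 1 + 1/2**60 needs 61 significant bits, a double has 53.
print(1.0 + 2.0**-60 == 1.0)              # True

# Irrational numbers: the square root of 2 is rounded, so squaring it misses 2.
print(math.sqrt(2) ** 2 == 2.0)           # False
```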

#### Further Reading

David Goldberg, "What Every Computer Scientist Should Know About Floating-Point Arithmetic": https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html

### Glossary

Real Number - a value that represents a quantity along a continuous line. The real numbers comprise all the rational and irrational numbers.

Rational Number - a value that can be expressed as the fraction p/q of two integers, a numerator p and a non-zero denominator q. Rational numbers form a dense subset of the real numbers. The positional expansion of a rational number either terminates after a finite number of digits or eventually repeats the same finite sequence of digits indefinitely.

Irrational Number - a real number that is not rational. The decimal expansion of an irrational number continues indefinitely without repeating. Since the set of rational numbers is countable and the set of real numbers is uncountable, almost all real numbers are irrational.