Basic Notions of Floating Point Arithmetic

Posted by Beetle B. on Wed 23 January 2019

Basic Notions

For a binary floating point system, if \(x\) is normal, then the leading bit of its significand is 1; otherwise it is 0. If we have some other mechanism to denote normality, then we need not store this redundant bit. In practice, this information is embedded into the exponent bits, so the leading bit can be left implicit.

System              p (precision, in bits)   e_min    e_max
32 bits (binary32)  23 + 1 = 24              -126     127
64 bits (binary64)  52 + 1 = 53              -1022    1023
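
As a quick sanity check of the binary64 row, here is a small Python sketch (assuming CPython, where a float is an IEEE binary64) showing \(p=53\) and the effect of the hidden bit near 1.0:

```python
import sys

# Assumes CPython, where a Python float is an IEEE-754 binary64.
print(sys.float_info.mant_dig)   # 53, i.e. p = 52 + 1 (the hidden bit is counted)
print(sys.float_info.max)        # (2 - 2**-52) * 2**1023 ~= 1.7976931348623157e+308
print(sys.float_info.min)        # 2**-1022 ~= 2.2250738585072014e-308 (smallest normal)

# The hidden-bit precision in action: the gap just above 1.0 is 2**-52.
print(1.0 + 2**-53 == 1.0)       # True:  exactly half a gap; ties-to-even rounds back to 1.0
print(1.0 + 2**-52 == 1.0)       # False: exactly one ulp above 1.0
```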

Many pocket calculators use \(\beta=10\). One reason for this is that ordinary decimal numbers like 0.1 are not exactly representable when \(\beta=2\).
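
To see this concretely, here is a small Python sketch that prints the exact value actually stored when you write 0.1 as a binary64 float:

```python
from fractions import Fraction
from decimal import Decimal

# The double written as 0.1 is the nearest binary64 value, not 1/10:
print(Fraction(0.1))   # 3602879701896397/36028797018963968, i.e. 3602879701896397 / 2**55
print(Decimal(0.1))    # 0.1000000000000000055511151231257827021181583404541015625
```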

Base 10 is also used for financial calculations.

Rounding Functions

A machine number is a number that can be represented exactly in the floating point system.

If \(x,y\) are machine numbers, then IEEE-754 mandates that when computing \(x*y\), where \(*\in\{+,-,\times,\div\}\), we get \(\circ(x*y)\), where \(\circ\) is the rounding function. In words: imagine computing the result to infinite precision, then rounding it once. This is called correct rounding.
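
One way to check this property in Python (a sketch that assumes CPython doubles, the default round-to-nearest mode, and that converting a Fraction to a float is itself correctly rounded in CPython): compute the exact rational result with Fraction, round it to a double, and compare with the hardware result.

```python
from fractions import Fraction

x, y = 0.1, 0.3   # machine numbers (whatever exact doubles these literals map to)

exact_sum = Fraction(x) + Fraction(y)     # the infinitely precise sum
assert x + y == float(exact_sum)          # the hardware sum is that value, rounded once

exact_prod = Fraction(x) * Fraction(y)
assert x * y == float(exact_prod)         # likewise for multiplication
print("correct rounding confirmed for these inputs")
```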

This property gives some advantages:

  • Full compatibility between computing systems: the same operation on the same inputs gives identical results everywhere
  • Helps a lot in the mathematical analysis of operations

The IEEE-754 (2008) revision now also recommends (but does not require) correct rounding of several elementary functions. See the book for the list.

If \(x\) and \(y\) are floating point numbers such that \(x/2\le y\le2x\), and the floating point system has denormals and correct rounding, then \(x-y\) is itself a floating point number, so the subtraction incurs no rounding. (This is Sterbenz's lemma.)
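
A quick numerical spot-check of this claim (not a proof), using Fraction to detect any rounding:

```python
from fractions import Fraction
import random

# Spot-check: whenever x/2 <= y <= 2x, the computed x - y equals the exact difference.
for _ in range(100_000):
    x = random.uniform(1e-3, 1e3)
    y = random.uniform(x / 2, 2 * x)
    if not (x / 2 <= y <= 2 * x):   # guard against endpoint rounding in uniform()
        continue
    assert Fraction(x) - Fraction(y) == Fraction(x - y)
print("no rounding observed in any sampled subtraction")
```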

The book does not have a proof for this and I did not try proving it.

ULPs

\(\ulp(0)\) is defined to be \(\beta^{e_{\min}-p+1}\), which is the smallest positive subnormal number.

Fused Multiply Add

A fused multiply add (FMA) computes \(xy+z\) with a single rounding at the end. Some benefits of FMA:

  • Exact computation of division remainders
  • Evaluation of polynomials via Horner's Rule is much faster (one FMA per step) and slightly more accurate (one rounding per step instead of two); see the sketch after this list
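
A minimal Horner's-rule sketch with one FMA per step (the function name horner_fma is just this sketch's choice, and math.fma requires Python 3.13+; the speed benefit comes from hardware FMA instructions, which this Python version only mimics structurally):

```python
import math  # math.fma requires Python 3.13+ (an assumption of this sketch)

def horner_fma(coeffs, x):
    """Evaluate coeffs[0] + coeffs[1]*x + ... using one fused multiply-add per step."""
    acc = 0.0
    for c in reversed(coeffs):
        acc = math.fma(acc, x, c)   # one rounding per step instead of two
    return acc

# (x - 1)**3 expanded, evaluated very close to its triple root at x = 1
print(horner_fma([-1.0, 3.0, -3.0, 1.0], 1.0000001))
```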

Beware that you can effectively violate the monotonicity of rounding if you use the FMA clumsily. Consider evaluating \(\sqrt{x^{2}-y^{2}}\) with \(x=y=1+2^{-52}\); note that this is a machine number (the double right after 1), and the true result is 0. Now say you compute \(x^{2}-y^{2}\) with an FMA as \(\RN(\RN(x^{2})-y\times y)\). \(x^{2}\) is rounded down to \(1+2^{-51}\); the FMA then subtracts the exact \(y\times y\) from it and rounds, giving exactly \(-2^{-104}\), which is negative. Taking its square root gives NaN.
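
Here is that example in Python (a sketch assuming Python 3.13+ for math.fma):

```python
import math  # math.fma requires Python 3.13+ (an assumption of this sketch)

x = y = 1.0 + 2.0 ** -52         # machine number: the double just above 1.0

naive = x * x - y * y            # RN(RN(x*x) - RN(y*y)) = 0.0
fused = math.fma(-y, y, x * x)   # RN(RN(x*x) - y*y): the exact y*y is subtracted

print(naive)                     # 0.0, so sqrt(naive) is 0.0
print(fused)                     # about -4.93e-32, exactly -2**-104, so sqrt would give NaN
print(fused == -2.0 ** -104)     # True
```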

Had you just used the elementary operators, you wouldn't have this problem, because monotonicity is guaranteed for those operators: \(x^{2}\) and \(y^{2}\) round to the same value, so their difference is exactly 0.

IEEE 754 (2008)

In the IEEE-754 (2008) standard, the most significant bit is the sign bit, then come the exponent bits, then the significand bits (with the leading bit omitted).

Now note that the exponent bits, read as an unsigned integer \(E\), give a non-negative number, so you define a bias. For 32 bits the bias is 127; for 64 bits it is 1023. The basic idea is this: if \(E=0\), the number is either 0 or a denormal, and the exponent is taken to be \(e_{\min}\) (that is, \(1-\text{bias}\)) with no implicit leading 1. If \(E>0\), subtract the bias to get the real exponent. There is one exception: \(E\) being all 1's is reserved for infinities and NaN's.

How do we differentiate between a NaN and an infinity? For infinity, the significand is all 0’s. Otherwise, it is a NaN.
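A small Python sketch that decodes these fields from a double's bit pattern (the names decode, sign, E, frac are just this sketch's choices):

```python
import struct

def decode(x: float):
    """Split a binary64 value into (sign bit, biased exponent E, significand field)."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    E = (bits >> 52) & 0x7FF           # 11 exponent bits, biased by 1023
    frac = bits & ((1 << 52) - 1)      # 52 stored significand bits (hidden bit omitted)
    if E == 0x7FF:
        kind = "inf" if frac == 0 else "nan"
    elif E == 0:
        kind = "zero" if frac == 0 else "subnormal"
    else:
        kind = "normal"
    return sign, E, frac, kind

print(decode(1.0))           # (0, 1023, 0, 'normal'): real exponent = 1023 - 1023 = 0
print(decode(float("inf")))  # (0, 2047, 0, 'inf'): all-ones exponent, zero significand
print(decode(float("nan")))  # (0, 2047, <nonzero>, 'nan'): all-ones exponent, nonzero significand
print(decode(5e-324))        # (0, 0, 1, 'subnormal'): E = 0, nonzero significand
```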

One property of this way of ordering the bits: for a positive, finite number, if you want the next larger floating point number, simply interpret the bit pattern as an integer and add 1.
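
For example (a sketch for positive, finite doubles only, reinterpreting the bits via struct):

```python
import math
import struct

def next_up(x: float) -> float:
    """Next larger double for a positive, finite x (the only case this sketch handles)."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    return struct.unpack(">d", struct.pack(">Q", bits + 1))[0]

print(next_up(1.0))                                   # 1.0000000000000002
print(next_up(1.0) == math.nextafter(1.0, math.inf))  # True (math.nextafter: Python 3.9+)
```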