## Basic Notions

For a binary floating point system, if \(x\) is normal, then the leading significand bit is 1; otherwise it is 0. Since this bit is determined once we know whether the number is normal, we need not store it if we have some other mechanism to denote normality. In IEEE-754, this information is embedded in the exponent bits, and the leading bit is left implicit.

System | \(p\) | \(e_{\min}\) | \(e_{\max}\)
---|---|---|---
32 bits | 23+1 | -126 | 127
64 bits | 52+1 | -1022 | 1023
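These parameters can be checked from inside a language runtime. A small sketch in Python; note that `sys.float_info` follows the C convention where the significand lies in \([1/2,1)\), so its reported exponents are offset by one from the table's convention (significand in \([1,2)\)):

```python
import sys

# 64-bit IEEE-754 double: p = 53 significand bits (52 stored + 1 implicit).
assert sys.float_info.mant_dig == 53

# Shift by one to convert from C's [1/2, 1) convention to the table's [1, 2).
e_min = sys.float_info.min_exp - 1
e_max = sys.float_info.max_exp - 1
print(e_min, e_max)  # -1022 1023
```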

Many pocket calculators use \(\beta=10\). One reason for this is that ordinary decimal numbers like 0.1 have no finite binary expansion, so they are not exactly representable when \(\beta=2\).
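A quick way to see this in Python: converting a `float` to `Decimal` reveals the exact binary value that was actually stored.

```python
from decimal import Decimal

# The exact value of the double nearest to 0.1 (slightly above 0.1).
print(Decimal(0.1))

# 0.1 is not a machine number in base 2.
assert Decimal(0.1) != Decimal("0.1")
```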

Base 10 is also used for financial calculations.

## Rounding Functions :rounding:

A **machine number** is a number that can be represented exactly in the floating point system.

If \(x,y\) are machine numbers, then IEEE-754 mandates that computing \(x*y\), where \(*\in\{+,-,\times,\div\}\), must return \(\circ(x*y)\), where \(\circ\) is the rounding function. In words: imagine computing the result to infinite precision, and then rounding it once. This is called **correct rounding**.
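A tiny illustration of \(\circ(x+y)\) in double precision, using round-to-nearest, ties-to-even (the default mode):

```python
# p = 53 for doubles, so ulp(1) = 2**-52.
x = 1.0
tie = 2.0**-53  # exactly half an ulp at 1.0

# The exact sum 1 + 2**-53 lies halfway between 1 and 1 + 2**-52;
# ties-to-even picks 1.0 (even significand).
assert x + tie == 1.0

# 1 + 2**-52 is itself a machine number, so that sum is exact.
assert x + 2.0**-52 > 1.0
```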

This property gives some advantages:

- Full reproducibility between computing systems: the same computation gives bit-identical results everywhere
- Makes the mathematical analysis of operations much easier

IEEE-754 (2008) additionally *recommends* (but does not require) correct rounding for several elementary functions. See the book for the list.

If \(x\) and \(y\) are floating point numbers such that \(x/2\le y\le2x\), and the system has denormals and correct rounding, then \(x-y\) is itself a floating point number, so the subtraction is exact (no rounding occurs). This is known as Sterbenz's lemma.
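This is easy to spot-check with exact rational arithmetic; a small sketch (the subnormal pair shows why denormals are needed, since the difference there is itself subnormal):

```python
from fractions import Fraction

def subtraction_is_exact(x: float, y: float) -> bool:
    """Compare the floating-point difference with the exact rational one."""
    return Fraction(x) - Fraction(y) == Fraction(x - y)

# x/2 <= y <= 2x holds for each pair, so x - y incurs no rounding.
for x, y in [(1.5, 1.25), (7.0, 13.0), (2.0**-1022, 1.5 * 2.0**-1022)]:
    assert x / 2 <= y <= 2 * x
    assert subtraction_is_exact(x, y)
```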

The book does not have a proof for this and I did not try proving it.

## ULPs :ulp:

\(\ulp(0)\) is defined to be \(\beta^{e_{\min}-p+1}\), the smallest positive denormal (for doubles, \(2^{-1074}\)).

## Fused Multiply Add :fma:

Some benefits of FMA:

- Exact computation of division remainders
- Evaluation of polynomials via Horner’s Rule is much faster

Beware that you can violate the monotonicity of rounding if you use the FMA clumsily. Consider evaluating \(\sqrt{x^{2}-y^{2}}\) with \(x=y=1+2^{-52}\) (a machine number), and suppose you compute \(x^{2}-y^{2}\) with one FMA as \(\RN(\RN(x\times x)-y\times y)\). The exact square is \(1+2^{-51}+2^{-104}\), so \(\RN(x\times x)=1+2^{-51}\) is rounded *down*. The fused step then subtracts the *exact* \(y^{2}=1+2^{-51}+2^{-104}\) and rounds, giving \(-2^{-104}<0\), even though the true value of \(x^{2}-y^{2}\) is 0. The square root of a negative number is NaN.
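Standard Python has no portable FMA before 3.13, so this sketch simulates the fused step with exact rational arithmetic (`float(Fraction(...))` rounds correctly to nearest, which is exactly what the FMA's final rounding does):

```python
from fractions import Fraction

x = y = 1.0 + 2.0**-52  # a machine number

# Plain operators: x*x and y*y round identically, so the difference is 0.
assert x * x - y * y == 0.0

# Simulated FMA: RN(RN(x*x) - y*y). The product y*y stays exact inside
# the FMA, so the downward rounding error of RN(x*x) survives.
fused = float(Fraction(x * x) - Fraction(y) * Fraction(y))
assert fused == -(2.0**-104)  # negative: sqrt(fused) would be NaN
```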

Now if you had just used elementary operators, you wouldn't have this problem: \(\RN(x\times x)-\RN(y\times y)\) is exactly 0, because monotonicity is guaranteed for correctly rounded operators.

## IEEE 754 (2008)

In the IEEE-754 (2008) standard, the most significant bit is the sign bit, then the exponent bits, then the significand (with the leading bit omitted).

Note that the exponent bits encode a non-negative integer \(E\), so we
define a **bias**: 127 for 32 bits, 1023 for 64 bits. The decoding rule
is this: if \(E=0\), the number is either 0 or a denormal, and the
actual exponent is \(e_{\min}=1-\text{bias}\). If \(E>0\) (and not all
1's), subtract the bias to get the actual exponent. The all-1's value
of \(E\) is reserved for infinities and NaN's.

How do we differentiate between a NaN and an infinity? For infinity, the significand bits are all 0's; otherwise, it is a NaN.
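The layout can be inspected directly by reinterpreting a double's 64 bits as an integer; a sketch using only the standard library:

```python
import math
import struct

def decode(x: float) -> tuple[int, int, int]:
    """Split a double into (sign, biased exponent E, significand bits)."""
    bits = int.from_bytes(struct.pack(">d", x), "big")
    return bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)

assert decode(1.0) == (0, 1023, 0)       # actual exponent 0 + bias 1023
assert decode(math.inf) == (0, 2047, 0)  # E all 1's, significand all 0's
_, E, f = decode(math.nan)
assert E == 2047 and f != 0              # E all 1's, significand nonzero
```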

One property of this way of ordering the bits: to get the next larger floating point number after a positive finite \(x\), simply interpret \(x\)'s bit pattern as an integer and add 1.
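This can be checked against the standard library's `math.nextafter` (Python 3.9+); a sketch for positive finite doubles:

```python
import math
import struct

def next_up(x: float) -> float:
    """Next larger double, by adding 1 to the bit pattern (positive x)."""
    bits = int.from_bytes(struct.pack(">d", x), "big")
    return struct.unpack(">d", (bits + 1).to_bytes(8, "big"))[0]

# Matches nextafter even across 0 -> subnormal and subnormal -> subnormal.
for x in (1.0, 0.0, 2.0**-1074, 1e308):
    assert next_up(x) == math.nextafter(x, math.inf)
```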