Exact Addition

Posted by Beetle B. on Wed 20 March 2019

Sterbenz’s Lemma:

If your floating point system has denormals, and if \(x,y\) are non-negative, finite floating point numbers such that \(y/2\le x\le2y\), then \(x-y\) is a floating point number.

Proof:

We can assume \(y\le x\le2y\). Let \(M_{x},e_{x}\) be the significand and exponent of \(x\), and define similar variables for \(y\). Now we can easily show that \(e_{y}\le e_{x}\). Now let \(\delta=e_{x}-e_{y}\). Then:

\begin{equation*} x-y=\left(M_{x}\beta^{\delta}-M_{y}\right)\beta^{e_{y}-p+1}=M\beta^{e_{y}-p+1} \end{equation*}

Now we know that \(M\ge0\) because \(x\ge y\).

We know that \(M\) is an integer.

From \(x\le2y\) we know that \(x-y\le y\). Thus \(M\beta^{e_{y}-p+1}\le M_{y}\beta^{e_{y}-p+1}\). This gives \(M\le M_{y}\le\beta^{p}-1\). The proof is complete.

We need denormals because \(x-y\) could be a denormal.

Another theorem: If \(x\) and \(y\) are floating point numbers, and if \(\RN(x+y)\) is subnormal, then \(\RN(x+y)=x+y\).

Proof: All floating point numbers are integral multiples of \(\alpha\). Thus so is \(x+y\). If it is a subnormal number, then it is less than \(\beta^{e_{\min}}\). Thus it is exactly representable. All multiples of \(\alpha\) that are subnormal are exactly representable!