Relative Error Due To Rounding

Posted by Beetle B. on Tue 15 January 2019

Ranges

The normal range is the set of real numbers with \(\beta^{e_{\textit{min}}}\le|x|\le\Omega\), and the subnormal range is where \(|x|<\beta^{e_{\textit{min}}}\).

Note that these aren’t the set of floating point values. They are the set of real numbers.

Relative Error

For a given rounding function \(o(x)\), the relative error is

\begin{equation*} \epsilon(x)=\left|\frac{x-o(x)}{x}\right| \end{equation*}

for nonzero \(x\) and 0 if \(x\) is zero.

Now if \(x\) is in the normal range, then the bounds on the relative error are:

  • \(\epsilon(x)<\frac{1}{2}\beta^{1-p}\) for RN (round to nearest)
  • \(\epsilon(x)<\beta^{1-p}\) for the directed rounding modes RD, RU and RZ
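
To make this concrete, here’s a quick numerical check (my own sketch, not from the post), assuming IEEE binary64, where \(\beta=2\) and \(p=53\), so the RN bound is \(2^{-53}\):

```python
# A quick check of the RN bound for IEEE binary64 (beta = 2, p = 53),
# where (1/2) * beta**(1 - p) = 2**-53. Exact arithmetic via Fraction.
from fractions import Fraction

def rel_error_rn(x: Fraction) -> Fraction:
    """Relative error of rounding x to the nearest binary64 float."""
    if x == 0:
        return Fraction(0)
    rn = Fraction(float(x))  # float(x) is x rounded to nearest
    return abs((x - rn) / x)

bound = Fraction(1, 2 ** 53)  # (1/2) * beta**(1 - p)
for num, den in [(1, 10), (2, 3), (355, 113)]:
    eps = rel_error_rn(Fraction(num, den))
    assert eps < bound
```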

Proof:

\(x\) is nonzero, as it is in the normal range. If \(x\) is a floating point number, \(\epsilon(x)=0\). Otherwise, \(|x|\in\left(\beta^{e},\beta^{e+1}\right)\) for some \(e\). The distance between two consecutive floating point numbers in this interval is \(\beta^{e-p+1}\). Thus we know that \(|x-\textit{RN}(x)|\le\frac{1}{2}\beta^{e-p+1}\), and \(|x-o(x)|\le\beta^{e-p+1}\) for directed rounding. To get the relative error, divide by \(|x|\), noting that \(|x|>\beta^{e}\).
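
As an aside, the spacing claim is easy to check for binary64 with Python’s math.ulp (a sketch assuming \(\beta=2\), \(p=53\)):

```python
# Spacing of binary64 floats: for |x| in [2**e, 2**(e+1)),
# consecutive floats are beta**(e - p + 1) = 2**(e - 52) apart.
import math

for e in (-3, 0, 10):
    x = 1.5 * 2.0 ** e  # a float whose exponent is e
    assert math.ulp(x) == 2.0 ** (e - 52)
```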

The unit roundoff \(\textbf{u}\) for a radix \(\beta\), precision \(p\) floating point system is:

\begin{equation*} \textbf{u}=\frac{1}{2}\textrm{ulp}(1)=\frac{1}{2}\beta^{1-p} \end{equation*}

for RN, and:

\begin{equation*} \textbf{u}=\textrm{ulp}(1)=\beta^{1-p} \end{equation*}

for the directed rounding modes.
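
For binary64 these constants are \(2^{-53}\) and \(2^{-52}\); the latter is what Python exposes as sys.float_info.epsilon (a small check of my own):

```python
# Unit roundoff for binary64 (beta = 2, p = 53).
import sys

ulp_of_one = 2.0 ** (1 - 53)           # beta**(1 - p) = ulp(1)
assert ulp_of_one == sys.float_info.epsilon
u_rn = 0.5 * ulp_of_one                # u for RN: 2**-53
u_directed = ulp_of_one                # u for RD/RU/RZ: 2**-52
```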

Let \(x\) be in the normal range. Then the relative error \(\epsilon(x)\) when rounding \(x\) to nearest satisfies:

\begin{equation*} \epsilon(x)\le\frac{\beta^{1-p}}{2+\beta^{1-p}}=\frac{\textbf{u}}{1+\textbf{u}} \end{equation*}

To prove this, we split into two cases:

Case I: \(|x|\ge\left(1+\frac{1}{2}\beta^{1-p}\right)\beta^{e}\), and Case II: \(\beta^{e}\le|x|<\left(1+\frac{1}{2}\beta^{1-p}\right)\beta^{e}\).

Before diving into the proof, let’s discuss what these two cases represent. We know that a floating point number is of the form \(M\beta^{e-p+1}\) where \(M\) is a \(p\)-digit integer in base \(\beta\). For the number to be normal, the leading digit of \(M\) must be nonzero. The smallest normal number for a given \(e\) is where \(M=100\dots0=\beta^{p-1}\). This leads to the smallest normal number (for a given \(e\)) being \(\beta^{p-1}\beta^{e-p+1}=\beta^{e}\). Let’s call this number \(f\).

Now what is the second smallest normal number for a given \(e\) (i.e. the next floating point number up)? It has \(M=10\dots01=\beta^{p-1}+1\). The number is \((\beta^{p-1}+1)\beta^{e-p+1}=\beta^{e}(1+\beta^{1-p})\). Let’s call this number \(g\).
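
For binary64 with \(e=0\), \(f=1\) and \(g=1+2^{-52}\); math.nextafter confirms \(g\) is the very next float (a sketch under that assumption):

```python
# f = beta**e and g = beta**e * (1 + beta**(1 - p)) for binary64, e = 0.
import math

f = 1.0                            # beta**e with e = 0
g = math.nextafter(f, math.inf)    # the next float above f
assert g == 1.0 + 2.0 ** -52       # beta**e * (1 + beta**(1 - p))
```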

Now Case II above is when \(f\le|x|<g\) with \(|x|\) closer to \(f\) than to \(g\) (i.e. less than halfway between \(f\) and \(g\); there is no tie here). In words, \(|x|\) rounds to the smallest normal number \(f=\beta^{e}\), with the appropriate sign. Case I covers all other values with that exponent.

Now Case I is simple to handle:

By assumption, \(|x|\ge\left(1+\frac{1}{2}\beta^{1-p}\right)\beta^{e}\). We know that \(|x-\textrm{RN}(x)|\le\frac{1}{2}\beta^{e-p+1}\). Dividing the two quantities gives \(\epsilon(x)\le\frac{\frac{1}{2}\beta^{e-p+1}}{\left(1+\frac{1}{2}\beta^{1-p}\right)\beta^{e}}=\frac{\beta^{1-p}}{2+\beta^{1-p}}\), the desired inequality.

For Case II, we have \(\textrm{RN}(x)=\textrm{sign}(x)\cdot f\). Thus \(|x-\textrm{RN}(x)|=|x|-f=|x|-\beta^{e}\). Dividing by \(|x|\) gives \(\epsilon(x)=1-\frac{\beta^{e}}{|x|}\), which is increasing in \(|x|\); using \(|x|<\left(1+\frac{1}{2}\beta^{1-p}\right)\beta^{e}\) yields \(\epsilon(x)<\frac{\frac{1}{2}\beta^{1-p}}{1+\frac{1}{2}\beta^{1-p}}=\frac{\beta^{1-p}}{2+\beta^{1-p}}\).
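
The bound \(\frac{\textbf{u}}{1+\textbf{u}}\) is actually attained: in binary64, the midpoint \(1+2^{-53}\) ties down to 1 under round-to-nearest-even, and the relative error is exactly \(\frac{\textbf{u}}{1+\textbf{u}}\). A sketch (relying on CPython’s correctly rounded Fraction-to-float conversion):

```python
# The RN bound u/(1+u) is attained at the midpoint just above beta**e.
from fractions import Fraction

u = Fraction(1, 2 ** 53)      # unit roundoff for binary64 RN
x = 1 + u                     # midpoint between 1 and 1 + 2**-52
rn = Fraction(float(x))       # ties-to-even rounds down to 1.0
assert rn == 1
assert abs((x - rn) / x) == u / (1 + u)
```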

For a subnormal number, the relative error can be anywhere from 0 to 1.

Standard Models

Let \(*\in\{+,-,\times,\div\}\) be any of those operations, and let \(o\) be any rounding function we’ve considered. Then for all floating point numbers \(a,b\) such that \(a*b\) neither underflows (i.e. produces an inexact subnormal result) nor overflows, the following hold:

\begin{equation*} o(a*b)=(a*b)(1+\epsilon_{1}),\qquad|\epsilon_{1}|\le\textbf{u} \end{equation*}
\begin{equation*} o(a*b)=(a*b)/(1+\epsilon_{2}),\qquad|\epsilon_{2}|\le\textbf{u} \end{equation*}

These two properties are called the first and second standard models of floating point arithmetic.
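
A small sanity check of both models for binary64 products, using exact rationals (again my own sketch, not from the post):

```python
# Checking both standard models for a few binary64 multiplications.
from fractions import Fraction

u = Fraction(1, 2 ** 53)                   # u for RN in binary64
for a, b in [(0.1, 0.3), (1.0 / 3.0, 7.0), (2.5, 0.7)]:
    exact = Fraction(a) * Fraction(b)      # the real product a*b
    rounded = Fraction(a * b)              # o(a*b): the float product
    eps1 = rounded / exact - 1             # o(a*b) = (a*b)(1 + eps1)
    eps2 = exact / rounded - 1             # o(a*b) = (a*b)/(1 + eps2)
    assert abs(eps1) <= u and abs(eps2) <= u
```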

I may be missing something deeper, but the first seems trivially true: \(a*b\) is a number in the normal range, so it follows immediately from the relative error bounds above.

The second one follows if you use:

\begin{equation*} \epsilon(z)=\frac{|z-o(z)|}{|z|}\le\frac{\textbf{u}}{1+\textbf{u}} \end{equation*}

since then \(|o(z)|\ge|z|-|z-o(z)|\ge\frac{|z|}{1+\textbf{u}}\), so \(|\epsilon_{2}|=\frac{|z-o(z)|}{|o(z)|}\le\textbf{u}\).

But that identity is valid only for round to nearest. What about the other rounding mechanisms?

Here’s a proof from On Relative Errors of Floating Point Operations:

For a given number \(z\) in the normal range, \(\textrm{ufp}(z)\le|o(z)|\le\beta\,\textrm{ufp}(z)\), where \(\textrm{ufp}(z)=\beta^{e}\) (the “unit in the first place”) is the largest power of \(\beta\) not exceeding \(|z|\), i.e. the smallest floating point value with the same exponent as \(z\).
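
Here’s a sketch of \(\textrm{ufp}\) for binary64, using math.frexp (my own helper, not from the paper):

```python
# ufp(z): the largest power of 2 not exceeding |z|, for binary64.
import math

def ufp(z: float) -> float:
    m, e = math.frexp(abs(z))   # |z| = m * 2**e with m in [0.5, 1)
    return 2.0 ** (e - 1)

for z in (0.3, 1.0, 1.9, 1000.5):
    assert ufp(z) <= abs(z) < 2 * ufp(z)
```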

Now \(|z|\) belongs to the interval \([\textrm{ufp}(z),\beta\,\textrm{ufp}(z))\), and this interval contains \((\beta-1)\beta^{p-1}\) floating point values, spaced \(\beta^{1-p}\,\textrm{ufp}(z)\) apart. For RN the rounding error is at most half that spacing, and for the directed modes it is less than the full spacing; in both cases \(|z-o(z)|\le\textbf{u}\,\textrm{ufp}(z)\). Divide by \(|o(z)|\) to get:

\begin{equation*} \left|\frac{z}{o(z)}-1\right|\le\textbf{u}\,\frac{\textrm{ufp}(z)}{|o(z)|} \end{equation*}

But we showed above that \(\textrm{ufp}(z)\le|o(z)|\), so the coefficient of \(\textbf{u}\) is bounded by 1. Setting \(\epsilon_{2}=\frac{z}{o(z)}-1\), so that \(o(z)=z/(1+\epsilon_{2})\), completes the proof.

For round to nearest, the first statement can be sharpened to:

\begin{equation*} o(a*b)=(a*b)(1+\epsilon_{1}),\qquad|\epsilon_{1}|\le\frac{\textbf{u}}{1+\textbf{u}} \end{equation*}

Now let’s consider a more general case: where the result of \(x*y\) could be in the subnormal range. So now, as long as there is no overflow:

\begin{equation*} o(x*y)=(x*y)(1+\epsilon)+\epsilon' \end{equation*}

where \(|\epsilon|\le\textbf{u}\) and \(|\epsilon'|\le\beta^{e_{\textit{min}}}\textbf{u}\).

Specifically, if the result is in the normal range, \(\epsilon'=0\) and the bound is just the standard model. If the result is in the subnormal range, then \(\epsilon=0\), and the bound follows from the spacing of the subnormals, \(\beta^{e_{\textit{min}}-p+1}\):

\begin{equation*} \left|x-o(x)\right|\le\frac{1}{2}\beta^{e_{\textit{min}}-p+1}=\textbf{u}\,\beta^{e_{\textit{min}}} \end{equation*}

for RN, and \(\left|x-o(x)\right|<\beta^{e_{\textit{min}}-p+1}=\textbf{u}\,\beta^{e_{\textit{min}}}\) for the directed modes.
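
In binary64 (\(e_{\textit{min}}=-1022\), \(p=53\)) the subnormal spacing is \(2^{-1074}\) and \(\textbf{u}\,\beta^{e_{\textit{min}}}=2^{-1075}\). A final sketch showing the absolute error hitting that bound at a subnormal midpoint (again relying on Fraction’s correctly rounded conversion):

```python
# The eps' bound is attained in the binary64 subnormal range:
# u * beta**e_min = 2**-53 * 2**-1022 = 2**-1075, half the spacing.
from fractions import Fraction

x = Fraction(3, 2 ** 1075)    # midpoint between 2**-1074 and 2**-1073
rn = Fraction(float(x))       # ties-to-even picks 2**-1073
assert rn == Fraction(1, 2 ** 1073)
assert abs(x - rn) == Fraction(1, 2 ** 1075)   # = u * beta**e_min
```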