The IEEE 754-2008 standard specifies five rounding functions:
- Round toward \(-\infty\) (RD): The result is the largest floating point number less than or equal to \(x\).
- Round toward \(+\infty\) (RU): The result is the smallest floating point number greater than or equal to \(x\).
- Round toward zero (RZ): The result is the closest floating point number whose absolute value is no greater than that of \(x\).
- Round ties to even (RN): When \(x\) falls exactly halfway between two consecutive floating point numbers, pick the one whose significand is even (its last digit is even). Otherwise round to the nearest.
- Round ties to away (RN): When \(x\) falls exactly halfway between two consecutive floating point numbers, pick the one of greater magnitude. Otherwise round to the nearest.
Round ties to even is the default rounding in IEEE 754-2008.
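As a rough illustration of the five functions, the sketch below uses Python's `decimal` module and rounds to integers, so the integers stand in for the set of floating point numbers. The names `MODES` and `round_all` are made up for this example; the mapping from the abbreviations above to the `decimal` rounding constants is the only thing it relies on.

```python
from decimal import (
    Decimal,
    ROUND_FLOOR,      # toward -infinity
    ROUND_CEILING,    # toward +infinity
    ROUND_DOWN,       # toward zero
    ROUND_HALF_EVEN,  # nearest, ties to even
    ROUND_HALF_UP,    # nearest, ties away from zero
)

# Hypothetical labels for this example, mapped to decimal's rounding constants.
MODES = {
    "RD": ROUND_FLOOR,
    "RU": ROUND_CEILING,
    "RZ": ROUND_DOWN,
    "RN-even": ROUND_HALF_EVEN,
    "RN-away": ROUND_HALF_UP,
}


def round_all(x: str) -> dict:
    """Round the decimal string x to an integer under each of the five modes."""
    d = Decimal(x)
    return {
        name: int(d.quantize(Decimal("1"), rounding=mode))
        for name, mode in MODES.items()
    }


print(round_all("2.5"))   # {'RD': 2, 'RU': 3, 'RZ': 2, 'RN-even': 2, 'RN-away': 3}
print(round_all("-2.5"))  # {'RD': -3, 'RU': -2, 'RZ': -2, 'RN-even': -2, 'RN-away': -3}
```

The halfway inputs are chosen on purpose: they are the only inputs on which the two round-to-nearest variants disagree.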
The result of a function is called correctly rounded if it equals the value obtained by first computing the function with infinite precision and unlimited range, and then rounding that exact value with one of the rounding functions above.
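One way to see what "correctly rounded" means is to compute an exact result with rationals and round it once. The sketch below does this for addition, assuming the platform performs binary64 addition in round-ties-to-even and that CPython's `Fraction` to `float` conversion rounds to nearest; the helper name `correctly_rounded_sum` is hypothetical.

```python
from fractions import Fraction


def correctly_rounded_sum(a: float, b: float) -> float:
    """Add two doubles exactly as rationals, then round once to a double."""
    exact = Fraction(a) + Fraction(b)  # Fraction(float) is exact, so no error here
    return float(exact)                # a single rounding to the nearest double


# Under the assumptions above, hardware addition gives the same answer.
print(correctly_rounded_sum(0.1, 0.2) == 0.1 + 0.2)  # True
```

Note that the inputs `0.1` and `0.2` are themselves already rounded; the comparison is about the sum of the two stored doubles, not about the decimal values 0.1 and 0.2.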
For \(\beta=2\), with precision \(p\), and a positive normal \(x\) (i.e. \(x\ge2^{e_{\textit{min}}}\)), let the infinitely precise significand be \(1.m_{1}m_{2}m_{3}\dots\). The rounded significand keeps the bits \(1.m_{1}\dots m_{p-1}\). Define the round bit to be \(m_{p}\) and the sticky bit to be the logical OR of all bits from \(m_{p+1}\) onwards.
For positive \(x\), the round and sticky bits determine the result as shown in the table below:
round | sticky | RD | RU | RN |
---|---|---|---|---|
0 | 0 | - | - | - |
0 | 1 | - | + | - |
1 | 0 | - | + | -/+ |
1 | 1 | - | + | + |
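A \(-\) means that the significand is merely truncated to \(p\) bits.
A \(+\) means that the significand is truncated and then \(2^{-p+1}\) (one unit in its last place) is added to the result.
A \(-/+\) marks the halfway case, where the result depends on the tie-breaking rule: ties to even picks whichever candidate has an even significand, and ties to away picks \(+\).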
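The round and sticky rules can be sketched in a few lines with exact rational arithmetic. This is a minimal illustration for positive \(x\), \(\beta=2\) and an unbounded exponent range (so no denormals or overflow); the function name `round_significand` and the mode labels are assumptions made for the example.

```python
from fractions import Fraction


def round_significand(x: Fraction, p: int, mode: str) -> Fraction:
    """Round a positive rational x to p significant bits (beta = 2)."""
    if x <= 0:
        raise ValueError("this sketch only handles x > 0")
    # Find e such that 2**e <= x < 2**(e + 1).
    e = x.numerator.bit_length() - x.denominator.bit_length()
    if Fraction(2) ** e > x:
        e -= 1
    # Scale so that the integer part holds the bits 1 m_1 ... m_p.
    scaled = x * Fraction(2) ** (p - e)
    n = scaled.numerator // scaled.denominator
    round_bit = bool(n & 1)   # m_p
    sticky = scaled != n      # OR of m_{p+1}, m_{p+2}, ...
    truncated = n >> 1        # the kept bits 1 m_1 ... m_{p-1}
    if mode == "RD":
        bump = False
    elif mode == "RU":
        bump = round_bit or sticky
    elif mode == "RN-even":
        # Above halfway, or exactly halfway with an odd truncated significand.
        bump = round_bit and (sticky or (truncated & 1) == 1)
    elif mode == "RN-away":
        bump = round_bit
    else:
        raise ValueError(f"unknown mode: {mode}")
    # One unit in the last place of the significand is 2**(e - p + 1).
    return (truncated + int(bump)) * Fraction(2) ** (e - p + 1)


x = Fraction(9, 8)  # 1.001 in binary: with p = 3, round bit 1 and sticky bit 0
print(round_significand(x, 3, "RN-even"))  # 1
print(round_significand(x, 3, "RN-away"))  # 5/4
```

With \(p=53\) and round ties to even, the same function reproduces the double nearest to \(1/10\), that is, `round_significand(Fraction(1, 10), 53, "RN-even") == Fraction(0.1)`.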
RD, RU and RZ are called the directed rounding modes.
A rounding breakpoint is the value where the rounding function changes value. For RD, RU and RZ, the breakpoints are floating point numbers. For RN, they are the halfway points in between floating point numbers.
Useful Properties
All the rounding functions are monotonically non-decreasing: if \(x\le y\), then the rounded value of \(x\) is less than or equal to the rounded value of \(y\).
Handling Denormals and Large Values
Let \(\alpha=\beta^{e_{\textit{min}}-p+1}\) be the smallest positive denormal number.
Let \(\Omega=(\beta-\beta^{1-p})\beta^{e_{\textit{max}}}\) be the largest finite floating point number.
Then:
- RN for even is 0 if \(0<x\le\alpha/2\)
- RN for even is \(+\infty\) if \(x\ge(\beta-\beta^{1-p}/2)\beta^{e_{\textit{max}}}\)
- RN for away is 0 if \(0<x<\alpha/2\)
- RN for away is \(+\infty\) if \(x\ge(\beta-\beta^{1-p}/2)\beta^{e_{\textit{max}}}\)
- RD is 0 if \(0<x<\alpha\)
- RD is \(\Omega\) if \(x\ge\Omega\)
- RU is \(\alpha\) if \(0<x\le\alpha\)
- RU is \(+\infty\) if \(x>\Omega\)
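The thresholds above can be checked with exact rational arithmetic. The sketch below encodes the list directly and defaults to the binary64 parameters (\(p=53\), \(e_{\textit{min}}=-1022\), \(e_{\textit{max}}=1023\)); the function name `boundary_result` is hypothetical, and it returns `None` whenever \(x\) lies outside the listed ranges rather than attempting a full rounding.

```python
from fractions import Fraction
import math


def boundary_result(x, mode, p=53, e_min=-1022, e_max=1023, beta=2):
    """Return the rounded value of x > 0 if it falls in a listed range, else None."""
    b = Fraction(beta)
    alpha = b ** (e_min - p + 1)                     # smallest positive denormal
    omega = (b - b ** (1 - p)) * b ** e_max          # largest finite number
    overflow = (b - b ** (1 - p) / 2) * b ** e_max   # RN overflow threshold
    x = Fraction(x)
    if x <= 0:
        raise ValueError("this sketch only handles x > 0")
    if mode == "RN-even":
        if x <= alpha / 2:
            return Fraction(0)
        if x >= overflow:
            return math.inf
    elif mode == "RN-away":
        if x < alpha / 2:
            return Fraction(0)
        if x >= overflow:
            return math.inf
    elif mode == "RD":
        if x < alpha:
            return Fraction(0)
        if x >= omega:
            return omega
    elif mode == "RU":
        if x <= alpha:
            return alpha
        if x > omega:
            return math.inf
    else:
        raise ValueError(f"unknown mode: {mode}")
    return None  # x is not in one of the listed ranges


half_alpha = Fraction(1, 2 ** 1075)           # exactly alpha / 2 for binary64
print(boundary_result(half_alpha, "RN-even"))  # 0
print(boundary_result(half_alpha, "RN-away"))  # None (the tie rounds to alpha instead)
```

The two calls differ only in the tie at \(\alpha/2\): ties to even sends it to 0 because 0 has the even significand, while ties to away sends it to \(\alpha\), which is why the corresponding inequality in the list is strict.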