Tag: handbookfp

Languages and Compilers

Here are some concerns: Say you want to compute \(a+b+c+d\) and they are all of 32-bit precision, but the machine supports 64-bit....
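
For instance (a hypothetical illustration; it assumes NumPy for a 32-bit float type, and the values are mine): accumulating in 32 bits loses low-order terms that a 64-bit accumulator keeps.

    import numpy as np

    vals = [np.float32(1.0), np.float32(2.0**-24), np.float32(2.0**-24)]

    s32 = np.float32(0.0)
    for v in vals:
        s32 = np.float32(s32 + v)   # every partial sum rounded to 32 bits

    s64 = 0.0                       # Python float: a 64-bit accumulator
    for v in vals:
        s64 += float(v)

    print(s32 == np.float32(1.0))               # True: both 2**-24 terms were absorbed
    print(np.float32(s64) == np.float32(1.0))   # False: they survived in 64 bits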

Compensated Polynomial Evaluation

The book provides an algorithm. I didn’t bother writing the details.

Compensated Dot Products

When the condition number is not high, one can do the naive algorithm for the dot product. Otherwise, one should do the...
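I'd guess the compensated version. As a sketch of that flavor (not necessarily the book's exact algorithm), here is an Ogita-Rump-Oishi-style Dot2, assuming Python 3.13+ for math.fma:

    import math

    def two_sum(a, b):
        # Knuth's 2Sum: a + b == s + e exactly.
        s = a + b
        ap = s - b
        bp = s - ap
        e = (a - ap) + (b - bp)
        return s, e

    def two_prod(a, b):
        # Error-free product via FMA: a*b == p + e exactly.
        p = a * b
        e = math.fma(a, b, -p)
        return p, e

    def dot2(xs, ys):
        # Compensated dot product: carry all the rounding errors in a side sum.
        s, comp = 0.0, 0.0
        for x, y in zip(xs, ys):
            p, e1 = two_prod(x, y)
            s, e2 = two_sum(s, p)
            comp += e1 + e2
        return s + comp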

Computing Sums More Accurately

Reordering the Operands, and a Bit More: General ideas: Sort all your summands in ascending order of magnitude. Even more complex, sort...
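
A minimal sketch of the first idea:

    def sorted_sum(xs):
        # Recursive summation, smallest magnitudes first, so small terms
        # get a chance to accumulate before meeting the big ones.
        s = 0.0
        for x in sorted(xs, key=abs):
            s += x
        return s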

Computing Validated Running Error Bounds

The problem with the previous error bounds is that they are in terms of quantities like \(\sum|a_{i}|\), which are not known in advance,...
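
A sketch of the general flavor, for recursive summation in binary64: each RN addition satisfies \(|\RN(t)-t|\le u|\RN(t)|\) with \(u=2^{-53}\), so a bound can be accumulated alongside the sum. (A truly validated bound would also have to account for the rounding in the bound's own updates, which is what this section is about; this sketch ignores that.)

    def sum_with_running_bound(xs, u=2.0**-53):
        # Accumulate the sum and, alongside it, a bound on its error.
        s, bound = 0.0, 0.0
        for x in xs:
            s = s + x
            bound = bound + u * abs(s)   # each addition errs by <= u*|s|
        return s, bound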

Properties For Deriving Validated Running Error Bounds

Theorem for FMA: Let \(x,y,z\) be nonnegative floating point numbers. Assuming underflow does not occur, then \(xy+z\le...

Some Refined Error Estimates

Let the rounding mode be RN. Assume no overflow occurs. Then if you do recursive summation, the following inequality related to the...

Notation For Error Analysis and Classical Error Estimates

Unless specified otherwise, everything in this chapter assumes no underflow. We often will have many factors of \((1+\epsilon_{i})\),...

Evaluation of the Error of an FMA

The error of an FMA calculation is not always a floating point number. However, we can use two floating point numbers to exactly...

Multiplication by an Arbitrary Precision Constant with an FMA

Suppose you need to multiply by a constant that is not exactly representable. Think \(\pi\) and the like. We’d like to multiply and...
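
As I recall the scheme (hedged; the book spells out the conditions under which this is correctly rounded): represent the constant as a two-word sum \(C_h+C_l\), multiply by the low word first, then fold in the high word with a single FMA. math.fma needs Python 3.13+; the function name is mine.

    import math

    C_h = math.pi                    # RN(pi)
    C_l = 1.2246467991473532e-16     # approximately pi - RN(pi)

    def mul_pi(x):
        u1 = x * C_l                 # low-order contribution, rounded once
        return math.fma(x, C_h, u1)  # high word folded in with one more rounding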

Conversions Between Integers and Floating Point Numbers

This is a short section with some details and magic numbers. I did not bother.

Radix Conversion Algorithms

The algorithms are in the book. I did not reproduce them here. I did not read the rest of the section. There are a lot more details there.

Conditions on the Formats

This section deals with changing bases. The most obvious application is to go back and forth between decimal and binary (to make it easy...
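
For binary64 the familiar consequence is that 17 significant decimal digits always round-trip, and Python's repr emits a shortest such string:

    x = 0.1
    assert float(repr(x)) == x        # repr gives a shortest round-tripping decimal
    assert float(f"{x:.17g}") == x    # 17 significant digits always suffice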

Newton-Raphson Based Square Root With FMA

The Basic Methods: One way is to use Newton’s iteration on \(f(x)=x^{2}-a\). This method for calculating square root goes back thousands...
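
Newton on \(f(x)=x^{2}-a\) gives \(x\leftarrow x-(x^{2}-a)/(2x)=(x+a/x)/2\), i.e. Heron's rule. A bare sketch for \(a>0\) (the starting guess via exponent halving is mine):

    import math

    def heron_sqrt(a, iters=6):
        # Newton's iteration on f(x) = x**2 - a: x <- (x + a/x) / 2.
        m, e = math.frexp(a)          # a = m * 2**e with 0.5 <= m < 1
        x = math.ldexp(1.0, e // 2)   # decent guess: roughly 2**(e/2)
        for _ in range(iters):        # quadratic convergence from there
            x = 0.5 * (x + a / x)
        return x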

Possible Double Rounding in Division Algorithms

This section deals with floating point \(a,b\) values, not necessarily between 1 and 2. Assume they are non-negative, though. For this,...

Using The Newton Iteration For Correctly Rounded Division With FMA

We need to calculate \(o(a/b)\) where \(a,b\) are binary floating point numbers, and \(o\) is RN, RD, RU or RZ. We have a useful proof:...

Variants of the Newton Raphson Iteration

Assume \(\beta=2\) for this section. Some of it may not work for decimal. We want to approximate \(b/a\). Assume \(1\le a,b<2\). In...
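The book walks through several variants; one standard shape (my sketch, with a textbook linear initial guess where a real implementation would use a lookup table; math.fma needs Python 3.13+):

    import math

    def recip_newton(a, iters=4):
        # Newton's iteration for 1/a on [1, 2): x <- x + x*(1 - a*x).
        # The linear minimax guess (24 - 8a)/17 keeps the initial error
        # below ~1/17, so four iterations reach double precision.
        x = (24.0 - 8.0 * a) / 17.0
        for _ in range(iters):
            e = math.fma(-a, x, 1.0)   # residual 1 - a*x, one rounding
            x = math.fma(x, e, x)      # x <- x * (1 + e)
        return x

    # b/a for 1 <= a, b < 2: multiply the reciprocal by b. Close, though not
    # guaranteed correctly rounded without a final FMA-based correction step.
    print(1.5 * recip_newton(1.2), 1.5 / 1.2)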

Another Splitting Technique: Splitting Around a Power of 2

In this section, assume \(\beta=2\). Now given a floating point \(x\), we want to form two floating point numbers \(x_{h}\) and...
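I won't reproduce the exact algorithm, but a familiar instance of the idea is splitting a double at its binary point by adding and subtracting \(2^{52}\) (valid for \(0\le x<2^{51}\), round-to-nearest, and no extended-precision reassociation; fine in CPython):

    C = 2.0 ** 52

    def split_at_binary_point(x):
        # Adding C forces rounding at spacing 1, so the fraction falls away.
        x_h = (x + C) - C     # nearest integer to x
        x_l = x - x_h         # exact, by a Sterbenz-style argument
        return x_h, x_l

    h, l = split_at_binary_point(3.7)
    print(h)             # 4.0
    print(h + l == 3.7)  # True: the split is exact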

Computation of Residuals of Division and Square Root With an FMA

For this article, define a representable pair for a floating point number \(x\) to be any pair \((M,e)\) such that...
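The punchline I remember: with \(q=\RN(a/b)\), the residual \(a-bq\) is exactly representable, and a single FMA computes it exactly (math.fma needs Python 3.13+):

    import math

    a, b = 1.0, 3.0
    q = a / b                 # q = RN(a/b)
    r = math.fma(-b, q, a)    # the residual a - b*q, exactly
    print(r)                  # tiny but nonzero: q is not exactly 1/3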

Accurate Computation of the Product of Two Numbers

The 2MultFMA Algorithm: This has been covered elsewhere. It works well when you use FMA. If No FMA Is Available: If there is no FMA...
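
For reference, 2MultFMA is just two operations (math.fma needs Python 3.13+, and over/underflow must not occur):

    import math

    def two_mult_fma(a, b):
        r1 = a * b                  # r1 = RN(ab)
        r2 = math.fma(a, b, -r1)    # r2 = ab - r1, exactly
        return r1, r2               # r1 + r2 == a*b exactly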

Accurate Computation of the Sum of Two Numbers

Let \(a,b\) be two floating point numbers. Let \(s\) be \(\RN(a+b)\). Regardless of which number it picks in a tie, it can be shown that...
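That error term is itself a floating point number, and Knuth's 2Sum recovers it in six operations with no assumption on the magnitudes of \(a\) and \(b\) (a sketch):

    def two_sum(a, b):
        s = a + b                 # s = RN(a+b)
        ap = s - b                # reconstruct the two rounded contributions
        bp = s - ap
        e = (a - ap) + (b - bp)   # e = (a+b) - s, exactly
        return s, e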

Exact Multiplications and Divisions

When you multiply a floating point number by a power of \(\beta\), the result is exact provided there is no over- or underflow. Another...
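
A quick check of the power-of-\(\beta\) case in binary (Fraction makes float values exact, so the comparison is honest):

    from fractions import Fraction

    x = 0.1
    assert Fraction(x * 2**10) == Fraction(x) * 2**10   # scaling by 2**k is exact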

Exact Addition

Sterbenz’s Lemma: If your floating point system has denormals, and if \(x,y\) are non-negative, finite floating point numbers such that...
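A small check of the conclusion (the condition, as I recall it, is \(y/2\le x\le 2y\); again Fraction verifies exactness):

    from fractions import Fraction

    x, y = 1.75, 1.0                 # y/2 <= x <= 2y holds
    d = x - y
    assert Fraction(d) == Fraction(x) - Fraction(y)   # the subtraction was exact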

Computing The Precision

To get the precision \(p\) of the floating point system you are on:

    i = 0
    A = 1.0
    B = 2  # The radix.
    while (A + 1.0) - A == 1.0:
        A = B * A
        i += 1
    # On exit, A = B**i is the first power of B for which A + 1 is not
    # representable, so i equals the precision p.
    ...

Computing The Radix

Suppose we want to compute the radix of a floating point system. The code below will do it for you - it works assuming the...
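The excerpt cut the code off; presumably it is essentially Malcolm's algorithm, something like this (I have only checked it under round-to-nearest):

    A = 1.0
    while (A + 1.0) - A == 1.0:   # grow A = 2**i until adding 1 is absorbed
        A = 2.0 * A
    B = 1.0
    while (A + B) - A != B:       # probe until B hits the float spacing at A
        B = B + 1.0
    print(B)                      # B is now the radix beta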

IEEE Support in Programming Languages

Do not assume that the operations in a programming language will map to the ones in the standard. The standard was originally written...

Rest of chapter

I skipped the rest of the chapter (including hardware details).

Special Values

NaN: Signaling NaNs do not appear as the result of arithmetic operations. When they appear as an operand, they signal an...

Default Exception Handling

Invalid: The default result of such an operation is a quiet NaN. The operations that lead to Invalid are: Most operations on a...

Conversions To/From String Representations

This section addresses how one can convert a character sequence into a decimal/binary floating point number. Decimal Character Sequence...

Comparisons

The standard requires that you can compare any two floating point numbers, as long as they share the same radix. The unordered condition...
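In Python, for instance:

    nan = float("nan")
    print(nan == nan)             # False: NaN is unordered, even against itself
    print(nan < 1.0, nan > 1.0)   # False False
    print(nan != nan)             # True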

Attributes and Rounding

Rounding Direction Attributes: IEEE 754-2008 requires that the following be correctly rounded: Arithmetic operations: Addition...

Operations Specified By The Standards

Arithmetic Operations and Square Root. Handling Signed 0: If \(x,y\) are nonzero, and \(x+y=0\) or \(x-y=0\) exactly, then the computed result is \(+0\)...
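
A quick probe under the default round-to-nearest (math.copysign exposes the sign of a zero):

    import math

    z = 5.0 - 5.0
    print(math.copysign(1.0, z))      # 1.0: x - x is +0 here
    print(math.copysign(1.0, -0.0))   # -1.0: a literal -0 keeps its sign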

Formats

The standard defines several interchange formats to allow for transferring floating point data between machines. They could be as bit...

Note on the Choice of Radix

It has been shown that \(\beta=2\) gives better worst case and average accuracy than all other bases. If...

Lost and Preserved Properties of Arithmetic

Floating point addition and multiplication are still commutative. Associativity is compromised, though. An example: Let...
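In binary64, for instance:

    eps = 2.0 ** -53
    print((1.0 + eps) + eps == 1.0)   # True: each addition rounds back to 1.0
    print(1.0 + (eps + eps) == 1.0)   # False: this grouping gives 1 + 2**-52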

Floating Point Exceptions

In IEEE-754, the implementer can signal an exception along with the result of the operation. Usually (or perhaps mandated?), the signal...

Fused Multiply Add

Let \(o\) be the rounding function and let \(a,b,c\) be floating point numbers. Then \(\mathrm{FMA}(a,b,c)\) is \(o(ab+c)\). If...
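
A sketch of why the single rounding matters (math.fma needs Python 3.13+; the values are mine):

    import math

    a, b = 1.0 + 2.0**-27, 1.0 - 2.0**-27   # a*b = 1 - 2**-54 exactly
    print(a * b - 1.0)                      # 0.0: the product rounded up to 1.0
    print(math.fma(a, b, -1.0))             # -2**-54, kept by the single rounding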

ULP Errors vs Relative Errors

Converting From ULP Errors to Relative Errors: Let \(x\) be in the normal range, and \(|x-X|=\alpha\ulp(x)\). Then: \begin{equation*}...

The ULP Function

There are multiple definitions of unit in the last place. I think most agree when \(x\) is not near a boundary point. Here is the...
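Python exposes one definition directly (math.ulp, available since 3.9):

    import math

    print(math.ulp(1.0))       # 2.220446049250313e-16, i.e. 2**-52 in binary64
    print(math.ulp(2.0**52))   # 1.0: the spacing between floats reaches 1 here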

Relative Error Due To Rounding

Ranges: The normal range is the set of real numbers with \(\beta^{e_{\textit{min}}}\le|x|\le\Omega\), and the subnormal range is where...
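
For binary64 concretely:

    import sys

    print(sys.float_info.min)   # 2.2250738585072014e-308 = 2**-1022, smallest normal
    print(5e-324)               # 2**-1074, the smallest subnormal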

Rounding Functions

The IEEE 754-2008 specifies five rounding functions: Round toward \(-\infty\) (RD): It is the largest floating point number less than or...

The Other “Numbers”

0 (some systems have signed 0’s as well)
NaN for any invalid operation
\(\infty\) (some systems are signed, some are not).
In the IEEE...

Underflow

Underflow before rounding occurs when the absolute value of the exact value is strictly less than \(\beta^{e_{\textit{min}}}\) (i.e. the...

Normalizing

We would like a unique way to represent \(x\). One approach is to pick the one which gives the smallest exponent possible (while still...

Definitions

A radix \(\beta\) floating point number \(x\) is of the form \(m\cdot\beta^{e}\), where \(|m|<\beta\) is called the significand and \(e\) is...
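
In Python one can peek at such a decomposition (math.frexp normalizes to \(0.5\le|m|<1\), which is one convention among several):

    import math

    x = 3.25
    m, e = math.frexp(x)   # x == m * 2**e with 0.5 <= |m| < 1
    print(m, e)            # 0.8125 2
    print(x.hex())         # 0x1.a000000000000p+1: significand and exponent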