Hugues’s paper. Look it up.

This method is useful when your intermediate precision is the same as the target precision.

Given the input \(x\), the first step is to find a \(y\) such that \(f(x)\) can easily and accurately be calculated from \(f(y)\)....

In general, when approximating a function, we’ll split the domain into multiple intervals and approximate each one with a polynomial....

Here are some concerns: Say you want to compute \(a+b+c+d\) and they are all of 32 bit precision, but the machine supports 64 bit....

Sequential Evaluation of Polynomials If you don’t have any parallelism available, Horner’s scheme is a good option. And if you have the...

The book provides an algorithm. I didn’t bother writing the details.

When the condition number is not high, one can do the naive algorithm for the dot product. Otherwise, one should do the...

Reordering the Operands, and a Bit More General ideas: Sort all your summands in ascending order (magnitude). Even more complex, sort...

The problem with the previous error bounds is that they are in terms of quantities like \(\sum|a_{i}|\), which are not known in advance,...

Theorem for FMA Let \(x,y,z\) be nonnegative floating point numbers. Assuming underflow does not occur, then \(xy+z\le...

Let the rounding mode be RN. Assume no overflow occurs. Then if you do recursive summation, the following inequality related to the...

Unless specified otherwise, everything in this chapter assumes no underflow. We often will have many factors of \((1+\epsilon_{i})\),...

An Iterative Method Compute the minimax approximation in a wider format. Then round the coefficient of the constant term. Then recompute...

We discussed calculating the minimax polynomial using Remez’s Algorithm, but we overlooked some subtleties. While the algorithm does...

The error of an FMA calculation is not always a floating point number. However, we can use two floating point numbers to exactly...

Suppose you need to multiply by a constant that is not exactly representable. Think \(\pi\) and the like. We’d like to multiply and...

This is a short section with some details and magic numbers. I did not bother.

The algorithms are in the book. I did not reproduce them here. I did not read the rest of the section. There are a lot more details there.

This section deals with changing bases. The most obvious application is to go back and forth between decimal and binary (to make it easy...

The Basic Methods One way is to use Newton’s iteration on \(f(x)=x^{2}-a\). This method for calculating square root goes back thousands...

This section deals with floating point \(a,b\) values, not necessarily between 1 and 2. Assume they are non-negative, though. For this,...

We need to calculate \(o(a/b)\) where \(a,b\) are binary floating point numbers, and \(o\) is RN, RD, RU or RZ. We have a useful proof:...

Assume \(\beta=2\) for this section. Some of it may not work for decimal. We want to approximate \(b/a\). Assume \(1\le a,b<2\). In...

In this section, assume \(\beta=2\). Now given a floating point \(x\), we want to form two floating point numbers \(x_{h}\) and...

For this article, define a representable pair for a floating point number \(x\) to be any pair \((M,e)\) such that...

The 2MultFMA Algorithm This has been covered elsewhere. It works well when you use FMA. If No FMA Is Available If there is no FMA...

Let \(a,b\) be two floating point numbers. Let \(s\) be \(\RN(a+b)\). Regardless of which number it picks in a tie, it can be shown that...

When you multiply a floating point number by a power of \(\beta\), the result is exact provided there is no over or underflow. Another...

Sterbenz’s Lemma: If your floating point system has denormals, and if \(x,y\) are non-negative, finite floating point numbers such that...

To get \(p\) of the floating point system you are on: i = 0 A = 1.0 B = 2 # The radix. while (A + 1.0) - A == 1.0: A = B * A i += 1...

Suppose we want to compute the radix of a floating point system. The code below will do it for you - it works assuming the...

We never discussed how to calculate \(||f-p||_{\infty}\). Maple has a function to do this, but it can be inaccurate. Most people will...

Sometimes you need a fairly high degree polynomial to get reasonable accuracy, but can achieve a far greater accuracy with a much lower...

Remez’s algorithm is one that converges to the minimax polynomial of a function. The author recommends using a polynomial approximation...

Chebyshev vs Minimax Note that the best minimax polynomial approximation need not be the Chebyshev polynomial. The latter is the best...

The supremum norm is given by \(||f-p||_{\infty}=\max_{a\le x\le b}|f(x)-p(x)|\). It is denoted by \(L^{\infty}\). Given a function...

First, just a definition: A monic polynomial is one whose leading coefficient is 1. We want to find a polynomial of degree \(n\) that...

Do not assume that the operations in a programming language will map to the ones in the standard. The standard was originally written...

We often will approximate functions as polynomial or rational functions. When doing this, we introduce two types of errors:...

I skipped the rest of the chapter (inlcuding hardware details).

NaN Signaling NaNs do not appear as the result of arithmetic operations. When they appear as an operand, they signal an...

Invalid The default result of such an operation is a quiet NaN. The operations that lead to Invalid are: Most operations on a...

This section addresses how one can convert a character sequence into a decimal/binary floating point number. Decimal Character Sequence...

The standard requires that you can compare any two floating point numbers, as long as they share the same radix. The unordered condition...

Rounding Direction Attributes IEEE 754-2008 requires that the following be correctly rounded: Arithmetic operations: Addition...

Arithmetic Operations and Square Root Handling Signed 0 If \(x,y\) are nonzero, and \(x+y=0\) or \(x-y=0\) exactly, then it is \(+0\)...

The standard defines several interchange formats to allow for transferring floating point data between machines. They could be as bit...

The target format is the format of the result. The target precision is the precision of the target format. When computing polynomials,...

Let \(a,b\) be 2 floating point numbers. It can be shown that \((a+b)-\RN(a+b)\) is a floating point number. This may not be true for...

Basic Notions For a binary floating point system, if \(x\) is normal, then the leading bit is 1. Otherwise it is 0. If we have some...

It has been shown that \(\beta=2\) gives better worst case and average accuracy than all other bases. if...

Floating point addition and multiplication are still commutative. Associativity is compromised, though. An example: Let...

In IEEE-754, the implementer can signal an exception along with the result of the operation. Usually (or perhaps mandated?), the signal...

Let \(o\) be the rounding function, and \(a,b,c\) are floating point numbers. Then \(\mathrm{FMA}(a,b,c)\) is \(o(ab+c)\). if...

Converting From ULP Errors to Relative Errors Let \(x\) be in the normal range, and \(|x-X|=\alpha\ulp(x)\). Then: \begin{equation*}...

There are multiple definitions of unit in the last place. I think most agree when \(x\) is not near a boundary point. Here is the...

Ranges The normal range is the set of real numbers: \(\beta^{e_{\textit{min}}}\le|x|\le\Omega\) and the subnormal range are where...

The IEEE 754-2008 specifies five rounding functions: Round toward \(-\infty\) (RD): It is the largest floating point number less than or...

0 (some systems have signed 0’s as well) NaN for any invalid operation \(\infty\) (some systems are signed, some are not). In the IEEE...

Underflow before rounding occurs when the absolute value of the exact value is strictly less than \(\beta^{e_{\textit{min}}}\) (i.e. the...

We would like a unique way to represent \(x\). One approach is to pick the one which gives the smallest exponent possible (while still...

A radix \(\beta\) floating point number \(x\) is of the form \(m\beta{e}\), where \(|m|<\beta\) is called the significand and \(e\) is...