
IEEE 754-1985

The IEEE 754 standard defines a representation of floating point numbers and rules for manipulating them. The standard was published in 1985 and revised in 2008 - this page currently describes only the original 1985 standard.

Briefly, floating point numbers are a way to represent a wide numeric range in a limited set of bits by storing a limited number of significant digits. As the absolute size of the number increases, the absolute precision decreases.

The principle is similar to exponential notation for numbers (e.g. $3.523 \times 10^6$) except that IEEE floating point values use powers of 2 instead of 10.
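As a small illustration of absolute precision falling as magnitude grows, the sketch below checks whether adding 1 to a single precision value changes it at all. The helper name add_one_is_lost is made up for this page, and the code assumes that float is an IEEE 754 single precision value.

```c
#include <stdbool.h>

/* Hypothetical helper: returns true if adding 1.0f to x leaves x unchanged,
 * which happens once the gap between adjacent floats near x exceeds 1. */
bool add_one_is_lost(float x)
{
    /* volatile forces the sum to be rounded to float instead of being
     * held in a wider register, as some compilers may otherwise do. */
    volatile float sum = x + 1.0f;
    return sum == x;
}
```

Under those assumptions, add_one_is_lost(1000.0f) is false, whereas add_one_is_lost(16777216.0f) is true, because above $2^{24}$ a 24-bit significand can no longer represent every integer.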

Binary Format

The standard defines two basic precisions - single is a 32-bit representation and double is a 64-bit representation. These typically correspond to the C types float and double on platforms where the underlying hardware uses IEEE 754 (which is to say most current hardware architectures).

Each value is split into three chunks of bits, with byte boundaries ignored:

Section | Single | Double | Meaning
Sign | 1 bit | 1 bit | Set if the number is negative
Exponent | 8 bits | 11 bits | The power of 2 by which the significand is multiplied
Significand | 23 bits | 52 bits | The significant digits of the number

Note that the significand is also known as the mantissa in some texts, although this is discouraged by the IEEE 754 standards committee and others because of confusion with other uses of the term.
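The single precision layout in the table above can be sketched in C as follows. The struct float_fields type and the split_float function are illustrative names rather than any standard API, and the code assumes that float is a 32-bit IEEE 754 value.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative type holding the three raw fields of a single. */
struct float_fields {
    uint32_t sign;        /* 1 bit */
    uint32_t exponent;    /* 8 bits, still in biased form */
    uint32_t significand; /* 23 bits, without the implicit leading digit */
};

struct float_fields split_float(float value)
{
    uint32_t bits;
    struct float_fields fields;

    /* memcpy copies the bit pattern without the aliasing problems
     * of casting a float pointer to an integer pointer. */
    memcpy(&bits, &value, sizeof bits);

    fields.sign = bits >> 31;
    fields.exponent = (bits >> 23) & 0xFFu;
    fields.significand = bits & 0x7FFFFFu;
    return fields;
}
```

For example, 1.0f should split into a sign of 0, a stored exponent of 127 and a significand field of 0.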

Sign bit

This single bit is 0 for positive numbers and 1 for negative numbers. Note that the sign bit is valid in most cases even for special values - for example, the standard differentiates between positive and negative zero.

Exponent

The exponent indicates the power of 2 by which the significand is multiplied. It is stored in biased form, which is an easy way to store a signed value in an unsigned field by simply adding a fixed value. The range of the exponent, and the bias value which is added to it to obtain the unsigned value actually stored, are:

Precision | Bias | Range of valid exponents
Single | 127 | -126 to 127
Double | 1023 | -1022 to 1023

Note that the range is missing the values at each end — this is because a zero exponent and an exponent with all bits set both have special meanings.
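A minimal sketch of removing the bias for single precision is shown below. The function name single_unbias_exponent is hypothetical, and it treats the two reserved field values (0 and 255) as failures rather than decoding them.

```c
/* Hypothetical helper: converts a stored 8-bit exponent field into the
 * actual power of 2. Returns 1 on success, or 0 if the field holds one
 * of the reserved values (all zeroes or all ones). */
int single_unbias_exponent(unsigned stored, int *actual)
{
    if (stored == 0u || stored >= 255u) {
        return 0; /* reserved for zero/denormals and infinity/NaN */
    }
    *actual = (int)stored - 127; /* subtract the single precision bias */
    return 1;
}
```

The same approach would apply to double precision with a bias of 1023 and a reserved upper value of 2047.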

Significand

This portion of the value stores the significant binary digits. To save space, there is assumed to be a leading 1 digit. For example, the significand 1.0100111… is stored as 0100111…. This form is said to be normalised.

Note that this is a simplification as there is also a denormalised form for values near zero, which is described below.

Normalised Values

The most common form of IEEE floating point numbers is the normalised form — this is where the exponent has a value in the valid range (once the bias has been subtracted from the unsigned value stored). As explained above, the significand stores only the digits after the leading 1, which is implicit.
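To make the role of the implicit leading 1 concrete, the following sketch rebuilds the value of a normalised single from its raw 32-bit pattern. The name decode_normal_single is made up for this page, and the function assumes the caller has already checked that the exponent field is neither all zeroes nor all ones.

```c
#include <math.h>
#include <stdint.h>

/* Hypothetical helper: decodes a normalised single precision bit pattern.
 * Assumes the exponent field is in the range 1 to 254. */
double decode_normal_single(uint32_t bits)
{
    uint32_t sign = bits >> 31;
    int exponent = (int)((bits >> 23) & 0xFFu) - 127; /* remove the bias */
    uint32_t fraction = bits & 0x7FFFFFu;

    /* Restore the implicit leading 1 as bit 23 to get a 24-bit integer,
     * then scale by 2^(exponent - 23) to put the binary point back. */
    double magnitude = ldexp((double)(fraction | 0x800000u), exponent - 23);

    return sign ? -magnitude : magnitude;
}
```

For instance, the pattern 0x40200000 has a stored exponent of 128 and a significand field of 0100…, so it decodes to $1.01_2 \times 2^1 = 2.5$.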

If the standard C library is available, the frexp() function normalises a floating point value such that the fractional part $x$ will be in the range $0.5 \le x < 1.0$. Multiplying this value by 2 and reducing the exponent by 1 yields a value in the desired range $1.0 \le x < 2.0$. At this point the leading digit can then be discarded as the implicit leading 1 (see the Significand section for details).
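The adjustment described above might look like the following sketch. The wrapper name normalise_1_2 is hypothetical, and it is only meaningful for finite, non-zero inputs.

```c
#include <math.h>

/* Hypothetical wrapper around frexp(): returns a fraction whose magnitude
 * is in the range 1.0 <= x < 2.0 and stores the matching power of 2.
 * Only meaningful for finite, non-zero values. */
double normalise_1_2(double value, int *exponent)
{
    double fraction = frexp(value, exponent); /* 0.5 <= |fraction| < 1.0 */

    *exponent -= 1;        /* compensate for the doubling below */
    return fraction * 2.0; /* 1.0 <= |result| < 2.0 */
}
```

For an input of 10.0, frexp() gives 0.625 with an exponent of 4, so the wrapper returns 1.25 with an exponent of 3.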

If frexp() is available then it should be used, as it will likely use the underlying hardware representation to avoid expensive loops. However, a naive implementation can quite simply mimic its functionality — the following version demonstrates the principle, but a production version would also need to check for special values (zero, NaN, infinities) as well as catching under- and overflows:

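double sample_frexp(double value, int *exponent)
{
    /* Work on the magnitude so that negative inputs terminate too.
       Assumes value is finite and non-zero. */
    double magnitude = (value < 0.0) ? -value : value;

    *exponent = 0;
    if (magnitude < 1.0) {
        /* Scale up; the strict comparison leaves exactly 0.5 untouched,
           since 0.5 is already inside the target range. */
        while (magnitude < 0.5) {
            magnitude *= 2.0;
            --(*exponent);
        }
    } else {
        /* Scale down until the magnitude drops below 1.0. */
        while (magnitude >= 1.0) {
            magnitude /= 2.0;
            ++(*exponent);
        }
    }
    /* Restore the original sign, as frexp() does. */
    return (value < 0.0) ? -magnitude : magnitude;
}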

Denormalised Values

At the lower end of the scale, very small numbers can be stored in denormalised form, where the implicit leading digit is a 0 instead of 1. In IEEE 754 this is represented by an exponent field of all zeroes and a non-zero significand. The actual exponent that this value represents is one higher than would be expected from a zero exponent field:

Precision Denormalised Exponent
Single -126
Double -1022

At first sight it appears that this introduces overlap with the normalised numbers, as these are the lowest value exponents for a normalised value. However, the leading zero in the significand means that in fact there's no overlap.
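The lack of overlap can be checked with a short sketch that decodes a denormalised single from its bit pattern. The name decode_denormal_single is made up for this page, and the function assumes the exponent field of the pattern is all zeroes. The result is returned as a double, where these small values are ordinary normalised numbers.

```c
#include <math.h>
#include <stdint.h>

/* Hypothetical helper: decodes a denormalised single precision bit
 * pattern. Assumes the 8-bit exponent field is zero. */
double decode_denormal_single(uint32_t bits)
{
    uint32_t sign = bits >> 31;
    uint32_t fraction = bits & 0x7FFFFFu;

    /* No implicit leading 1 here: the value is (fraction / 2^23) x 2^-126,
     * which is the same as fraction x 2^-149. */
    double magnitude = ldexp((double)fraction, -126 - 23);

    return sign ? -magnitude : magnitude;
}
```

With every significand bit set (0x007FFFFF) the result is $(1 - 2^{-23}) \times 2^{-126}$, which is still just below the smallest normalised value of $1 \times 2^{-126}$.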

Special Values

Aside from normalised and denormalised numbers, there are a variety of values represented by specific bit patterns in the representation.

Zero

Sign bit Any
Exponent Zero
Significand Zero

A value of exactly zero is represented by an exponent and significand of zero. The sign bit may be set or unset and IEEE 754 has the concept of both a positive and negative zero. For standard comparisons, however, these will both compare equal with zero, so the comparison $-0.0 < 0.0$ yields false.

To determine the sign of a floating point value including zero, the copysign() function can be used with a non-zero value, or the signbit() macro can be used more directly on some platforms (not available on WinCE, for example).
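As a sketch of the copysign() approach, the hypothetical helper below (is_negative_zero is not a standard function) transfers the sign of the value under test onto 1.0, which can then be compared in the usual way.

```c
#include <math.h>
#include <stdbool.h>

/* Hypothetical helper: true only for negative zero. */
bool is_negative_zero(double value)
{
    /* value == 0.0 holds for both zeroes, so the sign has to be read
     * separately by copying it onto a non-zero value. */
    return value == 0.0 && copysign(1.0, value) < 0.0;
}
```

Where signbit() is available, the second condition could be written as signbit(value) != 0 instead.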

Infinity

Sign bit Any
Exponent All bits set
Significand Zero

If all bits are set in the exponent and the significand is zero then the value represented is either positive or negative infinity, depending on the sign bit.
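A bit-level test for infinity might be sketched as below. The names float_bits and is_infinity are illustrative, and the code assumes a 32-bit IEEE 754 float. In practice the C99 isinf() macro is the more usual choice where it is available.

```c
#include <math.h>   /* INFINITY, used when calling the helper */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Illustrative helper: returns the raw bit pattern of a float. */
uint32_t float_bits(float value)
{
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);
    return bits;
}

/* Illustrative helper: true for positive or negative infinity. */
bool is_infinity(float value)
{
    /* Mask off the sign bit, then require an all-ones exponent
     * (0x7F800000) together with a zero significand. */
    return (float_bits(value) & 0x7FFFFFFFu) == 0x7F800000u;
}
```

Both is_infinity(INFINITY) and is_infinity(-INFINITY) should be true, while a NaN fails the test because its significand is non-zero.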

NaN

Sign bit Any
Exponent All bits set
Significand Non-zero

If all bits are set in the exponent and the significand is non-zero then the value represented is not a number, often abbreviated to NaN. This is a special range of values which are returned by operations that don't yield a valid arithmetic value.

Since any non-zero significand value is permitted, this allows a range of values to be specified. The sign bit may also be set or unset, which could be used to differentiate between different types of NaN. The main distinction is between a quiet NaN which has the most-significant bit of the significand set and a signalling NaN which has the MSB of the significand clear (see footnote 1), although the overall value must still be non-zero.

The intention of the signalling NaN is that using it in an operation will raise some sort of exception, and then go on to yield a quiet NaN if a result is required. This means that each error is only raised once. However, support for this varies between platforms.
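The quiet and signalling distinction can be sketched as a classification of a raw single precision bit pattern. The enum and function names are made up for this page, and the code assumes the common convention in which a set most-significant significand bit means quiet. On architectures that invert the sense of that bit, the two results would swap.

```c
#include <stdint.h>

/* Illustrative classification of a 32-bit pattern. */
enum nan_kind { NOT_NAN, QUIET_NAN, SIGNALLING_NAN };

enum nan_kind classify_nan_bits(uint32_t bits)
{
    uint32_t exponent = (bits >> 23) & 0xFFu;
    uint32_t fraction = bits & 0x7FFFFFu;

    /* A NaN needs an all-ones exponent and a non-zero significand;
     * a zero significand would mean infinity instead. */
    if (exponent != 0xFFu || fraction == 0u) {
        return NOT_NAN;
    }

    /* Assumed convention: MSB of the significand set means quiet. */
    return (fraction & 0x400000u) ? QUIET_NAN : SIGNALLING_NAN;
}
```

The sign bit is ignored, so 0x7FC00000 and 0xFFC00000 are both classified as quiet.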

There are three types of operations which can produce a NaN value:

  • Operations which are provided an existing NaN value as an argument.
  • Operations whose results are mathematically indeterminate - some examples are listed below:
    • $0.0 / 0.0$ and $\pm\infty / \pm\infty$
    • $0.0 \times \pm\infty$
    • $\infty - \infty$ and equivalents
  • Operations which yield complex results - some examples are listed below:
    • $\sqrt{-n}$ for $n > 0$
    • $\log{-n}$ for $n > 0$
    • $\sin^{-1}{x}$ or $\cos^{-1}{x}$ where $x < -1$ or $x > 1$
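A few of the indeterminate cases listed above can be reproduced with plain arithmetic, as in the sketch below. The function names are hypothetical, and the code assumes the platform performs IEEE 754 arithmetic, so that dividing by zero yields an infinity or NaN instead of trapping. The self-comparison used to detect NaN can be defeated by aggressive compiler options such as fast-math modes. The C99 isnan() macro is an alternative where available.

```c
#include <stdbool.h>

/* NaN is the only value that compares unequal to itself. */
bool is_nan(double value)
{
    return value != value;
}

double zero_divided_by_zero(void)
{
    volatile double zero = 0.0; /* volatile: evaluate at run time */
    return zero / zero;
}

double infinity_minus_infinity(void)
{
    volatile double zero = 0.0;
    double infinity = 1.0 / zero; /* positive infinity under IEEE 754 */
    return infinity - infinity;
}

double zero_times_infinity(void)
{
    volatile double zero = 0.0;
    double infinity = 1.0 / zero;
    return zero * infinity;
}
```

Under those assumptions, is_nan() should return true for all three results and false for any ordinary number or infinity.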

Limits

Values in the table below apply regardless of sign, since the sign bit is an independent quantity. Base 10 values are all rounded to one decimal place.

Limit | Single (base 2) | Single (base 10) | Double (base 2) | Double (base 10)
Smallest denormal | 2⁻²³ × 2⁻¹²⁶ | 1.4 × 10⁻⁴⁵ | 2⁻⁵² × 2⁻¹⁰²² | 4.9 × 10⁻³²⁴
Middle denormal | 2⁻¹ × 2⁻¹²⁶ | 5.9 × 10⁻³⁹ | 2⁻¹ × 2⁻¹⁰²² | 1.1 × 10⁻³⁰⁸
Largest denormal | (1 − 2⁻²³) × 2⁻¹²⁶ | 1.2 × 10⁻³⁸ | (1 − 2⁻⁵²) × 2⁻¹⁰²² | 2.2 × 10⁻³⁰⁸
Smallest normal | 1 × 2⁻¹²⁶ | 1.2 × 10⁻³⁸ | 1 × 2⁻¹⁰²² | 2.2 × 10⁻³⁰⁸
Middle normal | 1 × 2⁶³ | 9.2 × 10¹⁸ | 1 × 2⁵¹¹ | 6.7 × 10¹⁵³
Largest normal | (2 − 2⁻²³) × 2¹²⁷ | 3.4 × 10³⁸ | (2 − 2⁻⁵²) × 2¹⁰²³ | 1.8 × 10³⁰⁸
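On platforms where float and double are IEEE 754 single and double, the normalised limits in the table should agree with the constants in the standard header float.h. The sketch below checks this; the function name limits_match_table is made up for this page.

```c
#include <float.h>
#include <math.h>
#include <stdbool.h>

/* Hypothetical check: compares the float.h limits against the base 2
 * expressions for the smallest and largest normalised values.
 * FLT_EPSILON is 2^-23 and DBL_EPSILON is 2^-52 on IEEE 754 platforms. */
bool limits_match_table(void)
{
    return FLT_MIN == ldexpf(1.0f, -126)
        && FLT_MAX == ldexpf(2.0f - FLT_EPSILON, 127)
        && DBL_MIN == ldexp(1.0, -1022)
        && DBL_MAX == ldexp(2.0 - DBL_EPSILON, 1023);
}
```

Note that FLT_MIN and DBL_MIN are the smallest normalised values, not the smallest denormals.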
1) This applies to most architectures, although on PA-RISC and MIPS the sense of this bit is inverted — the IEEE 754-2008 standard clarified this such that the bit is set for quiet NaNs.
notes/ieee_754-1985.txt · Last modified: 2013/02/24 00:18 by andy