notes:ieee_754-1985

The IEEE 754 standard defines a representation of floating point numbers and rules for manipulating them. The standard was published in 1985 and revised in 2008 - this page currently describes only the original 1985 standard.

Briefly, floating point numbers are a way to represent a wide numeric range in a limited set of bits by storing a limited number of significant digits. As the absolute size of the number increases, the absolute precision decreases.

The principle is similar to exponential notation for numbers (e.g. **3.523 × 10^6**) except that IEEE floating point values use powers of 2 instead of 10.

The standard defines two precisions: **single** is a **32-bit** representation and **double** is a **64-bit** representation. These typically correspond to the C types `float` and `double` on platforms where the underlying hardware uses IEEE 754 (which is to say most current hardware architectures).
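Whether the C types really have these formats on a given platform can be checked roughly from the constants in `<float.h>`. The sketch below is a heuristic only, and the function names are made up for this illustration; it compares the radix, precision and exponent limits with those of the IEEE formats, but does not inspect the bit layout itself.

```c
#include <float.h>

/* Hypothetical check: does float have the parameters of an IEEE 754 single?
 * FLT_MANT_DIG counts the implicit leading bit, hence 24 rather than 23. */
static int float_matches_ieee_single(void)
{
    return FLT_RADIX == 2 && FLT_MANT_DIG == 24 &&
           FLT_MIN_EXP == -125 && FLT_MAX_EXP == 128;
}

/* Hypothetical check: does double have the parameters of an IEEE 754 double? */
static int double_matches_ieee_double(void)
{
    return FLT_RADIX == 2 && DBL_MANT_DIG == 53 &&
           DBL_MIN_EXP == -1021 && DBL_MAX_EXP == 1024;
}
```

The exponent limits in `<float.h>` are one higher than the IEEE ranges because C describes values with a fraction in the range 0.5 ≤ x < 1.0 rather than 1.0 ≤ x < 2.0.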

Each value is split into three chunks of bits, with byte boundaries ignored:

| Section | Single | Double | Meaning |
|---|---|---|---|
| Sign | 1 bit | 1 bit | Set if the number is negative |
| Exponent | 8 bits | 11 bits | The power of 2 by which the significand is multiplied |
| Significand | 23 bits | 52 bits | The significant digits of the number |

Note that the **significand** is also known as the **mantissa** in some texts, although this is discouraged by the IEEE 754 standards committee and others because of confusion with other uses of the term.
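The three fields can be pulled out of a single-precision value with shifts and masks. This is a minimal sketch, assuming `float` is a 32-bit IEEE 754 single; the type `float_fields` and the function `split_float` are names invented for this example.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical container for the three raw fields of a single. */
struct float_fields {
    uint32_t sign;        /* 1 bit */
    uint32_t exponent;    /* 8 bits, still in biased form */
    uint32_t significand; /* 23 bits, as stored */
};

/* Assumes float is a 32-bit IEEE 754 single. */
static struct float_fields split_float(float value)
{
    uint32_t bits;
    struct float_fields f;

    /* Copy the bit pattern; memcpy avoids pointer-aliasing problems. */
    memcpy(&bits, &value, sizeof bits);

    f.sign        = bits >> 31;
    f.exponent    = (bits >> 23) & 0xFFu;
    f.significand = bits & 0x7FFFFFu;
    return f;
}
```

For example, `split_float(1.0f)` yields sign 0, exponent 127 and significand 0 under these assumptions.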

The sign bit is **0 for positive** numbers and **1 for negative** numbers. Note that the sign bit is valid in most cases even for special values. For example, the standard differentiates between positive and negative zero.
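The distinction between the two zeros is easy to miss because they compare equal. The sketch below, with function names made up for this illustration, uses the C99 `signbit` macro to read the sign bit directly.

```c
#include <math.h>

/* Hypothetical helper: 1 if the sign bit of value is set, else 0.
 * signbit works for zeros and other special values too. */
static int sign_bit_set(double value)
{
    return signbit(value) != 0;
}

/* Hypothetical helper: 1 if the two values compare equal with ==. */
static int compare_equal(double a, double b)
{
    return a == b;
}
```

On an IEEE 754 platform, `compare_equal(0.0, -0.0)` is 1 even though `sign_bit_set(-0.0)` is 1 and `sign_bit_set(0.0)` is 0.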

The exponent indicates the power of 2 by which the significand is multiplied. It is stored in **biased form**, which is an easy way to store a signed value in an unsigned field by simply adding a fixed value. The bias, which is added to the actual exponent to obtain the unsigned value stored, and the range of valid exponents are:

| Precision | Bias | Range of valid exponents |
|---|---|---|
| Single | 127 | -126 to 127 |
| Double | 1023 | -1022 to 1023 |

Note that the range is missing the values at each end. This is because a stored exponent of zero and a stored exponent with all bits set both have special meanings.
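Removing the bias is a single subtraction once the two reserved values have been excluded. The following is a sketch for single precision only; the function name and its return convention are assumptions made for this example.

```c
#define SINGLE_BIAS 127

/* Hypothetical helper: converts a stored 8-bit exponent field to the
 * actual power of 2. Returns 1 and writes *actual for the ordinary
 * range 1..254; returns 0 for the reserved values 0 and 255. */
static int unbias_single_exponent(unsigned stored, int *actual)
{
    if (stored == 0 || stored >= 255)
        return 0; /* special meaning, not an ordinary exponent */

    *actual = (int)stored - SINGLE_BIAS;
    return 1;
}
```

A stored value of 1 gives -126 and a stored value of 254 gives 127, matching the range in the table for single precision.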

The significand field stores the significant binary digits. To save space, there is assumed to be a leading **1** digit, which is not stored. For example, the significand **1.0100111…** is stored as **0100111…**. This form is said to be **normalised**.

Note that this is a simplification as there is also a **denormalised** form for values near zero, which is described below.

The most common form of IEEE floating point numbers is the normalised form, in which the exponent has a value in the valid range (once the bias has been subtracted from the unsigned value stored).
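Putting the pieces together, a normalised single can be rebuilt from its three fields by restoring the implicit leading 1, removing the bias and applying the sign. This is a sketch under the assumption that the fields describe a normalised value (stored exponent 1 to 254); `decode_normalised` is a name invented for this example.

```c
#include <math.h>
#include <stdint.h>

/* Hypothetical helper: rebuilds a normalised single from its raw fields.
 * Assumes 1 <= exponent <= 254 and significand < 2^23. */
static float decode_normalised(uint32_t sign, uint32_t exponent,
                               uint32_t significand)
{
    /* Restore the implicit leading 1 as bit 23, giving 1.f scaled by 2^23. */
    uint32_t full = (UINT32_C(1) << 23) | significand;

    /* Remove the bias (127) and undo the 2^23 scaling in one step. */
    float magnitude = ldexpf((float)full, (int)exponent - 127 - 23);

    return sign ? -magnitude : magnitude;
}
```

For instance, sign 1, exponent 128 and significand `0x200000` decode to -1.25 × 2^1, which is `-2.5f`.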

If the standard C library is available, the `frexp` function splits a finite, non-zero floating point value into an exponent and a fractional part whose magnitude is in the range **0.5 ≤ |x| < 1.0**. Multiplying this value by **2** and reducing the exponent by **1** yields a magnitude in the desired range **1.0 ≤ |x| < 2.0**. At this point the leading digit can be discarded as the implicit leading **1** (see the Significand section for details).
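The `frexp` approach can be sketched as follows for single precision, using the `float` variant `frexpf`. The function name `encode_normalised` is made up for this example, and the code assumes the input is finite, non-zero and within the normalised range; zeros, denormalised values and other special values are not handled.

```c
#include <math.h>
#include <stdint.h>

/* Hypothetical helper: builds the 32-bit pattern of a normalised single.
 * Assumes value is finite, non-zero and not denormalised. */
static uint32_t encode_normalised(float value)
{
    int exp2;
    float fraction = frexpf(fabsf(value), &exp2); /* 0.5 <= fraction < 1.0 */

    fraction *= 2.0f; /* now 1.0 <= fraction < 2.0 */
    exp2 -= 1;        /* compensate for the doubling */

    /* Drop the implicit leading 1 and scale the rest to a 23-bit integer. */
    uint32_t significand = (uint32_t)((fraction - 1.0f) * 8388608.0f); /* 2^23 */
    uint32_t exponent = (uint32_t)(exp2 + 127); /* add the single-precision bias */
    uint32_t sign = signbit(value) ? 1u : 0u;

    return (sign << 31) | (exponent << 23) | significand;
}
```

On a platform where `float` is an IEEE 754 single, the result should equal the bit pattern of the input itself; for example `encode_normalised(1.0f)` gives `0x3F800000`.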

notes/ieee_754-1985.1360856577.txt.gz · Last modified: 2013/02/14 15:42 by andy