notes:ieee_754-1985

This shows you the differences between two versions of the page.


notes:ieee_754-1985 [2013/02/20 12:59] andy |
notes:ieee_754-1985 [2013/02/24 00:18] andy [NaN] |
||
---|---|---|---|

Line 5: | Line 5: | ||

Briefly, floating point numbers are a way to represent a wide numeric range in a limited set of bits by storing a limited number of significant digits. As the absolute size of the number increases, the absolute precision decreases. | Briefly, floating point numbers are a way to represent a wide numeric range in a limited set of bits by storing a limited number of significant digits. As the absolute size of the number increases, the absolute precision decreases. | ||

- | The principle is similar to exponential notation for numbers (e.g. **3.523 x 10<sup>6</sup>**) except that IEEE floating point values use powers of 2 instead of 10. | + | The principle is similar to exponential notation for numbers (e.g. $3.523 \times 10^6$) except that IEEE floating point values use powers of 2 instead of 10. |

Line 45: | Line 45: | ||

The most common form of IEEE floating point numbers is the normalised form --- this is where the exponent has a value in the valid range (once the bias has been subtracted from the unsigned value stored). As explained above, the significand stores only the digits after the leading **1**, which is implicit. | The most common form of IEEE floating point numbers is the normalised form --- this is where the exponent has a value in the valid range (once the bias has been subtracted from the unsigned value stored). As explained above, the significand stores only the digits after the leading **1**, which is implicit. | ||

- | If the standard C library is available, the ''[[man>frexp|frexp()]]'' function normalises a floating point value such that the fractional part will be in the range **0.5 ≤ x < 1.0**. Multiplying this value by **2** and reducing the exponent by **1** yields a value in the desired range **1.0 ≤ x < 2.0**. At this point the leading digit can then be discarded as the implicit leading **1** (see the [[#Significand]] section for details). | + | If the standard C library is available, the ''[[man>frexp|frexp()]]'' function normalises a floating point value such that the fractional part will be in the range $0.5 \le x < 1.0$. Multiplying this value by **2** and reducing the exponent by **1** yields a value in the desired range $1.0 \le x < 2.0$. At this point the leading digit can then be discarded as the implicit leading **1** (see the [[#Significand]] section for details). |
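The adjustment described above can be sketched in C as follows. This is a minimal illustration rather than the page's own listing: the function name ''normalise_ieee'' is hypothetical, and it assumes a finite, non-zero input (zero, NaN and infinities would need separate handling).

```c
#include <math.h>

/* Hypothetical helper (not from the original page): split a finite,
 * non-zero value into a significand whose magnitude lies in [1.0, 2.0)
 * and a power-of-two exponent, so that
 * value == significand * 2^exponent. */
double normalise_ieee(double value, int *exponent)
{
    int e;
    double fraction = frexp(value, &e);  /* 0.5 <= |fraction| < 1.0 */

    *exponent = e - 1;                   /* reduce the exponent by 1 ... */
    return fraction * 2.0;               /* ... and double the fraction  */
}
```

For example, passing 10.0 should give a significand of 1.25 and an exponent of 3, since 1.25 x 2^3 = 10. For negative inputs the returned significand is negative, so the sign would be extracted separately before discarding the leading 1.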

If ''frexp()'' is available then it should be used, as it will likely use the underlying hardware representation to avoid expensive loops. However, a naive implementation can quite simply mimic its functionality --- the following version demonstrates the principle, but a production version would also need to check for special values (zero, NaN, infinities) as well as catching under- and overflows: | If ''frexp()'' is available then it should be used, as it will likely use the underlying hardware representation to avoid expensive loops. However, a naive implementation can quite simply mimic its functionality --- the following version demonstrates the principle, but a production version would also need to check for special values (zero, NaN, infinities) as well as catching under- and overflows: | ||

Line 90: | Line 90: | ||

^ Significand | Zero | | ^ Significand | Zero | | ||

- | A value of exactly zero is represented by an exponent and significand of zero. The sign bit may be set or unset and IEEE 754 has the concept of both a positive and negative zero. For standard comparisons, however, these will both compare equal with zero, so the comparison **-0.0 < 0.0** yields **false**. | + | A value of exactly zero is represented by an exponent and significand of zero. The sign bit may be set or unset and IEEE 754 has the concept of both a positive and negative zero. For standard comparisons, however, these will both compare equal with zero, so the comparison $-0.0 < 0.0$ yields **false**. |

To determine the sign of a floating point value including zero, the ''[[man>copysign|copysign()]]'' function can be used with a non-zero value, or the ''[[man>signbit|signbit()]]'' macro can be used more directly on some platforms (not available on WinCE, for example). | To determine the sign of a floating point value including zero, the ''[[man>copysign|copysign()]]'' function can be used with a non-zero value, or the ''[[man>signbit|signbit()]]'' macro can be used more directly on some platforms (not available on WinCE, for example). | ||
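As a brief illustration of the ''copysign()'' approach, the sketch below transfers the sign of the argument onto a non-zero magnitude, which can then be compared against zero in the usual way. The helper name ''is_negative'' is an assumption made for this example.

```c
#include <math.h>

/* Hypothetical helper: returns 1 if the sign bit of x is set
 * (including for -0.0), otherwise 0. copysign() copies the sign of x
 * onto the non-zero magnitude 1.0, so an ordinary comparison works. */
int is_negative(double x)
{
    return copysign(1.0, x) < 0.0;
}
```

With this helper, ''is_negative(-0.0)'' should return 1 even though ''-0.0 < 0.0'' is false. Where the ''signbit()'' macro is available, ''signbit(x) != 0'' expresses the same test more directly.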

Line 112: | Line 112: | ||

Since any non-zero significand value is permitted, this allows a range of values to be specified. The sign bit may also be set or unset, which could be used to differentiate between different types of NaN. The main distinction is between a **quiet NaN** which has the most-significant bit of the significand set and a **signalling NaN** which has the MSB of the significand clear((This applies to most architectures, although on PA-RISC and MIPS the sense of this bit is inverted --- the IEEE 754-2008 standard clarified this such that the bit is set for quiet NaNs.)) (although the overall value must still be non-zero). | Since any non-zero significand value is permitted, this allows a range of values to be specified. The sign bit may also be set or unset, which could be used to differentiate between different types of NaN. The main distinction is between a **quiet NaN** which has the most-significant bit of the significand set and a **signalling NaN** which has the MSB of the significand clear((This applies to most architectures, although on PA-RISC and MIPS the sense of this bit is inverted --- the IEEE 754-2008 standard clarified this such that the bit is set for quiet NaNs.)) (although the overall value must still be non-zero). | ||

+ | The intention of the **signalling NaN** is that using it as an operand raises an invalid operation exception, after which the operation yields a **quiet NaN** if a result is required. This means that each error is only raised once. However, support for this varies between platforms. | ||
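The quiet/signalling distinction can be inspected by looking at the raw bits of a double. The sketch below is illustrative only: the helper names are hypothetical, and it assumes the common convention where a set most-significant significand bit (bit 51 of a 64-bit double) means quiet, which does not hold on the architectures with the inverted sense mentioned above.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy the bit pattern of a double into an integer. */
uint64_t double_bits(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);   /* avoids aliasing problems */
    return bits;
}

/* NaN: exponent field all ones (0x7FF) and a non-zero significand. */
int is_nan_bits(uint64_t bits)
{
    uint64_t exponent    = (bits >> 52) & 0x7FFu;
    uint64_t significand = bits & 0x000FFFFFFFFFFFFFull;
    return exponent == 0x7FFu && significand != 0;
}

/* Assumed convention: MSB of the 52-bit significand set means quiet. */
int quiet_bit_set(uint64_t bits)
{
    return (int)((bits >> 51) & 1u);
}
```

On a platform following the assumed convention, ''double_bits(nan(""))'' would be expected to satisfy both ''is_nan_bits()'' and ''quiet_bit_set()'', whereas a pattern such as ''0x7FF0000000000001'' is a NaN with the quiet bit clear.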

+ | |||

+ | There are three types of operations which can produce a NaN value: | ||

+ | |||

+ | * Operations which are provided an existing NaN value as an argument. | ||

+ | * Operations whose results are mathematically indeterminate - some examples are listed below: | ||

+ | * $0.0 / 0.0$ and $\pm\infty / \pm\infty$ | ||

+ | * $0.0 \times \pm\infty$ | ||

+ | * $\infty - \infty$ and equivalents | ||

+ | * Operations which yield complex results - some examples are listed below: | ||

+ | * $\sqrt{-n}$ | ||

+ | * $\log{-n}$ | ||

+ | * $\sin^{-1}{x}$ or $\cos^{-1}{x}$ where $x < -1$ or $x > 1$ | ||

+ | |||

+ | ===== Limits ===== | ||

+ | |||

+ | Values in the table below apply regardless of sign, since the sign bit is an independent quantity. Base 10 values are all rounded to one decimal place. | ||

+ | |||

+ | |||

+ | ^ Limit ^ Single precision ^^ Double precision ^^ | ||

+ | ^::: ^ Base 2 ^ Base 10 ^ Base 2 ^ Base 10 ^ | ||

+ | ^ Smallest denormal | 2⁻²³ x 2⁻¹²⁶ | 1.4 x 10⁻⁴⁵ | 2⁻⁵² x 2⁻¹⁰²² | 4.9 x 10⁻³²⁴ | | ||

+ | ^ Middle denormal | 2⁻¹ x 2⁻¹²⁶ | 5.9 x 10⁻³⁹ | 2⁻¹ x 2⁻¹⁰²² | 1.1 x 10⁻³⁰⁸ | | ||

+ | ^ Largest denormal | (1--2⁻²³) x 2⁻¹²⁶ | 1.2 x 10⁻³⁸ | (1--2⁻⁵²) x 2⁻¹⁰²² | 2.2 x 10⁻³⁰⁸ | | ||

+ | ^ Smallest normal | 1 x 2⁻¹²⁶ | 1.2 x 10⁻³⁸ | 1 x 2⁻¹⁰²² | 2.2 x 10⁻³⁰⁸ | | ||

+ | ^ Middle normal | 1 x 2⁶³ | 9.2 x 10¹⁸ | 1 x 2⁵¹¹ | 6.7 x 10¹⁵³ | | ||

+ | ^ Largest normal | (2--2⁻²³) x 2¹²⁷ | 3.4 x 10³⁸ | (2--2⁻⁵²) x 2¹⁰²³ | 1.8 x 10³⁰⁸ | | ||
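Some of the table entries can be cross-checked against the constants in ''<float.h>'' by rebuilding them from their base 2 forms with ''ldexp()''. The sketch below assumes that ''float'' and ''double'' are the IEEE 754 single and double precision formats; the function names are hypothetical.

```c
#include <float.h>
#include <math.h>

/* Single precision, built from the base 2 column of the table */
float single_smallest_denormal(void)
{
    return ldexpf(1.0f, -23 - 126);                     /* 2^-23 x 2^-126 */
}

float single_smallest_normal(void)
{
    return ldexpf(1.0f, -126);                          /* 1 x 2^-126 */
}

float single_largest_normal(void)
{
    return ldexpf(2.0f - ldexpf(1.0f, -23), 127);       /* (2 - 2^-23) x 2^127 */
}

/* Double precision equivalents */
double double_smallest_denormal(void)
{
    return ldexp(1.0, -52 - 1022);                      /* 2^-52 x 2^-1022 */
}

double double_smallest_normal(void)
{
    return ldexp(1.0, -1022);                           /* 1 x 2^-1022 */
}

double double_largest_normal(void)
{
    return ldexp(2.0 - ldexp(1.0, -52), 1023);          /* (2 - 2^-52) x 2^1023 */
}
```

Under the stated assumption, the smallest and largest normal values should compare equal to ''FLT_MIN'', ''FLT_MAX'', ''DBL_MIN'' and ''DBL_MAX'' respectively, and the smallest denormals should be non-zero but below the corresponding ''*_MIN'' constant.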

notes/ieee_754-1985.txt · Last modified: 2013/02/24 00:18 by andy