notes:ieee_754-1985

This shows you the differences between two versions of the page.


notes:ieee_754-1985 [2013/02/20 16:19] andy |
notes:ieee_754-1985 [2013/02/24 00:12] andy [Zero] |
||
---|---|---|---|

Line 5: | Line 5: | ||

Briefly, floating point numbers are a way to represent a wide numeric range in a limited set of bits by storing a limited number of significant digits. As the absolute size of the number increases, the absolute precision decreases. | Briefly, floating point numbers are a way to represent a wide numeric range in a limited set of bits by storing a limited number of significant digits. As the absolute size of the number increases, the absolute precision decreases. | ||

- | The principle is similar to exponential notation for numbers (e.g. **3.523 x 10<sup>6</sup>**) except that IEEE floating point values use powers of 2 instead of 10. | + | The principle is similar to exponential notation for numbers (e.g. $3.523 \times 10^6$) except that IEEE floating point values use powers of 2 instead of 10. |

Line 45: | Line 45: | ||

The most common form of IEEE floating point numbers is the normalised form --- this is where the exponent has a value in the valid range (once the bias has been subtracted from the unsigned value stored). As explained above, the significand stores only the digits after the leading **1**, which is implicit. | The most common form of IEEE floating point numbers is the normalised form --- this is where the exponent has a value in the valid range (once the bias has been subtracted from the unsigned value stored). As explained above, the significand stores only the digits after the leading **1**, which is implicit. | ||

- | If the standard C library is available, the ''[[man>frexp|frexp()]]'' function normalises a floating point value such that the fractional part will be in the range **0.5 ≤ x < 1.0**. Multiplying this value by **2** and reducing the exponent by **1** yields a value in the desired range **1.0 ≤ x < 2.0**. At this point the leading digit can then be discarded as the implicit leading **1** (see the [[#Significand]] section for details). | + | If the standard C library is available, the ''[[man>frexp|frexp()]]'' function normalises a floating point value such that the fractional part will be in the range $0.5 \le x < 1.0$. Multiplying this value by **2** and reducing the exponent by **1** yields a value in the desired range $1.0 \le x < 2.0$. At this point the leading digit can then be discarded as the implicit leading **1** (see the [[#Significand]] section for details). |
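The two steps described above (double the fraction, reduce the exponent by one) might be sketched as follows. This is only an illustration under stated assumptions: the helper name `normalise_ieee` is hypothetical, and the input is assumed to be finite and non-zero, since special values need separate handling.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical helper (not from the page): split a finite, non-zero value
 * into a significand whose magnitude lies in [1.0, 2.0) and a power-of-two
 * exponent, so that value == significand * 2^exponent. */
double normalise_ieee(double value, int *exponent)
{
    int e;
    double fraction;

    /* Assumption: zero, NaN and infinities are handled by the caller. */
    assert(isfinite(value) && value != 0.0);

    fraction = frexp(value, &e);  /* 0.5 <= |fraction| < 1.0 */
    *exponent = e - 1;            /* reduce the exponent by one ...   */
    return fraction * 2.0;        /* ... and double the fraction      */
}
```

For example, `frexp(10.0, &e)` yields 0.625 with an exponent of 4, so the helper returns 1.25 with an exponent of 3, and 1.25 × 2<sup>3</sup> = 10.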

If ''frexp()'' is available then it should be used, as it will likely use the underlying hardware representation to avoid expensive loops. However, a naive implementation can quite simply mimic its functionality --- the following version demonstrates the principle, but a production version would also need to check for special values (zero, NaN, infinities) as well as catching under- and overflows: | If ''frexp()'' is available then it should be used, as it will likely use the underlying hardware representation to avoid expensive loops. However, a naive implementation can quite simply mimic its functionality --- the following version demonstrates the principle, but a production version would also need to check for special values (zero, NaN, infinities) as well as catching under- and overflows: | ||

Line 90: | Line 90: | ||

^ Significand | Zero | | ^ Significand | Zero | | ||

- | A value of exactly zero is represented by an exponent and significand of zero. The sign bit may be set or unset and IEEE 754 has the concept of both a positive and negative zero. For standard comparisons, however, these will both compare equal with zero, so the comparison **-0.0 < 0.0** yields **false**. | + | A value of exactly zero is represented by an exponent and significand of zero. The sign bit may be set or unset and IEEE 754 has the concept of both a positive and negative zero. For standard comparisons, however, these will both compare equal with zero, so the comparison $-0.0 < 0.0$ yields **false**. |

To determine the sign of a floating point value including zero, the ''[[man>copysign|copysign()]]'' function can be used with a non-zero value, or the ''[[man>signbit|signbit()]]'' macro can be used more directly on some platforms (not available on WinCE, for example). | To determine the sign of a floating point value including zero, the ''[[man>copysign|copysign()]]'' function can be used with a non-zero value, or the ''[[man>signbit|signbit()]]'' macro can be used more directly on some platforms (not available on WinCE, for example). |
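A minimal sketch of both approaches, assuming a C99 `<math.h>`: the helper name `is_negative` is hypothetical, and the `#ifdef` check relies on `signbit` being provided as a macro, as it is in C99.

```c
#include <math.h>

/* Hypothetical helper: returns 1 if the sign bit of x is set
 * (including for -0.0), and 0 otherwise. */
int is_negative(double x)
{
#ifdef signbit
    /* signbit() inspects the sign bit directly. */
    return signbit(x) != 0;
#else
    /* Fallback: copy the sign of x onto a non-zero value and test that. */
    return copysign(1.0, x) < 0.0;
#endif
}
```

With this helper, `is_negative(-0.0)` returns 1 even though `-0.0 < 0.0` is false and `-0.0 == 0.0` is true.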

notes/ieee_754-1985.txt · Last modified: 2013/02/24 00:18 by andy