notes:ieee_754-1985

Differences

This shows you the differences between two versions of the page.

notes:ieee_754-1985 [2013/02/20 16:19]
andy
notes:ieee_754-1985 [2013/02/24 00:18]
andy [NaN]
Line 5:
 Briefly, floating point numbers are a way to represent a wide numeric range in a limited set of bits by storing a limited number of significant digits. As the absolute size of the number increases, the absolute precision decreases.
 
-The principle is similar to exponential notation for numbers (e.g. **3.523 x 10<sup>6</sup>**) except that IEEE floating point values use powers of 2 instead of 10.
+The principle is similar to exponential notation for numbers (e.g. $3.523 \times 10^6$) except that IEEE floating point values use powers of 2 instead of 10.
  
  
Line 45:
 The most common form of IEEE floating point numbers is the normalised form --- this is where the exponent has a value in the valid range (once the bias has been subtracted from the unsigned value stored). As explained above, the significand stores only the digits after the leading **1**, which is implicit.
 
-If the standard C library is available, the ''[[man>frexp|frexp()]]'' function normalises a floating point value such that the fractional part will be in the range **0.5 ≤ x < 1.0**. Multiplying this value by **2** and reducing the exponent by **1** yields a value in the desired range **1.0 ≤ x < 2.0**. At this point the leading digit can then be discarded as the implicit leading **1** (see the [[#Significand]] section for details).
+If the standard C library is available, the ''[[man>frexp|frexp()]]'' function normalises a floating point value such that the fractional part will be in the range $0.5 \le x < 1.0$. Multiplying this value by **2** and reducing the exponent by **1** yields a value in the desired range $1.0 \le x < 2.0$. At this point the leading digit can then be discarded as the implicit leading **1** (see the [[#Significand]] section for details).
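The ''frexp()'' adjustment described in this paragraph can be sketched as below. This is only an illustration for finite, non-zero inputs, not the page's own code; ''normalise_1_2()'' is a hypothetical name, and the ''assert()'' stands in for the special-value handling a real implementation would need.

```c
#include <assert.h>
#include <math.h>

/* Split a finite, non-zero x into m * 2^(*exponent) with 1.0 <= |m| < 2.0.
 * normalise_1_2() is a hypothetical name used for illustration only. */
double normalise_1_2(double x, int *exponent)
{
    int e;
    double m;

    /* Zero, NaN and infinities are deliberately not handled in this sketch. */
    assert(x != 0.0 && isfinite(x));

    m = frexp(x, &e);     /* 0.5 <= |m| < 1.0 and x == m * 2^e */
    *exponent = e - 1;    /* reduce the exponent by 1 ...      */
    return m * 2.0;       /* ... and double the fraction       */
}
```

For example, ''10.0'' is returned by ''frexp()'' as $0.625 \times 2^4$, which this sketch turns into $1.25 \times 2^3$.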
  
 If ''frexp()'' is available then it should be used, as it will likely use the underlying hardware representation to avoid expensive loops. However, a naive implementation can quite simply mimic its functionality --- the following version demonstrates the principle, but a production version would also need to check for special values (zero, NaN, infinities) as well as catching under- and overflows:
Line 90:
 ^ Significand | Zero |
 
-A value of exactly zero is represented by an exponent and significand of zero. The sign bit may be set or unset and IEEE 754 has the concept of both a positive and negative zero. For standard comparisons, however, these will both compare equal with zero, so the comparison **-0.0 < 0.0** yields **false**.
+A value of exactly zero is represented by an exponent and significand of zero. The sign bit may be set or unset and IEEE 754 has the concept of both a positive and negative zero. For standard comparisons, however, these will both compare equal with zero, so the comparison $-0.0 < 0.0$ yields **false**.
 
 To determine the sign of a floating point value including zero, the ''[[man>copysign|copysign()]]'' function can be used with a non-zero value, or the ''[[man>signbit|signbit()]]'' macro can be used more directly on some platforms (not available on WinCE, for example).
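The ''copysign()'' approach might look like the following minimal sketch, which assumes an IEEE 754 platform where the literal ''-0.0'' produces a negative zero. ''has_negative_sign()'' is a hypothetical helper name, not a library function.

```c
#include <math.h>
#include <stdbool.h>

/* True when x carries a negative sign, including negative zero.
 * has_negative_sign() is a hypothetical helper for illustration.
 * copysign() transfers the sign of x onto 1.0, giving +1.0 or -1.0,
 * which can then be compared against zero in the ordinary way. */
bool has_negative_sign(double x)
{
    return copysign(1.0, x) < 0.0;
}
```

With this helper, ''has_negative_sign(-0.0)'' is true and ''has_negative_sign(0.0)'' is false, even though ''-0.0 == 0.0'' holds and ''-0.0 < 0.0'' does not.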
Line 118:
   * Operations which are provided an existing NaN value as an argument.
   * Operations whose results are mathematically indeterminate - some examples are listed below:
-    * **0.0 / 0.0** and **±∞ / ±∞**
-    * **0.0 x ±∞**
-    * **∞ -- ∞** and equivalents
+    * $0.0 / 0.0$ and $\pm\infty / \pm\infty$
+    * $0.0 \times \pm\infty$
+    * $\infty - \infty$ and equivalents
   * Operations which yield complex results - some examples are listed below:
-    * **√--n**
-    * **log(--n)**
-    * **sin⁻¹(x)** or **cos⁻¹(x)** where **x < --1** or **x > 1**
+    * $\sqrt{-n}$
+    * $\log{-n}$
+    * $\sin^{-1}{x}$ or $\cos^{-1}{x}$ where $x < -1$ or $x > 1$
  
 ===== Limits =====
notes/ieee_754-1985.txt · Last modified: 2013/02/24 00:18 by andy