Floating Point Numbers

Floating-point numbers are used to represent real numbers and approximations to real numbers.

The set of real numbers consists of the union of the set of rational numbers and the set of irrational numbers.

The set of rational numbers is the set of numbers that have fractional representations.  Irrational numbers are those without fractional representations, whose decimal expansions neither terminate nor become periodic (e.g., \pi).

Not all real numbers (rational \cup irrational) have a finite decimal expansion.  This should be obvious, since by definition the decimal expansions of the irrationals don’t terminate.  In addition, some rational numbers don’t have a finite decimal expansion; for example, 1/3 = 0.333…

Similarly, not all real numbers have a finite binary expansion, so they must be approximated by computers.  Take, for example, 1/5.  Even though it can be represented by a finite decimal expansion, namely 0.20, it cannot be represented by a finite binary expansion – it can only be approximated.

Binary expansion    Rational    Decimal expansion
0.0                 0/2         0.0
0.01                1/4         0.25
0.010               2/8         0.25
0.0011              3/16        0.1875
0.00110             6/32        0.1875
0.001101            13/64       0.203125
0.0011010           26/128      0.203125
0.00110011          51/256      0.19921875

As shown above, we can only approximate 1/5; the value of the approximation differs from 0.20 by some \epsilon that depends on how many bits are used.

Notice that the binary expansion in the first column can be expressed in the form x \times 2^y, as shown in the second column.  For example, the last binary representation can be expressed as 51/256 = 51 \times 2^{-8}.

IEEE Floating-Point Number Standard

Representing large values directly in positional binary form b_m b_{m-1} b_{m-2} ... b_1 b_0.b_{-1} b_{-2} ... b_{-n+1} b_{-n} would be inefficient.  For example, the value 5 \times 2^{100} would require 103 bits (101 followed by 100 0’s).

The IEEE floating-point standard represents numbers of the form (-1)^S \times M \times 2^E by storing binary representations of S, M, and E.

  • S holds the sign bit. S = 1 for negative numbers, S = 0 for positive numbers, the value zero is handled separately.
  • The significand M is a fractional binary number between 0 and 1 - \epsilon or between 1 and 2 - \epsilon.
  • The exponent E is a whole number (possibly negative).

A single-precision float uses 32 bits.

  • frac: 23 bits [0 – 22] encodes M – but the interpretation depends on whether exp is all 0’s (see below)
  • exp: 8 bits [23 – 30] encodes E
  • s: 1 bit [31] encodes S

Double precision (double) uses 64 bits.

  • frac: 52 bits [0 – 51]
  • exp: 11 bits [52 – 62]
  • s: 1 bit [63]

Using this encoding we need to be able to represent

  • Values in some arbitrary range including both positive and negative numbers
  • Zero
  • Small numbers close to zero
  • +\infty and -\infty

Let’s assume we’re working with 64-bit double precision.  The scheme is similar for single precision.

The resulting value of a number represented using the IEEE floating-point format depends on whether exp is all 0’s, all 1’s, or neither.

Case 1: Normalized Form (exp is not all 0’s and not all 1’s)

exp does not encode E directly

  • Viewed as an unsigned field, exp can store values in the range [1,2046] in this case (0 and 2047 are reserved for the two cases below).
  • But we need a way to encode negative exponents (for very small numbers) as well.  To do this without a sign bit, we encode E with a Bias (1023).  That is, exp = E + 1023.  This allows the following range of exponents: [-1022,1023].
  • To recover the value of E stored in exp, subtract 1023: E = exp - 1023.

frac does not encode M directly

  • M = 1 + frac = 1.f_{n-1}f_{n-2}...f_0, so 1 \leq M < 2.  (The leading 1 is implied and not stored.)
  • This implies we cannot encode 0 in this form.

Case 2: Denormalized Form (exp is all 0’s)

  • E = 1 - Bias = 1 - 1023 = -1022
  • M = frac = 0.f_{n-1}f_{n-2}...f_0

This form allows us to represent 0 since 0 \leq M < 1.  The value +0 is represented when:

  • s = 0
  • exp is all 0’s
  • frac is all 0’s

When s = 1, exp is all 0’s, and frac is all 0’s, the value is -0.  In the IEEE floating-point standard, positive and negative zero are the same in some ways (they compare as equal) and different in others (they have distinct bit patterns).

Denormalized form also allows us to represent very small numbers that are close to 0.

Case 3: Special Form (exp is all 1’s)

When frac is all 0’s the resulting value is

  • +\infty when s is 0, i.e. the bit pattern 0x7FF0000000000000
  • -\infty when s is 1, i.e. the bit pattern 0xFFF0000000000000

When frac is nonzero, the resulting value is NaN (not a number).  NaN results from operations with no well-defined value, such as \infty - \infty.

Rounding

The IEEE standard defines four rounding modes:

  • Round-up
  • Round-down
  • Round-to-zero
  • Round-to-even (aka round-to-nearest-even), the default mode
    • Rounds to the nearest representable value when the value is not exactly halfway between two candidates.
    • For values exactly halfway between two candidates, it rounds so that the least significant digit of the result is even.  For example, it rounds both 1.5 and 2.5 to 2.0.
    • Avoids statistical bias (e.g., in the mean) most of the time.

Floating-Point Operations

Due to rounding, floating-point addition and multiplication are not associative.

For example, in single precision,

  • (3.14+1e10)-1e10 evaluates to 0
  • 3.14+(1e10-1e10) evaluates to 3.14

and

  • (1e20*1e20)*1e-20 evaluates to +\infty
  • 1e20*(1e20*1e-20) evaluates to 1e20

Floating-Point in C

The original C standard does not define infinity or NaN, but gcc defines them in math.h (C99 standardized the INFINITY and NAN macros there).

When casting values between int, float, and double formats, the values are changed as follows:

  • int to float
    • the value cannot overflow but can be rounded
  • int or float to double
    • the exact value can be preserved because double has greater range and precision
  • double to float
    • value can overflow to +\infty or -\infty since range is smaller
    • may be rounded since precision is smaller
  • float or double to int
    • value will be rounded toward 0
    • value may overflow.
      • e.g. (int) +1e10 yields -2147483648

© 2017 – 2018, Eric. All rights reserved.