Similar to scientific notation, the floating point notation represents real numbers without requiring an actual binary point. In particular, the IEEE standard for floating point numbers sets out the following format:

(−1)^sign × 1.mantissa × 2^(exponent − bias)

Where the leading “1” is not required to be encoded because it’s implied by the standard.

single precision

the float datatype (available in most languages, e.g. Java and C) specifies a 32-bit sequence, allocated as follows.

  • 1 bit for the sign (positive or negative)
  • 8 bits for the range (exponent field)
  • 23 bits for precision (the fraction or mantissa field)
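the bit layout above can be inspected directly. a minimal sketch in Python, using the standard `struct` module to reinterpret a float’s 32-bit pattern (the helper name `float32_fields` is my own):

```python
import struct

def float32_fields(x):
    """Split a float's single-precision encoding into its three fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits
    fraction = bits & 0x7FFFFF       # 23 bits
    return sign, exponent, fraction

# 1.0 is stored as sign 0, exponent field 127 (the bias), fraction 0
print(float32_fields(1.0))   # → (0, 127, 0)
print(float32_fields(-2.0))  # → (1, 128, 0)
```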

double precision

the double datatype specifies a 64-bit sequence, allocated as follows.

  • 1 bit for the sign
  • 11 bits for the range (the exponent field, now with a bias of 1023)
  • 52 bits for the mantissa field
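the same field-splitting trick works for doubles, just with the wider layout. a sketch (again, `float64_fields` is a name of my own choosing):

```python
import struct

def float64_fields(x):
    """Split a float's double-precision encoding into its three fields."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                  # 1 bit
    exponent = (bits >> 52) & 0x7FF    # 11 bits
    fraction = bits & ((1 << 52) - 1)  # 52 bits
    return sign, exponent, fraction

# 1.0 is stored as sign 0, exponent field 1023 (the double bias), fraction 0
print(float64_fields(1.0))  # → (0, 1023, 0)
```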

comparison to scientific notation

say we have a number in scientific notation, written as a × 10^n.

the first bit is for the sign, self-explanatory. instead of a decimal exponent (the n in a × 10^n), we instead have 8 binary digits.

the actual exponent being represented is the unsigned number stored in the exponent field, minus the bias of 127. to encode an exponent, we go the other way: add 127, then convert the result to binary. this becomes our exponent field.
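the bias arithmetic is small enough to sketch directly (the helper name is mine):

```python
BIAS = 127  # single-precision exponent bias

def exponent_field(actual_exponent):
    """Encode an actual exponent as the 8-bit biased field."""
    return format(actual_exponent + BIAS, "08b")

print(exponent_field(3))   # 3 + 127 = 130 → "10000010"
print(exponent_field(-1))  # -1 + 127 = 126 → "01111110"
print(exponent_field(0))   # 0 + 127 = 127 → "01111111"
```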

denormalized exponent (exponent field all zeroes)

zero equals 0.000… and so on, but the floating point format specifies an implied leading 1. we don’t have a leading 1 in this case, so what do we do?

instead, we use a denormalized exponent, which means the exponent field equals all zeroes (00000000).

the IEEE standard tells us to interpret this exponent field as 2^−126. that is, we don’t do any calculations and don’t subtract the bias; an all-zero exponent field simply signifies 2^−126, with no implied leading 1. we multiply this with our mantissa to get the number we’re representing.

it allows us to represent very small numbers, but it results in positive and negative zeroes. this is fine in practice because they compare equal.
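both claims can be checked by hand. a sketch: the smallest positive subnormal (exponent field all zeroes, fraction 1) is 2^−126 × 2^−23 = 2^−149, and the two zeroes have different bit patterns yet compare equal:

```python
import struct

# smallest positive subnormal in single precision:
# exponent field all zeroes, fraction field = 1
(tiny,) = struct.unpack(">f", struct.pack(">I", 0x00000001))
print(tiny == 2.0 ** -149)           # True: 2^-126 × 2^-23

# negative zero has its sign bit set...
print(struct.pack(">f", -0.0).hex())  # '80000000'
print(struct.pack(">f", 0.0).hex())   # '00000000'
# ...but the two zeroes still compare equal
print(0.0 == -0.0)                    # True
```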

other non-standard exponents (exponent field all ones)

other weird floating point numbers exist, like when the exponent field is all ones (11111111). when the mantissa is also set to zero, this encodes positive/negative infinity.

as with ordinary numbers, the sign bit signifies positive or negative.

we can also encode NaN, or “not a number”. this occurs when the mantissa is set to a non-zero number while the exponent field is still all ones. for example, 0/0 produces NaN.
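a quick sketch showing both special encodings, reusing the bit-pattern trick from earlier (`bits32` is a helper name of my own):

```python
import math
import struct

def bits32(x):
    """Return a float's 32-bit pattern as a binary string."""
    (b,) = struct.unpack(">I", struct.pack(">f", x))
    return f"{b:032b}"

# infinity: sign 0, exponent field all ones, fraction all zeroes
print(bits32(math.inf))   # 01111111100000000000000000000000

# NaN: exponent field all ones, fraction non-zero
print(bits32(math.nan))

# NaN compares unequal to everything, including itself
print(math.nan == math.nan)  # False
```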

conversion from decimal

there are two steps: we first convert the decimal number to binary fixed point. then, we convert that to the floating point standard.

Example

say we wish to convert 5.375. in binary, this is equivalent to 101.011.

next, we convert to floating point.

we first normalize it by shifting the point two places to the left, so that there’s only one digit to the left of the point. remember, this is a binary point, not a decimal point, so everything is in powers of 2.

since the number is positive, the sign bit is 0. to calculate the exponent field, we perform 2 + 127 = 129. converting this to binary gives us 10000001.

finally, we have 23 bits for the fraction field. since the fraction currently only uses 5 bits, we pad the end out with zeroes.

here, the padded bits would be considered the fraction field. combining the sign bit, exponent field, and fraction field into a 32-bit sequence, we have the floating point representation of our number.
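the whole conversion can be verified by assembling the three fields and letting the machine decode them. a sketch, using 5.375 = 101.011 in binary = 1.01011 × 2² as an illustrative value, so the fields are sign 0, exponent 2 + 127 = 129, and fraction 01011 padded to 23 bits (`encode_float32` is a helper name of my own):

```python
import struct

def encode_float32(sign, exponent_field, fraction_bits):
    """Assemble a 32-bit pattern from its three fields and decode it."""
    bits = (sign << 31) | (exponent_field << 23) | fraction_bits
    (x,) = struct.unpack(">f", struct.pack(">I", bits))
    return x

# fraction 01011 shifted left 18 places = 01011 padded to 23 bits
x = encode_float32(0, 0b10000001, 0b01011 << 18)
print(x)  # → 5.375
```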

limitations

IEEE single/double precision has been the dominant standard since 1985. newer fields like machine learning, however, often trade precision for computational efficiency, using smaller floating point formats.

see The bfloat16 numerical format and Half-precision floating-point format.