Similar to scientific notation, the floating point notation represents real numbers without requiring an actual binary point. In particular, the IEEE standard for floating point numbers sets out the following format:

(−1)^sign × 1.mantissa × 2^(exponent − bias)

Where the leading “1” is not required to be encoded because it’s implied by the standard.

single precision

the float datatype (available in most languages, e.g. Java and C) specifies a 32-bit sequence, allocated as follows.

  • 1 bit for the sign (positive or negative)
  • 8 bits for the range (exponent field)
  • 23 bits for precision (the fraction or mantissa field)
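the bit layout above can be inspected directly. a minimal sketch in Python, using the standard `struct` module to reinterpret a float’s 32-bit pattern (the helper name `float32_fields` is my own):

```python
import struct

def float32_fields(x):
    """Split a float's single-precision encoding into its three fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits
    fraction = bits & 0x7FFFFF       # 23 bits
    return sign, exponent, fraction

# 1.0 is stored as sign 0, exponent field 127 (the bias), fraction 0
print(float32_fields(1.0))   # → (0, 127, 0)
print(float32_fields(-2.0))  # → (1, 128, 0)
```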

double precision

the double datatype specifies a 64-bit sequence, allocated as follows.

  • 1 bit for the sign
  • 11 bits for the range (the exponent field, now with a bias of 1023)
  • 52 bits for the mantissa field
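the same field-splitting trick works for doubles, just with the wider layout. a sketch (again, `float64_fields` is a name of my own choosing):

```python
import struct

def float64_fields(x):
    """Split a float's double-precision encoding into its three fields."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                  # 1 bit
    exponent = (bits >> 52) & 0x7FF    # 11 bits
    fraction = bits & ((1 << 52) - 1)  # 52 bits
    return sign, exponent, fraction

# 1.0 is stored as sign 0, exponent field 1023 (the double bias), fraction 0
print(float64_fields(1.0))  # → (0, 1023, 0)
```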

comparison to scientific notation

say we have a number in scientific notation, written as a × 10^n.

the first bit is for the sign, self-explanatory. instead of a decimal exponent (the n in a × 10^n), we instead have 8 binary digits.

the actual exponent being represented is the unsigned number stored in the exponent field, minus the bias of 127. to encode an exponent, we go the other way: add 127, then convert the result to binary. this becomes our exponent field.
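the bias arithmetic is small enough to sketch directly (the helper name is mine):

```python
BIAS = 127  # single-precision exponent bias

def exponent_field(actual_exponent):
    """Encode an actual exponent as the 8-bit biased field."""
    return format(actual_exponent + BIAS, "08b")

print(exponent_field(3))   # 3 + 127 = 130 → "10000010"
print(exponent_field(-1))  # -1 + 127 = 126 → "01111110"
print(exponent_field(0))   # 0 + 127 = 127 → "01111111"
```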

denormalized exponent (exponent field all zeroes)

zero equals 0.000… and so on, but the floating point format specifies an implied leading 1. we don’t have a leading 1 in this case, so what do we do?

instead, we use a denormalized exponent, which means the exponent field equals all zeroes (00000000).

the IEEE standard tells us to interpret this exponent field as 2^−126. that is, we don’t do any calculations and don’t subtract the bias; an all-zero exponent field simply signifies 2^−126, with no implied leading 1. we multiply this with our mantissa to get the number we’re representing.

it allows us to represent very small numbers, but it results in positive and negative zeroes. this is fine in practice because they compare equal.
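both claims can be checked by hand. a sketch: the smallest positive subnormal (exponent field all zeroes, fraction 1) is 2^−126 × 2^−23 = 2^−149, and the two zeroes have different bit patterns yet compare equal:

```python
import struct

# smallest positive subnormal in single precision:
# exponent field all zeroes, fraction field = 1
(tiny,) = struct.unpack(">f", struct.pack(">I", 0x00000001))
print(tiny == 2.0 ** -149)           # True: 2^-126 × 2^-23

# negative zero has its sign bit set...
print(struct.pack(">f", -0.0).hex())  # '80000000'
print(struct.pack(">f", 0.0).hex())   # '00000000'
# ...but the two zeroes still compare equal
print(0.0 == -0.0)                    # True
```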

other non-standard exponents (exponent field all ones)

other weird floating point numbers exist, like when the exponent field is all ones (11111111). when the mantissa is also set to zero, this encodes positive/negative infinity.

as with ordinary numbers, the sign bit signifies positive or negative.

we can also encode NaN, or “not a number”. this occurs when the mantissa is set to a non-zero number while the exponent field is still all ones. for example, 0/0 produces NaN.
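a quick sketch showing both special encodings, reusing the bit-pattern trick from earlier (`bits32` is a helper name of my own):

```python
import math
import struct

def bits32(x):
    """Return a float's 32-bit pattern as a binary string."""
    (b,) = struct.unpack(">I", struct.pack(">f", x))
    return f"{b:032b}"

# infinity: sign 0, exponent field all ones, fraction all zeroes
print(bits32(math.inf))   # 01111111100000000000000000000000

# NaN: exponent field all ones, fraction non-zero
print(bits32(math.nan))

# NaN compares unequal to everything, including itself
print(math.nan == math.nan)  # False
```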

conversion from decimal

there are two steps: we first convert the decimal number to binary fixed point. then, we convert that to the floating point standard.

Example

say we wish to convert 5.375. in binary, this is equivalent to 101.011.

next, we convert to floating point.

we first normalize it by shifting the point two places to the left, so that there’s only one digit to the left of the point. remember, this is a binary point, not a decimal point, so everything is in powers of 2.

since the number is positive, the sign bit is 0. to calculate the exponent field, we perform 2 + 127 = 129. converting this to binary gives us 10000001.

finally, we have 23 bits for the fraction field. since the fraction currently only uses 5 bits, we pad the end out with zeroes.

here, the padded bits would be considered the fraction field. combining the sign bit, exponent field, and fraction field into a 32-bit sequence, we have the floating point representation of our number.
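the whole conversion can be verified by assembling the three fields and letting the machine decode them. a sketch, using 5.375 = 101.011 in binary = 1.01011 × 2² as an illustrative value, so the fields are sign 0, exponent 2 + 127 = 129, and fraction 01011 padded to 23 bits (`encode_float32` is a helper name of my own):

```python
import struct

def encode_float32(sign, exponent_field, fraction_bits):
    """Assemble a 32-bit pattern from its three fields and decode it."""
    bits = (sign << 31) | (exponent_field << 23) | fraction_bits
    (x,) = struct.unpack(">f", struct.pack(">I", bits))
    return x

# fraction 01011 shifted left 18 places = 01011 padded to 23 bits
x = encode_float32(0, 0b10000001, 0b01011 << 18)
print(x)  # → 5.375
```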

limitations

IEEE single/double precision has been the dominant standard since 1985. newer fields like machine learning, however, often trade precision for computational efficiency, using smaller floating point formats.

see The bfloat16 numerical format and Half-precision floating-point format.