[latexpage]

Fixed-point mathematics is a method for representing numbers on a binary computer architecture. It allows the storage of numbers with a fractional part, similar to the float and double types, but typically requires less computation time, particularly on processors without a floating-point unit. The trade-off is lower precision and flexibility.

A standard notation for fixed-point numbers is to represent the type by fp(x, y), where x is the number of bits to the left of the binary point, and y is the number of bits to the right of the binary point.

If you think about it, normal integers are just a special case of a fixed-point number in which the binary point is to the right of the least-significant bit. A 32-bit unsigned integer (uint32) would be fp(32, 0). It is common to use the bit-width of the architecture as the default fixed-point length, as the architecture has native support for manipulating variables of that size (commonly 32-bit on the more powerful embedded microcontrollers).

Unfortunately, low-level languages like C and C++ do not have native support for fixed-point mathematics (although there are many third-party libraries out there!). C++ has a nice advantage over C in that it supports operator overloading, meaning that you can write a fixed-point library so that two fixed-point numbers can be multiplied or divided using the ‘*’ or ‘/’ syntax, just like other native number types (in C you would have to use functions or macros).
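As a rough illustration of the operator-overloading idea, here is a minimal sketch. The Fixed16 type, the Q16.16 format and the fromInt helper are all hypothetical names chosen for this example; they are not taken from any particular library.

```cpp
#include <cstdint>

// Number of fractional bits. Q16.16 is an assumption made for this sketch.
constexpr int kFracBits = 16;

// Hypothetical wrapper type: the stored integer is the real value times 2^16.
struct Fixed16 {
    std::int32_t raw;
};

// Build a Fixed16 from a whole number by scaling it up by 2^16.
constexpr Fixed16 fromInt(std::int32_t whole) {
    return Fixed16{whole * (1 << kFracBits)};
}

// Multiply: widen to 64 bits, then shift the extra 2^16 scale factor out.
// (Right-shifting a negative value is arithmetic on mainstream compilers.)
constexpr Fixed16 operator*(Fixed16 a, Fixed16 b) {
    return Fixed16{static_cast<std::int32_t>(
        (static_cast<std::int64_t>(a.raw) * b.raw) >> kFracBits)};
}

// Divide: pre-scale the numerator by 2^16 so the quotient keeps its scale.
// The caller must make sure b is not zero.
constexpr Fixed16 operator/(Fixed16 a, Fixed16 b) {
    return Fixed16{static_cast<std::int32_t>(
        (static_cast<std::int64_t>(a.raw) * (std::int64_t{1} << kFracBits)) /
        b.raw)};
}
```

With these overloads, an expression such as fromInt(3) * fromInt(2) reads the same as it would for a native integer type.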

Notation

Q is a number format used to describe fixed-point numbers. It uses the form:

$Qi.f$

$i$ = number of integer bits

$f$ = number of fractional bits

The Range Of Fixed-Point Numbers

The range of an unsigned fixed-point number with $i$ bits for the integer part and $f$ bits for the fractional part is:

$0 \textnormal{ to } (2^i - 1) + 2^{-f} \times (2^{f} - 1)$

For example, an 8-bit fixed-point number with 5 bits for the integer and 3 bits for the fractional part (Q5.3) would have a range from 0 to 31.875.
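The range formula above can be checked numerically. The following is a small sketch; the function name maxUnsignedValue is made up for this example, and it assumes the total width is no more than 32 bits.

```cpp
#include <cstdint>

// Largest value of an unsigned Qi.f number, which simplifies to 2^i - 2^-f.
// Assumes i + f <= 32 (an assumption made for this sketch).
constexpr double maxUnsignedValue(int i, int f) {
    // Setting all (i + f) bits gives the largest raw integer. Dividing by
    // 2^f moves the binary point f places to the left.
    return static_cast<double>((std::uint64_t{1} << (i + f)) - 1) /
           static_cast<double>(std::uint64_t{1} << f);
}
```

For Q5.3, maxUnsignedValue(5, 3) evaluates to 255 / 8 = 31.875, matching the example above.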

The Precision Of Fixed-Point Numbers

The precision of a fixed-point number is determined solely by the number of fractional bits. The precision is equal to:

$2^{-f}$
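As a worked example, the Q5.3 format from the previous section has $f = 3$, so the smallest step between two representable values is:

$2^{-3} = 0.125$

A Q16.16 format (chosen here purely for illustration) has $f = 16$, which gives:

$2^{-16} \approx 0.0000153$

If a real value is rounded to the nearest representable number, the error is at most half a step, $2^{-(f+1)}$.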

Converting To Fixed-Point

All code is in C++.
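One common way to convert is to multiply the real value by $2^f$ and round to the nearest integer; converting back divides by $2^f$. The sketch below assumes a signed Q16.16 format stored in an int32_t, and the names toFixed and toDouble are hypothetical. It performs no range checking, so the input must already fit in the chosen format.

```cpp
#include <cstdint>

// Number of fractional bits. Q16.16 is an assumption made for this sketch.
constexpr int kFracBits = 16;

// Real value to fixed-point: scale by 2^f, then round to the nearest integer.
// Adding +/-0.5 before the truncating cast rounds halves away from zero.
// No range check: the scaled value must fit in an int32_t.
constexpr std::int32_t toFixed(double value) {
    return static_cast<std::int32_t>(
        value * (1 << kFracBits) + (value >= 0.0 ? 0.5 : -0.5));
}

// Fixed-point back to a real value: divide the raw integer by 2^f.
constexpr double toDouble(std::int32_t raw) {
    return static_cast<double>(raw) / (1 << kFracBits);
}
```

Values such as 1.5 convert exactly (raw value 98304), while a value like 0.1 has no exact binary representation and is rounded to the nearest multiple of $2^{-16}$.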

Adding/Subtracting Fixed-Point Numbers

If the numbers have the same precision, they can be added or subtracted directly without any manipulation. Be wary of overflowing though! To ensure no overflow, the resultant fixed-point number has to have one more integer bit than the inputs.

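If the numbers have different precision, they must be converted to the same precision before adding. Either number can be converted to the precision of the other by bit shifting, but you must be aware that information could be lost in the process: shifting right discards the least-significant fractional bits, while shifting left can push the most-significant integer bits out of the word.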
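The alignment step can be sketched as follows. The two formats (Q24.8 and Q16.16) and the function name are assumptions picked for this example. The result is returned in a wider type so that the sum itself cannot overflow.

```cpp
#include <cstdint>

// Add a Q24.8 value and a Q16.16 value, returning the sum in Q24.8.
// The Q16.16 operand is shifted right by 16 - 8 = 8 bits to line up the
// binary points, which discards its 8 least-significant fractional bits.
// (Right-shifting a negative value is arithmetic on mainstream compilers.)
constexpr std::int64_t addQ24_8_Q16_16(std::int32_t a_q8, std::int32_t b_q16) {
    // Widening to 64 bits gives the extra integer headroom the sum may need.
    return static_cast<std::int64_t>(a_q8) + (b_q16 >> 8);
}
```

For instance, 1.5 in Q24.8 (raw 384) plus 2.25 in Q16.16 (raw 147456) gives raw 960, which is 3.75 in Q24.8. Converting the other way, by shifting the Q24.8 operand left by 8 bits, would keep all fractional bits but risks losing the upper integer bits.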

Multiplying Fixed-Point Numbers

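Multiplying fixed-point numbers that have a fractional part (that is, anything other than a plain integer such as int32_t, which no one really treats as a “fixed-point” number anyway) requires a standard multiplication followed by a division, which happens to be an easy bit shift in code. The division is needed because each input is scaled by $2^f$, so their product is scaled by $2^{2f}$ and has to be divided by $2^f$ to get back to the original format.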

Because the end result is smaller than the intermediate result of multiplying the two numbers together, care has to be taken to make sure that the intermediate result does not overflow. One way to do this is to cast the inputs to a data type twice as large (in terms of bits). If using int32_t fixed-point numbers, this is possible with a cast to int64_t, which most compilers support (including those commonly used for embedded targets, such as GCC).
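A minimal sketch of this widen, multiply and shift sequence is shown below. The Q16.16 format and the function name mulQ16_16 are assumptions made for illustration only.

```cpp
#include <cstdint>

// Multiply two signed Q16.16 numbers stored in int32_t.
constexpr std::int32_t mulQ16_16(std::int32_t a, std::int32_t b) {
    // 1. Widen to 64 bits so the Q32.32 intermediate product cannot overflow.
    // 2. Shift right by 16 (divide by 2^16) to return to Q16.16.
    // 3. Narrow back to 32 bits. The caller must ensure the result fits.
    return static_cast<std::int32_t>(
        (static_cast<std::int64_t>(a) * b) >> 16);
}
```

As an example, 100.0 has the raw value 6553600 in Q16.16. The intermediate product 6553600 * 6553600 is far too large for 32 bits, but after the shift the result is 655360000, which is 10000.0 in Q16.16 and fits comfortably in an int32_t.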

Embedded C++ Fixed-Point Library

I have written an embedded C++ fixed-point library. It is freely available for download from GitHub.

External Resources

See Fixed-Point Representation and Fractional Math by Erick L. Oberstar.

Tonc: Fixed Point Numbers And LUTs is a good tutorial on fixed-point numbers and how they are implemented in software and hardware.

From The Book of Hook, An Introduction To Fixed Point Math is another great resource.