Fixed-point as fractions
Fixed-point formats describe fractions where both numerator and denominator are integers.
This is part 2 of the onboarding floating point series. This series is intended to be used for onboarding of programmers new to the team to review a basic understanding of fixed-point and floating-point number formats, or for programmers who would like to remove some of the mystery from formats they may use everyday.
Fixed-point and floating-point formats are used to describe rational numbers, or fractions where both numerator and denominator are integers.
This should be familiar. For instance, if we are measuring real-world distances in meters, we could use a structure like this:
struct distance
{
int km;
int m;
int cm;
int mm;
};
Where a distance in meters as a fraction would be:
As a convenience, we might want to input or display our distance in meters in decimal form (e.g. 1.5m), but that value would still be stored as a fraction and we would need conversion functions to and from decimal form.
Sign-magnitude
If we want to store negative as well as positive distances, we might notice that either all the values are negative or all the values are positive.
Since all values share a sign, we can store that just once and use a signed-magnitude form to store our distance (where s=-1 or s=1)
struct distance
{
// sign
int sign; // 1=positive, -1=negative
// magnitude
unsigned int km;
unsigned int m;
unsigned int cm;
unsigned int mm;
};
Now let’s consider the sizes for the fields. int defaults to 32-bit.
struct distance
{
// sign
int32_t sign; // 1=positive, -1=negative
// magnitude
uint32_t km;
uint32_t m;
uint32_t cm;
uint32_t mm;
};
We know that m only needs values between 0-999, cm only needs values between 0-99, mm only needs values between 0-9, and sign only stores values -1 or 1. So let’s change our default integer value to 16-bits.
struct distance
{
// sign
int16_t sign; // 1=positive, -1=negative
// magnitude
uint16_t km;
uint16_t m;
uint16_t cm;
uint16_t mm;
};
We can store between 0-65535 in a uint16_t, which is enough space to store mm without storing cm.
Additionally, if we know the distances we’ll be working with won’t be greater than 65 kilometers, we can also store m without storing km.
struct distance
{
// sign
int16_t sign; // 1=positive, -1=negative
// magnitude
uint16_t m;
uint16_t mm;
};
sign only stores -1 (negative) or 1 (positive), which can be stored in one bit. Conventionally, we can store that as 1 (negative) and 0 (positive).
m stores between 0-65535 in 16 bits.
mm only needs to store 0-1000, so 10 bits (0-1023) is sufficient.
So we might store distance in a uint32_t as follows:
| sign | meter (m) | unused | millimeter (mm) |
|-------|---------------------------------|--------------|-------------------|
| 1 bit | 16 bits | 5 bits | 10 bits |
And given distance (d) as uint32_t, we can extract the values as:
uint16_t s = distance >> 31;
uint16_t m = (distance >> 15) & 0x0000ffff;
uint16_t mm = distance & 0x000003ff;
Like most things, power of two values are a lot more convenient to store than powers of ten. We can take advantage of the nature of binary numbers if instead of millimeter (1/1000) resolution, we use a resolution of 1/1024m.
| sign | meter (m) | unused | 1/1024m |
|-------|---------------------------------|--------------|-------------------|
| 1 bit | 16 bits | 5 bits | 10 bits |
There are 31 bits of available storage for the magnitude. More than enough to store our fractional unit without separately storing meters.
| sign | 1/1024m |
|-------|--------------------------------------------------------------------|
| 1 bit | 31 bits |
We can extract sign-magnitude as:
uint32_t sign = distance >> 31;
uint32_t magnitude = distance & 0x7fffffff;
One meter is stored as 1024 fractional units, those can be extracted as:
uint32_t meters = magnitude >> 10;
uint32_t fractional = magnitude & 0x3ff;
This is an example of what would conventionally be called a 1:21:10 fixed-point format for 1 bit sign, 21 bits whole part, and 10 bits fractional part.
In the way of historically poor naming, we might still use familiar decimal prefixes for binary values. We would know that meter maps 1:1 with the real-life measure, but other measures would conventionally be in terms of powers-of-two.
| kilo- | 1024x | kilometer = 1024m |
| milli- | 1/1024x | millimeter = 1/1024m |
We could extract those values from the magnitude as:
uint32_t km = (magnitude >> 20) & 0x7ff; // 11 bits (0-2047)
uint32_t m = (magnitude >> 10) & 0x3ff; // 10 bits (0-1023)
uint32_t mm = magnitude & 0x3ff; // 10 bits (0-1023)
Two’s-complement
Alternatively, we can store our “millimeters” as two’s-complement int32_t values, by storing only the signed integer numerator (n) of the fraction:
This format would conventionally be called Q10 (for 10 bit fractional part). It works similarly to the sign-magnitude format (e.g. n=1 is 1/1024 meter, or n=1024 is 1 meter.)
It can be converted to sign-magnitude to extract mm, m, and km separately as above.
One advantage to storing in the Q-format, is that casting from one storage size to another works in the expected way for two’s-complement.
int64_t
convert_to_64bit( int32_t q10 )
{
return q10;
}
REFERENCE: You might want to check out Chris Hummersone’s Q-format converter and calculator.
Take-Away
How rational numbers can be stored in integer formats and a specific example of using 1:21:10 and Q10 fixed-point formats to store distance in meters.
Next: Part 3
Estimating fixed-point range and resolution - An example of selecting a fixed-point format based on the context.