Two's-complement Fixed-point Basic Math
Fixed-point values are fractions. Basic math with fixed-point values is basic math with fractions.
This is part 8 of the onboarding floating point series. The series is intended for onboarding programmers new to the team, as a review of the basics of fixed-point and floating-point number formats, and for programmers who would like to remove some of the mystery from formats they may use every day.
For this example we’ll use a signed 16-bit integer with a 10-bit fractional part. As was introduced in fixed-point as fractions, the value stored is the numerator of the fraction, and the denominator is fixed at 2^10 = 1024:
value = stored / 1024
Some example values:
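A few values can be worked out directly from that definition. The sketch below is only an illustration: the helper names to_fixed and to_double and the constants FRAC_BITS and ONE are hypothetical, chosen for this example, and the conversion simply truncates toward zero.

```c
#include <assert.h>
#include <stdint.h>

#define FRAC_BITS 10                // assumed: 10 fractional bits, as in this article
#define ONE       (1 << FRAC_BITS)  // 1024, the fixed denominator

// Hypothetical helper: real value -> stored numerator (truncates toward zero).
static int16_t to_fixed(double x) { return (int16_t)(x * ONE); }

// Hypothetical helper: stored numerator -> real value.
static double to_double(int16_t v) { return (double)v / ONE; }

static void example_values(void)
{
    assert(to_fixed(1.0)  == 1024);                // 1024/1024
    assert(to_fixed(0.5)  == 512);                 //  512/1024
    assert(to_fixed(-1.5) == -1536);               // -1536/1024
    assert(to_double(1) == 0.0009765625);          // smallest step, 1/1024
    assert(to_double(INT16_MAX) == 31.9990234375); // 32767/1024
    assert(to_double(INT16_MIN) == -32.0);         // -32768/1024
}
```

The representable range is therefore -32.0 up to just under 32.0, in steps of 1/1024.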
Addition and Subtraction
Addition of fixed-point values stored as two’s-complement is no different from any other two’s-complement integer addition. Both operands share the denominator 1024, so adding the numerators adds the fractions.
int16_t
add( int16_t a, int16_t b)
{
return a+b;
}
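As a quick worked check, the sketch below adds a couple of Q5.10-style values by hand. The function add_example is hypothetical and exists only to hold the assertions; the stored values are the real values multiplied by 1024.

```c
#include <assert.h>
#include <stdint.h>

static int16_t add(int16_t a, int16_t b)
{
    return (int16_t)(a + b);
}

static void add_example(void)
{
    // 1.5 + 2.25 = 3.75  ->  1536/1024 + 2304/1024 = 3840/1024
    assert(add(1536, 2304) == 3840);
    // -0.5 + 0.25 = -0.25  ->  -512/1024 + 256/1024 = -256/1024
    assert(add(-512, 256) == -256);
}
```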
EXERCISE 8-1: Create a saturating version of add.
* Overflow: If the sum would be greater than the largest representable value, the largest representable value is returned.
* Underflow: If the sum would be smaller than the smallest representable value, the smallest representable value is returned.
Subtraction can be implemented as integer subtract:
int16_t
sub( int16_t a, int16_t b)
{
return a-b;
}
Alternatively, subtraction can be implemented in terms of addition.
int16_t sub( int16_t a, int16_t b)
{
return add(a,-b);
}
Multiplication
Multiplying the fractions a/1024 and b/1024 gives (a*b)/(1024*1024), so the numerator of the result over 1024 is (a*b)/1024. The sub-expression (a*b) requires a 32-bit result, as two 16-bit operands can produce a 32-bit product. That product is then reduced by 10 bits (divided by 1024); however, that still leaves potentially up to a 22-bit result, so the result must still be returned as a 32-bit value.
int32_t
mul( int16_t a, int16_t b)
{
return ((int32_t)a*b)/1024; // widen first so the product is 32 bits even where int is 16 bits
}
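A short worked example may help here. The sketch below repeats the divide-based mul so that it is self-contained, and mul_example is a hypothetical function that only holds the checks.

```c
#include <assert.h>
#include <stdint.h>

static int32_t mul(int16_t a, int16_t b)
{
    return ((int32_t)a * b) / 1024;
}

static void mul_example(void)
{
    // 1.5 * 2.5 = 3.75: 1536 * 2560 = 3932160, and 3932160 / 1024 = 3840
    assert(mul(1536, 2560) == 3840);
    // 31.0 * 31.0 = 961.0: the numerator 984064 no longer fits in 16 bits
    assert(mul(31744, 31744) == 984064);
    assert(mul(31744, 31744) > INT16_MAX);
}
```

The second case shows why the return type is int32_t: both inputs are valid 16-bit values, but the product is far outside the 16-bit range.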
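If right shift is used instead of divide with signed integers, negative values are rounded toward negative infinity rather than toward zero (assuming an arithmetic shift, which is implementation-defined in C for negative values but is what mainstream compilers provide). To truncate instead (round toward zero), add 1023 to negative products before shifting: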
int32_t
mul( int16_t a, int16_t b )
{
int32_t c_pos = (int32_t)a*b;
int32_t c_neg = c_pos+1023;
int32_t c_sign = c_pos >> 31;
int32_t c = (c_sign & c_neg) | ((~c_sign) & c_pos);
return c >> 10;
}
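To see the difference between the two roundings, the sketch below compares a plain shift with a biased shift on a few small numerators. The names shift_floor and shift_trunc are hypothetical, and shift_trunc is a compact variant of the same bias technique (it masks the bias instead of selecting between two sums). It assumes >> on a negative value is an arithmetic shift.

```c
#include <assert.h>
#include <stdint.h>

// Plain shift: rounds toward negative infinity (assumes arithmetic shift).
static int32_t shift_floor(int32_t c) { return c >> 10; }

// Biased shift: rounds toward zero, matching C's integer division.
static int32_t shift_trunc(int32_t c)
{
    int32_t sign = c >> 31;            // -1 if c is negative, else 0
    return (c + (sign & 1023)) >> 10;  // add 1023 only to negative values
}

static void rounding_example(void)
{
    assert(-1 / 1024 == 0);            // C division truncates toward zero
    assert(shift_floor(-1) == -1);     // plain shift does not
    assert(shift_trunc(-1) == 0);
    assert(shift_floor(-1025) == -2);
    assert(shift_trunc(-1025) == -1);
    assert(shift_trunc(-1024) == -1);  // exact values are unchanged
    assert(shift_trunc(1023) == 0);    // positive values are unchanged
}
```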
Another common approach is to round half away from zero:
int32_t
mul( int16_t a, int16_t b )
{
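int32_t c_mul = (int32_t)a*b;
int32_t c_pos = c_mul + 512; // positive: add 1/2, then round down
int32_t c_neg = c_mul + 511; // negative: subtract 1/2 (-512), then add 1023 to round toward zero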
int32_t c_sign = c_mul >> 31;
int32_t c = (c_sign & c_neg) | ((~c_sign) & c_pos);
return c >> 10;
}
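One way to check a branch-free rounding routine is to compare it against a slower reference that works on the magnitude. The sketch below is such a reference under stated assumptions: round_half_away_ref is a hypothetical name, it takes the 32-bit product directly, and it assumes the input is not INT32_MIN (which a product of two int16_t values can never be).

```c
#include <assert.h>
#include <stdint.h>

// Reference: round |c|/1024 to nearest, ties away from zero, then restore the sign.
static int32_t round_half_away_ref(int32_t c)
{
    int32_t mag = (c < 0) ? -c : c;    // assumes c != INT32_MIN
    int32_t q   = (mag + 512) / 1024;  // add 1/2, then truncate
    return (c < 0) ? -q : q;
}

static void round_half_away_example(void)
{
    assert(round_half_away_ref( 512)  ==  1);  //  0.5 ->  1
    assert(round_half_away_ref( 511)  ==  0);  // just under 0.5 -> 0
    assert(round_half_away_ref(-512)  == -1);  // -0.5 -> -1
    assert(round_half_away_ref(-511)  ==  0);
    assert(round_half_away_ref( 1536) ==  2);  //  1.5 ->  2
    assert(round_half_away_ref(-1536) == -2);  // -1.5 -> -2
}
```

The ties at plus and minus 512 are the cases most worth testing, since they are where an off-by-one in the negative bias shows up.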
EXERCISE 8-2: Implement mul() with round half to even, which is just like round half away from zero except that on exactly 1/2 boundaries, the nearest even value is chosen. e.g.
3 * 1/2 = 1 1/2 -> 2
5 * 1/2 = 2 1/2 -> 2
7 * 1/2 = 3 1/2 -> 4
9 * 1/2 = 4 1/2 -> 4
-3 * 1/2 = -1 1/2 -> -2
-5 * 1/2 = -2 1/2 -> -2
-7 * 1/2 = -3 1/2 -> -4
-9 * 1/2 = -4 1/2 -> -4
Absolute value
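The absolute value of fixed-point values stored as signed integers is the same as with any other signed integer, including the usual caveat that the most negative value (-32768) has no positive counterpart in 16 bits: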
int16_t
int16_abs( int16_t a )
{
return (a<0)?(-a):a;
}
Or alternatively, without a branch, because it’s in two’s complement (this assumes right shift of a negative value is an arithmetic shift):
int16_t
int16_abs( int16_t a )
{
int16_t a0 = a >> 15;
int16_t a1 = a ^ a0;
int16_t a2 = a1 - a0;
return a2;
}
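Tracing the branch-free version by hand shows why it works: the shift produces a mask of all ones for negative inputs, XOR with that mask is a bitwise NOT, and subtracting the mask adds one, which together negate a two's-complement value. The sketch below is a condensed restatement with that trace as comments; abs_example is a hypothetical function that only holds the checks.

```c
#include <assert.h>
#include <stdint.h>

static int16_t int16_abs(int16_t a)
{
    int16_t mask = (int16_t)(a >> 15);     // -1 if a is negative, 0 otherwise
    return (int16_t)((a ^ mask) - mask);   // negative: ~a + 1; otherwise: a
}

static void abs_example(void)
{
    // a = -3: mask = -1, a ^ mask = ~(-3) = 2, 2 - (-1) = 3
    assert(int16_abs(-3) == 3);
    // a = 3: mask = 0, a ^ 0 = 3, 3 - 0 = 3
    assert(int16_abs(3) == 3);
    assert(int16_abs(0) == 0);
    assert(int16_abs(-32767) == 32767);
}
```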
Division
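Dividing the fraction a/1024 by b/1024 gives a/b, so the numerator of the result over 1024 is (a*1024)/b. The sub-expression (a*1024) requires a 32-bit result, and if the magnitude of b is less than 1024 (that is, the divisor is smaller than 1.0), the final result may still not fit in 16 bits.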
int32_t
div( int16_t a, int16_t b)
{
return ((int32_t)a*1024)/b; // multiply rather than shift: a<<10 is undefined for negative a
}
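A worked example, as a sketch: the function is named fixed_div here (a hypothetical name) to avoid clashing with the standard library's div, and the assertion on b is an added assumption, since the behavior for a zero divisor is not specified above.

```c
#include <assert.h>
#include <stdint.h>

static int32_t fixed_div(int16_t a, int16_t b)
{
    assert(b != 0);                  // assumed precondition: no division by zero
    return ((int32_t)a * 1024) / b;  // truncates toward zero
}

static void div_example(void)
{
    // 3.75 / 1.5 = 2.5: 3840 * 1024 = 3932160, and 3932160 / 1536 = 2560
    assert(fixed_div(3840, 1536) == 2560);
    // -1.0 / 2.0 = -0.5
    assert(fixed_div(-1024, 2048) == -512);
    // 16.0 / (1/1024) = 16384.0: the numerator 16777216 needs more than 16 bits
    assert(fixed_div(16384, 1) == 16777216);
    assert(fixed_div(16384, 1) > INT16_MAX);
}
```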
When using round half away from zero, since negative and positive values are mirrored, unsigned division can be used:
int32_t
div( int16_t a, int16_t b )
{
int32_t sign = (a ^ b) >> 15; // sign
uint32_t ua = int16_abs(a); // magnitude(a)
uint32_t ub = int16_abs(b); // magnitude(b)
uint32_t uc = ((ua << 10)+(ub>>1)) / ub; // div, round half away from zero
int32_t c = (uc + sign)^sign; // to two's complement
return c;
}
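The effect of the rounding term is easiest to see on thirds. The sketch below is a self-contained variant, not the listing above verbatim: fixed_div_round is a hypothetical name, the sign is tracked with a comparison instead of a mask, and the operands are widened before negation so that -32768 is handled as well.

```c
#include <assert.h>
#include <stdint.h>

static int32_t fixed_div_round(int16_t a, int16_t b)
{
    assert(b != 0);                                 // assumed precondition
    int neg = (a < 0) != (b < 0);                   // result is negative if signs differ
    uint32_t ua = (a < 0) ? (uint32_t)(-(int32_t)a) : (uint32_t)a;
    uint32_t ub = (b < 0) ? (uint32_t)(-(int32_t)b) : (uint32_t)b;
    uint32_t uc = ((ua << 10) + (ub >> 1)) / ub;    // add half the divisor, then truncate
    return neg ? -(int32_t)uc : (int32_t)uc;
}

static void div_round_example(void)
{
    // 1.0 / 3.0 = 0.3333...: 341.33/1024 rounds down to 341
    assert(fixed_div_round(1024, 3072) == 341);
    // 2.0 / 3.0 = 0.6666...: 682.67/1024 rounds up to 683
    assert(fixed_div_round(2048, 3072) == 683);
    // Mirrored for negatives; plain truncating division would give -682
    assert(fixed_div_round(-2048, 3072) == -683);
    // Exact tie: (1/1024) / 2.0 = 0.5/1024 rounds away from zero
    assert(fixed_div_round(1, 2048) == 1);
    assert(fixed_div_round(-1, 2048) == -1);
}
```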
EXERCISE 8-3: Compare and contrast the approach to calculating round half away from zero in mul() and div().
REFERENCE: See libfixmath for an implementation of Q16.16 format fixed-point operations in C.
Next: Part 9
Floating-point Addition and Subtraction - Floating-point values are fractions. Adding and subtracting floating-point numbers is adding and subtracting fractions.