Two's-complement Fixed-point Basic Math

Fixed-point values are fractions. Basic math with fixed-point values is basic math with fractions.

Sep 11, 2023

This is part 8 of the onboarding floating point series. The series is intended for onboarding programmers new to the team who need to review the basics of fixed-point and floating-point number formats, and for programmers who would like to remove some of the mystery from formats they may use every day.

For this example we’ll use a signed 16-bit integer with a 10-bit fractional part. As was introduced in fixed-point as fractions, the value stored is the numerator of the fraction:

\(\frac {n}{1024}\)

Some example values:
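As a rough illustration of the mapping between a stored numerator and the value it represents, here is a minimal sketch. The helper names fx_from_double, fx_to_double and fx_examples are made up for this example and are not part of the series; the conversion assumes the input is within the representable range (roughly -32 to just under 32).

```c
#include <assert.h>
#include <stdint.h>

#define FX_FRAC_BITS 10                   /* 10 fractional bits, as in the text */
#define FX_ONE       (1 << FX_FRAC_BITS)  /* 1024, the shared denominator */

/* Hypothetical helper: real value -> stored numerator, rounded to nearest.
   Assumes x is within the representable range. */
static int16_t fx_from_double(double x)
{
  double scaled = x * FX_ONE;
  return (int16_t)(scaled < 0 ? scaled - 0.5 : scaled + 0.5);
}

/* Hypothetical helper: stored numerator -> real value. */
static double fx_to_double(int16_t n)
{
  return (double)n / FX_ONE;
}

/* A few stored values and the fractions they represent. */
static void fx_examples(void)
{
  assert(fx_from_double( 1.0)  ==  1024);            /*  1024/1024 */
  assert(fx_from_double( 0.5)  ==   512);            /*   512/1024 */
  assert(fx_from_double(-1.5)  == -1536);            /* -1536/1024 */
  assert(fx_from_double( 3.75) ==  3840);            /*  3840/1024 */
  assert(fx_to_double(1)         == 0.0009765625);   /* smallest step, 1/1024 */
  assert(fx_to_double(INT16_MAX) == 31.9990234375);  /* 32767/1024 */
  assert(fx_to_double(INT16_MIN) == -32.0);          /* -32768/1024 */
}
```

The range is asymmetric, as with any two’s-complement integer: the most negative value is exactly -32, while the most positive is one step short of 32.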

Addition and Subtraction

Addition of fixed-point values stored as two’s-complement is no different from any other two’s-complement integer addition.

\(\frac {c}{1024} = \frac {a}{1024} + \frac {b}{1024}\)

\(\frac {c}{1024} = \frac {a+b}{1024}\)

\(c = a+b\)

int16_t 
add( int16_t a, int16_t b)
{
  return a+b;
}
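To make the arithmetic concrete, here is a small sketch with a couple of worked sums. The names fx_add, fx_add_overflows and fx_add_examples are assumptions for this example; fx_add is the same operation as add() above, renamed so the block stands alone.

```c
#include <assert.h>
#include <stdint.h>

/* Same operation as add() in the text, renamed to keep this sketch standalone. */
static int16_t fx_add(int16_t a, int16_t b)
{
  return (int16_t)(a + b);
}

/* Hypothetical helper: does the exact sum fall outside the int16_t range? */
static int fx_add_overflows(int16_t a, int16_t b)
{
  int32_t wide = (int32_t)a + (int32_t)b;  /* exact: cannot overflow 32 bits */
  return wide > INT16_MAX || wide < INT16_MIN;
}

static void fx_add_examples(void)
{
  /* 1.5 + 2.25 = 3.75  ->  1536 + 2304 = 3840 */
  assert(fx_add(1536, 2304) == 3840);
  /* -0.5 + 0.25 = -0.25  ->  -512 + 256 = -256 */
  assert(fx_add(-512, 256) == -256);
  /* 20.0 + 20.0 = 40.0, but the largest representable value is just under 32 */
  assert(fx_add_overflows(20480, 20480));
  assert(!fx_add_overflows(1536, 2304));
}
```

The denominator never appears in the code; only the numerators are added, which is why the result can silently leave the 16-bit range.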

EXERCISE 8-1: Create a saturating version of add().

  * Overflow: If the sum would be greater than the largest representable value, the largest representable value is returned.
  * Underflow: If the sum would be smaller than the smallest representable value, the smallest representable value is returned.

Subtraction can be implemented as integer subtract:

\(\frac {c}{1024} = \frac {a}{1024} - \frac {b}{1024}\)

\(\frac {c}{1024} = \frac {a-b}{1024}\)

\(c = a-b\)

int16_t 
sub( int16_t a, int16_t b)
{
  return a-b;
}

Alternatively, subtraction can be implemented in terms of addition.

int16_t sub( int16_t a, int16_t b)
{
  return add(a,-b);
}

Multiplication

\(\frac {c}{1024} = \frac {a}{1024} \cdot \frac {b}{1024}\)

\(\frac {c}{1024} = \frac {a \cdot b}{1024 \cdot 1024} \)

\(c = \frac {a \cdot b}{1024 } \)

The sub-expression (a*b) requires a 32-bit result, since the product of two 16-bit multiplicands can need up to 32 bits. That result is then reduced by 10 bits (divided by 1024), which still leaves potentially up to a 22-bit result, so it must still be returned as a 32-bit value.

int32_t 
mul( int16_t a, int16_t b)
{
  return ((int32_t)a*b)/1024;  // widen first so the product is computed in 32 bits
}
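A few worked products may help here. This is a sketch only; fx_mul and fx_mul_examples are names assumed for the example, with fx_mul doing the same computation as mul() above.

```c
#include <assert.h>
#include <stdint.h>

/* Same computation as mul() in the text, renamed to keep this sketch standalone. */
static int32_t fx_mul(int16_t a, int16_t b)
{
  return ((int32_t)a * (int32_t)b) / 1024;
}

static void fx_mul_examples(void)
{
  /* 1.5 * 2.5 = 3.75  ->  (1536 * 2560) / 1024 = 3932160 / 1024 = 3840 */
  assert(fx_mul(1536, 2560) == 3840);
  /* 31.0 * 31.0 = 961.0  ->  984064, far outside the int16_t range */
  assert(fx_mul(31744, 31744) == 984064);
  assert(fx_mul(31744, 31744) > INT16_MAX);
  /* -0.5 * (1/1024): exact result is -1/2 of a step; C division truncates to 0 */
  assert(fx_mul(-512, 1) == 0);
}
```

The second case is why the return type is 32 bits: both inputs fit comfortably in 16 bits, but their product does not.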

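If a right shift is used instead of a divide with signed integers, negative values are rounded toward negative infinity rather than toward zero (assuming an arithmetic shift, which C leaves implementation-defined for negative values). To truncate instead (round toward zero):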

int32_t 
mul( int16_t a, int16_t b )
{
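  int32_t c_pos   = a*b;                // product, used as-is when non-negative
  int32_t c_neg   = c_pos+1023;         // biased so the shift rounds negatives toward zero
  int32_t c_sign  = c_pos >> 31;        // all ones if negative, else zero
  int32_t c       = (c_sign & c_neg) | ((~c_sign) & c_pos);  // branchless select
  return c >> 10;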
}
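The difference between dividing and shifting only shows up for negative values, which a short comparison makes visible. This is a sketch under the assumption that the compiler implements >> on negative signed values as an arithmetic shift (true of the common compilers, but not guaranteed by the C standard). The function names are made up for the example, and shift_trunc is a compact equivalent of the masked select above, not the author's exact code.

```c
#include <assert.h>
#include <stdint.h>

/* Plain division: C99 and later truncate toward zero. */
static int32_t div_by_1024(int32_t x) { return x / 1024; }

/* Plain shift: rounds toward negative infinity (assumes arithmetic shift). */
static int32_t shift_by_10(int32_t x) { return x >> 10; }

/* Shift with a bias applied only to negative inputs, so it truncates. */
static int32_t shift_trunc(int32_t x)
{
  int32_t sign = x >> 31;      /* 0 for non-negative, -1 (all ones) for negative */
  int32_t bias = sign & 1023;  /* 1023 only when x is negative */
  return (x + bias) >> 10;
}

static void shift_examples(void)
{
  /* -1/1024 is a tiny negative fraction */
  assert(div_by_1024(-1) ==  0);
  assert(shift_by_10(-1) == -1);
  assert(shift_trunc(-1) ==  0);
  /* -1025/1024 is just below -1 */
  assert(div_by_1024(-1025) == -1);
  assert(shift_by_10(-1025) == -2);
  assert(shift_trunc(-1025) == -1);
  /* positive values agree in all three */
  assert(div_by_1024(1025) == 1);
  assert(shift_by_10(1025) == 1);
  assert(shift_trunc(1025) == 1);
}
```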

Another common approach is to round half away from zero:

int32_t 
mul( int16_t a, int16_t b )
{
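  int32_t c_mul   = a*b;
  int32_t c_pos   = c_mul + 512;        // non-negative: add 1/2, then shift
  int32_t c_neg   = c_mul + 511;        // negative: subtract 1/2 (-512), then bias by 1023 to truncate
  int32_t c_sign  = c_mul >> 31;        // all ones if negative, else zero
  int32_t c       = (c_sign & c_neg) | ((~c_sign) & c_pos);
  return c >> 10;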
}
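One way to sanity-check a rounding routine is to compare it against a deliberately plain reference that rounds the magnitude and then restores the sign. The sketch below is such a reference, not the branchless version above; fx_round_shift and fx_round_examples are names assumed for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Reference: divide x by 1024, rounding half away from zero.
   Works on the magnitude so both signs mirror each other. */
static int32_t fx_round_shift(int32_t x)
{
  uint32_t mag = x < 0 ? 0u - (uint32_t)x : (uint32_t)x;  /* |x| without signed overflow */
  uint32_t r   = (mag + 512u) >> 10;                       /* add 1/2, drop fraction */
  return x < 0 ? -(int32_t)r : (int32_t)r;
}

static void fx_round_examples(void)
{
  assert(fx_round_shift( 511)  ==  0);  /* just under 1/2 rounds to 0 */
  assert(fx_round_shift( 512)  ==  1);  /* exactly 1/2 rounds away from zero */
  assert(fx_round_shift(-511)  ==  0);
  assert(fx_round_shift(-512)  == -1);
  assert(fx_round_shift(  -1)  ==  0);  /* tiny negative values must not become +1 or -1 */
  assert(fx_round_shift( 1536) ==  2);  /*  1.5 ->  2 */
  assert(fx_round_shift(-1536) == -2);  /* -1.5 -> -2 */
  assert(fx_round_shift( 2560) ==  3);  /*  2.5 ->  3 */
}
```

Values near zero, such as -1, are worth including in any test of a branchless variant, since a wrong bias on the negative path shows up there first.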

EXERCISE 8-2: Implement mul() with round half to even, which is just like round half away from zero except that on exactly 1/2 boundaries, the nearest even value is chosen. e.g.
 3 * 1/2 =  1 1/2  ->   2
 5 * 1/2 =  2 1/2  ->   2
 7 * 1/2 =  3 1/2  ->   4
 9 * 1/2 =  4 1/2  ->   4
-3 * 1/2 = -1 1/2  ->  -2
-5 * 1/2 = -2 1/2  ->  -2
-7 * 1/2 = -3 1/2  ->  -4
-9 * 1/2 = -4 1/2  ->  -4

Absolute value

The absolute value of fixed-point values stored as signed integers is the same as with any other signed integer:

int16_t
int16_abs( int16_t a )
{
    return (a<0)?(-a):a;
}

Or alternatively, because it’s in two’s complement:

int16_t
int16_abs( int16_t a )
{
  int16_t a0 = a >> 15;
  int16_t a1 = a ^ a0;
  int16_t a2 = a1 - a0;
  return a2;
}
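Tracing the two’s-complement version with a = -3 shows how it works: a0 is -1 (all ones), a ^ a0 flips every bit to give 2, and subtracting a0 adds 1 to give 3. For non-negative a, a0 is 0 and both steps do nothing. The sketch below does the same steps in 32 bits so that the one awkward input, -32768, whose magnitude does not fit in int16_t, still gives a usable answer. The name fx_abs_wide is an assumption for this example, and the code assumes an arithmetic right shift.

```c
#include <assert.h>
#include <stdint.h>

/* Branchless absolute value, widened to 32 bits so |INT16_MIN| is representable. */
static int32_t fx_abs_wide(int16_t a)
{
  int32_t x    = a;
  int32_t mask = x >> 31;   /* 0 for non-negative, -1 for negative (arithmetic shift assumed) */
  return (x ^ mask) - mask; /* negative: flip bits, then add 1; otherwise unchanged */
}

static void fx_abs_examples(void)
{
  assert(fx_abs_wide(-3)        == 3);
  assert(fx_abs_wide( 1536)     == 1536);   /*  1.5 -> 1.5 */
  assert(fx_abs_wide(-1536)     == 1536);   /* -1.5 -> 1.5 */
  assert(fx_abs_wide( 0)        == 0);
  assert(fx_abs_wide(INT16_MIN) == 32768);  /* 32.0, one step past INT16_MAX */
}
```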

Division

\(\frac {c}{1024} = \frac { \frac {a}{1024} } { \frac {b}{1024} }\)

\(\frac {c}{1024} = \frac {a}{1024} \cdot \frac {1024}{b}\)

\(\frac {c}{1024} = \frac {a}{b} \)

\(c = \frac {1024 \cdot a}{b} \)

The sub-expression (a*1024) requires a 32 bit result, and if b is less than 1024, the final result may still not fit in 16 bits.

int32_t 
div( int16_t a, int16_t b)
{
  return ((int32_t)a*1024)/b;  // multiply rather than shift: left-shifting a negative value is undefined in C
}
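Some worked quotients, as a sketch: fx_div and fx_div_examples are names assumed for the example, and the caller is assumed to have already excluded b == 0.

```c
#include <assert.h>
#include <stdint.h>

/* Same computation as div() in the text: scale the numerator, then divide.
   Assumes b != 0. */
static int32_t fx_div(int16_t a, int16_t b)
{
  return ((int32_t)a * 1024) / b;
}

static void fx_div_examples(void)
{
  /* 3.0 / 2.0 = 1.5  ->  (3072 * 1024) / 2048 = 1536 */
  assert(fx_div(3072, 2048) == 1536);
  /* 1.0 / 0.25 = 4.0  ->  (1024 * 1024) / 256 = 4096 */
  assert(fx_div(1024, 256) == 4096);
  /* 1.0 / (1/1024) = 1024.0  ->  1048576, which does not fit in 16 bits */
  assert(fx_div(1024, 1) == 1048576);
  assert(fx_div(1024, 1) > INT16_MAX);
  /* -1.0 / 3.0 = -0.333...  ->  -341.33..., truncated toward zero to -341 */
  assert(fx_div(-1024, 3072) == -341);
}
```

The third case is the small-divisor situation described above: dividing by a value less than 1.0 makes the result larger than the numerator.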

When using round half away from zero, since negative and positive values are mirrored, unsigned division can be used:

int32_t
div( int16_t a, int16_t b )
{
  int32_t  sign = (a ^ b) >> 15;              // sign
  uint32_t ua   = int16_abs(a);               // magnitude(a)
  uint32_t ub   = int16_abs(b);               // magnitude(b)
  uint32_t uc   = ((ua << 10)+(ub>>1)) / ub;  // div round half from zero
  int32_t  c    = (uc + sign)^sign;           // to two's complement
  return c;
}
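A variant of the same idea, written as a sketch with explicit widening so that an input of -32768 is handled as well. The names fx_div_round and fx_div_round_examples are assumptions for this example, and b is assumed to be non-zero.

```c
#include <assert.h>
#include <stdint.h>

/* Divide with round half away from zero, working on magnitudes.
   Assumes b != 0. */
static int32_t fx_div_round(int16_t a, int16_t b)
{
  int      neg = (a < 0) != (b < 0);                        /* result sign */
  uint32_t ua  = (uint32_t)(a < 0 ? -(int32_t)a : (int32_t)a);
  uint32_t ub  = (uint32_t)(b < 0 ? -(int32_t)b : (int32_t)b);
  uint32_t uc  = ((ua << 10) + (ub >> 1)) / ub;             /* add half the divisor, then divide */
  return neg ? -(int32_t)uc : (int32_t)uc;
}

static void fx_div_round_examples(void)
{
  /* 1.0 / 3.0 = 341.33... steps -> 341 */
  assert(fx_div_round(1024, 3072) == 341);
  /* 2.0 / 3.0 = 682.66... steps -> 683, mirrored for negatives */
  assert(fx_div_round( 2048, 3072) ==  683);
  assert(fx_div_round(-2048, 3072) == -683);
  /* (3/1024) / 2.0 = exactly 1.5 steps -> 2, and -2 when the signs differ */
  assert(fx_div_round( 3,  2048) ==  2);
  assert(fx_div_round(-3,  2048) == -2);
  assert(fx_div_round( 3, -2048) == -2);
  assert(fx_div_round(-3, -2048) ==  2);
}
```

Adding half the divisor before dividing plays the same role here that adding 512 before the shift plays in mul().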

EXERCISE 8-3: Compare and contrast the approach to calculating round half away from zero in mul() and div().

REFERENCE: See libfixmath for an implementation of Q16.16 format fixed-point operations in C.

Next: Part 9

Floating-point Addition and Subtraction - Floating-point values are fractions. Addition and subtracting floating-point numbers is adding and subtracting fractions.
