Floating-point Addition and Subtraction

Floating-point values are fractions. Addition and subtracting floating-point numbers is adding and subtracting fractions.

Sep 11, 2023

This is part 9 of the onboarding floating point series. This series is intended to be used for onboarding of programmers new to the team to review a basic understanding of fixed-point and floating-point number formats, or for programmers who would like to remove some of the mystery from formats they may use everyday.

We’ll continue using the s10e5 floating-point format from floating point as fractions.

We can express s10e5 as a fraction:

\(s \cdot \left( 2^{x} + \frac{2^{x}m}{1024} \right)\)

Except where exp15 is zero:

\(s \cdot \left( \frac{2^{-14}m}{1024} \right)\)

TERMS
-----------------------------------------------------------------------------
| sign  | x15                 | m                                           |
|-------|---------------------|---------------------------------------------|
| 1 bit | 5 bits              | 10 bits                                     |
-----------------------------------------------------------------------------
| sign        | 1 bit value exactly as stored (1=negative, 0=positive)      |
| x15         | 5 bit value exactly as stored as offset-15                  |
| m           | 10 bit value exactly as stored as unsigned integer          |
| x           | exponent as signed integer                                  |
| s           | sign as signed integer (-1=negative, 1=positive)            |
| magnitude   | unsigned value as f(x15,m)                                  |

Addition

If we rearrange the fraction, we can express s10e5 as:

\(s \cdot \frac {1024+m} {2^{10-x}} \)

Except for where x15 is zero:

\(s \cdot \frac {m} {2^{24}}\)

In order to add two fractions (c = a+b), we need a common denominator.

\(\frac { s_{a} \left( 1024+m_{a} \right) } {2^{10-x_{a}}} + \frac { s_{b} \left( 1024+m_{b} \right) } {2^{10-x_{b}}} \)

\(\frac { s_{a} \left( 1024+m_{a} \right) } {2^{10-x_{a}} } + \frac { s_{b} \left( 1024+m_{b} \right) \cdot 2^{x_{b}-x_{a}} } {2^{10-x_{b}} \cdot 2^{x_{b}-x_{a}} } \)

\(\frac { s_{a} \left( 1024+m_{a} \right) + s_{b} \left( 1024+m_{b} \right) \cdot 2^{x_{b}-x_{a}} } {2^{10-x_{a}} }\)

And if we replace:

\(significand_{a} = s_{a}(1024+m_{a})\)

\(significand_{b} = s_{b}(1024+m_{b})\)

\(\frac { significand_{a} + significand_{b} \cdot 2^{x_{b}-x_{a}} } {2^{10-x_{a}} }\)

Which tells us that (assuming x_a is larger than x_b), significand_b just needs to be right-shifted by the difference between the exponents and added to significand_a as signed integers. With the result being in terms of the exponent x_a.

\(significand_{c} = significand_{a} + significand_{b} \cdot 2^{x_{b}-x_{a}} \)

And we need to find the mantissa to store:

\(m_{c} = abs(significand_{c})-1024\)

However, m_c is expected to be stored as an unsigned integer of 10 bits. So if singificand_c is too large, we need to first carry to the exponents. If it is too small, we need to first borrow from the exponent. That step is called normalizing the fraction.

uint16_t
adda( uint16_t u, uint16_t v )
{
  // 
  // assign a,b where abs(a) > abs(b)
  //

  uint16_t u_magnitude = u & 0x7fff;
  uint16_t v_magnitude = v & 0x7fff;
  uint16_t u_gt        = (int16_t)(v_magnitude-u_magnitude) >> 15;
  uint16_t a           = (u_gt & u) | ((~u_gt) & v);
  uint16_t b           = (u_gt & v) | ((~u_gt) & u);

  //
  // Extract components of a,b where:
  //   s     = negative=-1, positive=0
  //   m     = 10 bit unsigned mantissa
  //   x15   = 5 bit exponent in offset-15
  //   x15nz = (x15 != 0)?-1:0
  //   x     = if (x15 == 0) -14
  //           else x15 as two's-complement 
  //

  uint16_t a_s     = (int16_t)a >> 15; 
  uint16_t a_m     = a & 0x03ff;
  uint16_t a_x15   = (a & 0x7c00) >>  10;
  uint16_t a_x15nz = (int16_t)(-a_x15)>>15;
  int16_t  a_x     = (a_x15nz & (a_x15-15)) | ((~a_x15nz) & -14);

  uint16_t b_s     = (int16_t)b >> 15; 
  uint16_t b_m     = b & 0x03ff;
  uint16_t b_x15   = (b & 0x7c00) >>  10;
  uint16_t b_x15nz = (int16_t)(-b_x15)>>15;
  int16_t  b_x     = (b_x15nz & (b_x15-15)) | ((~b_x15nz) & -14);

  //
  // Create significand of a,b where:
  //  if (x15 != 0) significand = (s?-1:1) * (1024 + m) 
  //  else          significand = (s?-1:1) * m
  //

  int16_t a_significand = ((a_m + (a_x15nz&1024)) + a_s)^a_s;
  int16_t b_significand = ((b_m + (b_x15nz&1024)) + b_s)^b_s;

  // Sum
  // - shift right b_significand by -(b_x-a_x)
  // - half round up

  int16_t  b_significand_sa    = a_x - b_x;
  int16_t b_significand_round = 1 << (b_significand_sa-1);
  int16_t  c_significand       = a_significand + ((b_significand + b_significand_round) >> b_significand_sa);

  // 
  // Extract sign, magnitude from c_significand where
  //    s            = negative=-1, positive=0
  //    usignificand = abs(significand)
  //

  uint16_t c_s            = (int16_t)c_significand >> 15;
  uint16_t c_usignificand = (c_significand ^ c_s) - c_s;

  //
  // Check for carry or borrow
  // 1. Check top bit of c_unisignificand
  //   - 0x400 = 1024. c_x = c_a
  //   - Each bit higher than 0x400, carry to c_x.
  //   - Each bit lower than 0x400, borrow from c_x.
  // 2. c_m = c_significand - 1024
  //

  int16_t  norm_sa   = __builtin_clz( c_usignificand )-21;
  uint16_t c_x       = a_x - norm_sa;
  uint16_t norm_sa_s = (int16_t)norm_sa >> 15;
  int16_t  norm_rsa  = norm_sa_s & (-norm_sa);
  int16_t  norm_lsa  = (~norm_sa_s) & norm_sa;
  uint16_t c_m       = ((c_usignificand << norm_lsa) >> norm_rsa)&0x03ff;

  //
  // Store
  //

  uint16_t c_x15 = c_x + 15;
  uint16_t c     = (c_s << 15) | ( c_x15 << 10 ) | c_m;

  return c;
}

Subtraction

Subtraction can be implemented as addition:

uint16_t
sub( uint16_t a, uint16_t b )
{
  return add(a,b^0x8000);
}

Using available float instructions

Since half is a s10e5 floating-point format, instructions which work with half can be used directly.

For instance on x64 to convert from half (s10e5 floating-point) to float (s23e8 floating-point):

uint16_t float_to_half( float x )
{
  return _cvtss_sh(x, 0);
}

float half_to_float( uint16_t d )
{
  return _cvtsh_ss(d);
}

REFERENCE: Details About Intrinsics for Half Floats [intel.com]

So alternatively, you can use float instructions available in hardware.

uint16_t
add( uint16_t a, uint16_t b )
{
  float    fa = _cvtsh_ss(a);
  float    fb = _cvtsh_ss(b);
  float    fc = fa + fb;
  uint16_t c  = _cvtss_sh(fc, 0);
}

REFERENCE: See half.h [github.com/AcademySoftwareFoundation] from OpenEXR for an example of using float for half operations.

Also, for C, in GCC (v13+) and Clang (v15+), _Float16 is a built-in type and equivalent to half and does operations in float automatically when available. So the above is also equivalent to:

_Float16 
add( _Float16 a, _Float16  b)
{
    return a+b;
}

REFERENCE: For more details on GCC support for half, see:6.13 Half-Precision Floating Point [gcc.gnu.org]

For a comparison of the compiler outputs of these variations, see: https://godbolt.org/z/9raj6nfos

Exercise 9-1: Create a version of abs() for s23e8 floating-point.

Exercise 9-2: Create a version of add() for s23e8 floating-point. Compare results to using float instructions.

Next: Part 10

Floating-point Multiplication - Floating-point values are fractions. Multiplying floating-point numbers is multiplying fractions.

Get 10% off a group subscription

AltDevArts

Discussion about this post