Floating-point Division

Floating-point values are fractions. Dividing floating-point numbers is dividing fractions.

Sep 11, 2023

This is part 11 of the onboarding floating-point series. The series is meant for programmers new to the team who need to review the basics of fixed-point and floating-point number formats, and for programmers who would like to remove some of the mystery from formats they may use every day.

We’ll continue using the s10e5 floating-point format from “Floating point as fractions”.

We can express s10e5 as a fraction:

\(s \cdot \left( 2^{x} + \frac{2^{x}m}{1024} \right)\)

Except where x15 is zero (subnormal values):

\(s \cdot \left( \frac{2^{-14}m}{1024} \right)\)

TERMS
-----------------------------------------------------------------------------
| sign  | x15                 | m                                           |
|-------|---------------------|---------------------------------------------|
| 1 bit | 5 bits              | 10 bits                                     |
-----------------------------------------------------------------------------
| sign        | 1 bit value exactly as stored (1=negative, 0=positive)      |
| x15         | 5 bit value exactly as stored as offset-15                  |
| m           | 10 bit value exactly as stored as unsigned integer          |
| x           | exponent as signed integer                                  |
| s           | sign as signed integer (-1=negative, 1=positive)            |
| magnitude   | unsigned value as f(x15,m)                                  |
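
As a quick check of the two fraction forms and the terms above, here is a minimal decoding sketch. The helper name s10e5_to_double is hypothetical (it is not part of this series), and the sketch assumes infinity and NaN (x15 == 31) do not need special treatment.

```c
#include <math.h>
#include <stdint.h>

/* Hypothetical helper (not from the article): decode an s10e5 bit pattern
   using the two fraction forms above. Infinity and NaN (x15 == 31) are
   not treated specially in this sketch. */
static double s10e5_to_double(uint16_t v)
{
  int    sign = (v >> 15) & 1;      /* 1 = negative, 0 = positive   */
  int    x15  = (v >> 10) & 0x1f;   /* exponent exactly as stored   */
  int    m    = v & 0x03ff;         /* mantissa exactly as stored   */
  double s    = sign ? -1.0 : 1.0;  /* sign as a signed value       */

  if (x15 == 0) {
    /* s * (2^-14 * m / 1024) */
    return s * (ldexp(1.0, -14) * m / 1024.0);
  }

  int    x  = x15 - 15;             /* exponent as signed integer   */
  double p2 = ldexp(1.0, x);        /* 2^x                          */

  /* s * (2^x + 2^x * m / 1024) */
  return s * (p2 + p2 * m / 1024.0);
}
```

For example, 0x4600 has x15 = 17 and m = 512, so it decodes as 4 + 4 * 512/1024 = 6.0, and 0x0001 decodes as the smallest subnormal, 2^-24.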

Division

As with multiplication, if we rearrange the fractions, we can express an s10e5 floating-point value as:

\( s \cdot \left(\frac{1024+m}{1024}\right) \cdot 2^{x}\)

Except where x15 is zero:

\(s \cdot \left(\frac{m}{1024}\right) \cdot 2^{-14}\)
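
For the normal case, the rearrangement only factors out \(2^{x}\) and puts both terms over 1024 (the x15 = 0 case is the same step without the leading term):

\(2^{x} + \frac{2^{x}m}{1024} = 2^{x} \cdot \left(1 + \frac{m}{1024}\right) = \left(\frac{1024+m}{1024}\right) \cdot 2^{x}\)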

Dividing two normal values (c = a/b):

\(\frac { \left(s_a \cdot \frac{1024+m_a}{1024} \cdot 2^{x_a}\right) } { \left(s_b \cdot \frac{1024+m_b}{1024} \cdot 2^{x_b}\right) }\)

\(\frac{s_a \cdot (1024+m_a) \cdot 2^{x_a} }{1024} \cdot \frac{1024}{s_b \cdot (1024+m_b) \cdot 2^{x_b}} \)

\(\frac {s_a}{s_b} \cdot \frac { (1024+m_a) \cdot 1024}{1024 \cdot (1024 + m_b)} \cdot \frac {2^{x_a}}{2^{x_b}}\)

\(\frac {s_a}{s_b} \cdot \frac { (1024+m_a) \cdot 1024}{1024 \cdot (1024 + m_b)} \cdot 2^{x_a - x_b}\)

Now substitute:

\(s_c = \frac {s_a}{s_b}\)

\(x_c = x_a-x_b\)

\(significand_c = \frac { (1024+m_a) \cdot 1024}{(1024 + m_b)}\)

and rewrite the quotient as:

\(s_{c} \cdot \frac { significand_{c} } { 1024 } \cdot 2^{x_{c}}\)

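The quotient of two significands in [1024, 2048) lies in (512, 2048), so \(significand_c\) may first need to be normalized back into [1024, 2048) by doubling it and decrementing \(x_c\). Then subtract 1024 to extract the mantissa: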

\(m_c = significand_c - 1024\)

Re-write in the expected form:

\( s_c \cdot \left(\frac{1024+m_c}{1024}\right) \cdot 2^{x_c}\)
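
The derivation can be followed step by step in a small branching sketch before looking at a branch-free version. The helper name div_normal_sketch is hypothetical, and the sketch assumes both inputs and the result are normal values (x15 in 1..30); zero, subnormals, infinity, NaN and rounding are all ignored.

```c
#include <stdint.h>

/* Hypothetical helper (not from the article): divide two s10e5 values by
   following the derivation one step at a time. Assumes a, b and the
   result are all normal (x15 in 1..30). The quotient is truncated,
   not rounded. */
static uint16_t div_normal_sketch(uint16_t a, uint16_t b)
{
  uint32_t sig_a = 1024u + (a & 0x03ffu);          /* 1024 + m_a        */
  uint32_t sig_b = 1024u + (b & 0x03ffu);          /* 1024 + m_b        */
  int      x_a   = (int)((a >> 10) & 0x1f) - 15;   /* signed exponents  */
  int      x_b   = (int)((b >> 10) & 0x1f) - 15;

  /* s_c = s_a / s_b: with signs of +1/-1 this is an XOR of sign bits. */
  uint16_t sign_c = (uint16_t)((a ^ b) & 0x8000u);

  int      x_c   = x_a - x_b;                      /* x_c = x_a - x_b   */
  uint32_t sig_c = (sig_a * 1024u) / sig_b;        /* significand_c     */

  /* sig_c is in [512, 2047]. If it is below 1024, double it and
     decrement the exponent so it lands in [1024, 2048). */
  if (sig_c < 1024u) {
    sig_c <<= 1;
    x_c    -= 1;
  }

  uint32_t m_c = sig_c - 1024u;                    /* m_c               */

  return (uint16_t)(sign_c | (uint32_t)((x_c + 15) << 10) | m_c);
}
```

With these assumptions, 6.0 / 1.5 (0x4600 / 0x3e00) gives 0x4400, which is 4.0, and 1.0 / 1.5 (0x3c00 / 0x3e00) gives 0x3954, roughly 0.66602. The nearest s10e5 value to 2/3 is 0x3955, so the last bit shows the cost of truncating the integer quotient.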

uint16_t
diva(uint16_t a, uint16_t b)
{
  //
  // Extract components of a,b where:
  //   s     = negative=-1, positive=0
  //   m     = 10 bit unsigned mantissa
  //   x15   = 5 bit exponent in offset-15
  //   x15nz = (x15 != 0)?-1:0
  //   x     = if (x15 == 0) -14
  //           else x15 as two's-complement 
  //

  uint16_t a_s     = (int16_t)a >> 15; 
  uint16_t a_m     = a & 0x03ff;
  uint16_t a_x15   = (a & 0x7c00) >>  10;
  uint16_t a_x15nz = (int16_t)(-a_x15)>>15;
  int16_t  a_x     = (a_x15nz & (a_x15-15)) | ((~a_x15nz) & -14);

  uint16_t b_s     = (int16_t)b >> 15; 
  uint16_t b_m     = b & 0x03ff;
  uint16_t b_x15   = (b & 0x7c00) >>  10;
  uint16_t b_x15nz = (int16_t)(-b_x15)>>15;
  int16_t  b_x     = (b_x15nz & (b_x15-15)) | ((~b_x15nz) & -14);

  //
  // Create significand of a,b where:
  //  if (x15 != 0) significand = 1024 + m
  //  else          significand = m
  //

  uint16_t a_significand = a_m + (a_x15nz&1024);
  uint16_t b_significand = b_m + (b_x15nz&1024);
 
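  //
  // Divide
  //   a_significand is pre-scaled by 1024 (<< 10) so the integer quotient
  //   keeps 10 fractional bits. Note: b = +/-0 gives b_significand = 0
  //   (division by zero), which is not handled here.
  //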

  uint16_t c_s             = a_s ^ b_s;
  uint32_t c_significand   = ((uint32_t)a_significand << 10) / b_significand;
  int16_t  c_x             = a_x - b_x;

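  //
  // Normalize
  //   Shift c_significand so its leading 1 lands on bit 10 (clz == 21)
  //   and adjust the exponent by the same amount.
  //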

  uint32_t signz    = (int32_t)(-c_significand) >> 31;
  int16_t  sigclz31 = __builtin_clz( c_significand );
  int16_t  sigclz   = (signz & sigclz31)|((~signz)&32);

  int16_t  norm_sa         = (sigclz - 21);
  uint16_t norm_sa_ltz     =  norm_sa >> 15;
  int16_t  norm_x          = c_x - norm_sa;
  int16_t  significand_rsa = norm_sa_ltz & (-norm_sa);
  int16_t  significand_lsa = (~norm_sa_ltz) & norm_sa;
  uint16_t norm_significand = (c_significand << significand_lsa) >> significand_rsa;

  uint16_t c = (c_s << 15)|((norm_x+15) << 10)|(norm_significand & 0x3ff);  
  return c;
}

For a comparison with outputs of compiler-generated float conversions, see:
https://godbolt.org/z/bK88h75oh

Next: Part 12

Floating-point further reading - Recommended reading.
