Floating-point Division
Floating-point values are fractions. Dividing floating-point numbers is dividing fractions.
This is part 11 of the onboarding floating-point series. The series is intended for onboarding programmers new to the team who need to review the basics of fixed-point and floating-point number formats, and for programmers who would like to remove some of the mystery from formats they may use every day.
We’ll continue using the s10e5 floating-point format from "Floating point as fractions".
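We can express s10e5 as a fraction:

$$ \text{value} = s \times 2^{x} \times \frac{1024 + m}{1024}, \qquad x = x15 - 15 $$

Except where x15 is zero:

$$ \text{value} = s \times 2^{-14} \times \frac{m}{1024} $$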
TERMS
-----------------------------------------------------------------------------
| sign  | x15                 | m                                           |
|-------|---------------------|---------------------------------------------|
| 1 bit | 5 bits              | 10 bits                                     |
-----------------------------------------------------------------------------

| sign      | 1 bit value exactly as stored (1=negative, 0=positive) |
| x15       | 5 bit value exactly as stored as offset-15             |
| m         | 10 bit value exactly as stored as unsigned integer     |
| x         | exponent as signed integer                             |
| s         | sign as signed integer (-1=negative, 1=positive)       |
| magnitude | unsigned value as f(x15,m)                             |
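To make the terms concrete, here is a minimal sketch that decodes an s10e5 bit pattern into a double using the fraction form: a significand of (1024 + m)/1024 scaled by 2^x, or m/1024 scaled by 2^-14 when x15 is zero. The names `pow2i` and `s10e5_to_double` are hypothetical helpers introduced for illustration, and the sketch assumes the input is not an infinity or NaN (x15 == 31).

```c
#include <stdint.h>

/* Exact power of two for small integer exponents (avoids needing libm). */
static double pow2i(int x)
{
    double p = 1.0;
    for (; x > 0; --x) p *= 2.0;
    for (; x < 0; ++x) p *= 0.5;
    return p;
}

/* Hypothetical helper, not from the article: decode s10e5 as a fraction.
   Assumption: x15 == 31 (infinity/NaN) is not treated specially. */
double s10e5_to_double(uint16_t h)
{
    int s   = (h & 0x8000) ? -1 : 1;          /* sign as signed integer    */
    int x15 = (h >> 10) & 0x1f;               /* exponent in offset-15     */
    int m   = h & 0x03ff;                     /* 10 bit mantissa           */
    int x   = (x15 != 0) ? (x15 - 15) : -14;  /* exponent as signed int    */
    int significand = (x15 != 0) ? (1024 + m) : m;

    return s * (significand / 1024.0) * pow2i(x);
}
```

For example, 0x3c00 has x15 = 15 and m = 0, so it decodes to 1024/1024 * 2^0 = 1.0, and 0x4200 has x15 = 16 and m = 512, so it decodes to 1536/1024 * 2^1 = 3.0.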
Division
As with multiplication, if we rearrange the fractions, we can express an s10e5 floating-point value as:
Except where x15 is zero:
Dividing two values (c = a/b):
And if we replace
Re-write as:
Subtract 1024 to extract mantissa:
Re-write in the expected form:
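One way to write out the steps above is sketched below. It is a reconstruction chosen to agree with the code listing that follows, not necessarily the exact intermediate forms of the original derivation. The symbols sig (the significand, 1024 + m, or m when x15 is zero), q (the integer quotient) and k (the normalizing shift) are names introduced here for illustration, and q is assumed to be non-zero.

```latex
\begin{align*}
\text{value} &= s \cdot (1024 + m) \cdot 2^{x - 10} && (x15 \neq 0) \\
\text{value} &= s \cdot m \cdot 2^{-14 - 10} && (x15 = 0) \\[4pt]
c = \frac{a}{b} &= \frac{s_a \cdot sig_a \cdot 2^{x_a - 10}}{s_b \cdot sig_b \cdot 2^{x_b - 10}}
  = s_a s_b \cdot \frac{sig_a}{sig_b} \cdot 2^{x_a - x_b} \\
q &= \left\lfloor \frac{2^{10} \cdot sig_a}{sig_b} \right\rfloor
  \quad\Rightarrow\quad c \approx s_a s_b \cdot q \cdot 2^{(x_a - x_b) - 10} \\
1024 &\le q \cdot 2^{k} < 2048
  \quad\Rightarrow\quad m_c = \lfloor q \cdot 2^{k} \rfloor - 1024, \quad x_c = x_a - x_b - k \\
c &\approx s_c \cdot (1024 + m_c) \cdot 2^{x_c - 10}, \qquad s_c = s_a s_b
\end{align*}
```

Because s is either -1 or 1, dividing the signs is the same as multiplying them. Pre-scaling the dividend by 2^10 keeps ten fractional bits in the integer quotient, and the shift k is what the normalize step in the code computes from the leading-zero count.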
#include <stdint.h>

// Note: this listing does not handle b == 0 (integer divide by zero),
// infinities, NaNs, or zero and denormal results, and the quotient is
// truncated rather than rounded.
uint32_t
diva(uint16_t a, uint16_t b)
{
    //
    // Extract components of a,b where:
    //   s     = negative=-1, positive=0
    //   m     = 10 bit unsigned mantissa
    //   x15   = 5 bit exponent in offset-15
    //   x15nz = (x15 != 0)?-1:0
    //   x     = if (x15 == 0) -14
    //           else x15-15 as two's-complement
    //
    uint16_t a_s     = (int16_t)a >> 15;
    uint16_t a_m     = a & 0x03ff;
    uint16_t a_x15   = (a & 0x7c00) >> 10;
    uint16_t a_x15nz = (int16_t)(-a_x15) >> 15;
    int16_t  a_x     = (a_x15nz & (a_x15-15)) | ((~a_x15nz) & -14);

    uint16_t b_s     = (int16_t)b >> 15;
    uint16_t b_m     = b & 0x03ff;
    uint16_t b_x15   = (b & 0x7c00) >> 10;
    uint16_t b_x15nz = (int16_t)(-b_x15) >> 15;
    int16_t  b_x     = (b_x15nz & (b_x15-15)) | ((~b_x15nz) & -14);

    //
    // Create significand of a,b where:
    //   if (x15 != 0) significand = 1024 + m
    //   else          significand = m
    //
    uint16_t a_significand = a_m + (a_x15nz & 1024);
    uint16_t b_significand = b_m + (b_x15nz & 1024);

    //
    // Divide: signs combine with xor, exponents subtract, and the dividend
    // is pre-scaled by 2^10 so the integer quotient keeps 10 fractional bits.
    //
    uint16_t c_s           = a_s ^ b_s;
    uint32_t c_significand = ((uint32_t)a_significand << 10) / b_significand;
    int16_t  c_x           = a_x - b_x;

    //
    // Normalize: shift the quotient so its leading one lands on bit 10
    // (21 leading zeros in 32 bits) and adjust the exponent to match.
    //   signz = (c_significand != 0)?-1:0
    // The |1 keeps the argument of __builtin_clz non-zero; the zero case
    // is replaced by 32 on the next line.
    //
    uint32_t signz       = (int32_t)(-c_significand) >> 31;
    int16_t  sigclz31    = __builtin_clz( c_significand | 1 );
    int16_t  sigclz      = (signz & sigclz31) | ((~signz) & 32);
    int16_t  norm_sa     = (sigclz - 21);
    uint16_t norm_sa_ltz = norm_sa >> 15;
    int16_t  norm_x      = c_x - norm_sa;

    // Shift left when norm_sa >= 0, right when norm_sa < 0.
    int16_t  significand_rsa  = norm_sa_ltz & (-norm_sa);
    int16_t  significand_lsa  = (~norm_sa_ltz) & norm_sa;
    uint16_t norm_significand = (c_significand << significand_lsa) >> significand_rsa;

    // Pack sign, offset-15 exponent, and mantissa (implied leading one dropped).
    uint16_t c = (c_s << 15) | ((norm_x+15) << 10) | (norm_significand & 0x3ff);
    return c;
}
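For readers who find the mask-based listing hard to follow, the same steps can be restated with ordinary branches and loops. This is a hedged sketch rather than the article's implementation: `s10e5_div_sketch` is a hypothetical name, the normalizing shift is found with loops instead of a leading-zero count, and a zero quotient returns a signed zero. It assumes b is non-zero, neither input is an infinity or NaN, and the result is a normal number. It truncates rather than rounds.

```c
#include <stdint.h>

/* Hypothetical branch-based restatement of the division steps.
   Assumptions: b != 0, no infinity/NaN inputs, result in normal range. */
uint16_t s10e5_div_sketch(uint16_t a, uint16_t b)
{
    /* Split each operand into exponent field and mantissa. */
    uint16_t a_x15 = (a >> 10) & 0x1f, a_m = a & 0x03ff;
    uint16_t b_x15 = (b >> 10) & 0x1f, b_m = b & 0x03ff;

    /* Signed exponent and significand; x15 == 0 means no implied one. */
    int      a_x   = a_x15 ? (a_x15 - 15) : -14;
    int      b_x   = b_x15 ? (b_x15 - 15) : -14;
    uint32_t a_sig = a_x15 ? (1024u + a_m) : a_m;
    uint32_t b_sig = b_x15 ? (1024u + b_m) : b_m;

    /* Divide: xor the sign bits, pre-scale by 2^10, subtract exponents. */
    uint16_t c_s   = (uint16_t)((a ^ b) & 0x8000);
    uint32_t c_sig = (a_sig << 10) / b_sig;   /* requires b_sig != 0 */
    int      c_x   = a_x - b_x;

    if (c_sig == 0)
        return c_s;                           /* signed zero */

    /* Normalize so that 1024 <= c_sig < 2048. */
    while (c_sig < 1024)  { c_sig <<= 1; c_x -= 1; }
    while (c_sig >= 2048) { c_sig >>= 1; c_x += 1; }

    /* Pack: sign, offset-15 exponent, mantissa without the implied one. */
    return (uint16_t)(c_s | ((uint32_t)(c_x + 15) << 10) | (c_sig & 0x03ff));
}
```

As a worked case, 1.0 / 3.0 is `s10e5_div_sketch(0x3c00, 0x4200)`: the significands are 1024 and 1536, the quotient is (1024 << 10) / 1536 = 682, one left shift normalizes it to 1364 with exponent -2, and the packed result is 0x3554. That is one unit in the last place below the correctly rounded 0x3555, which shows the cost of truncating the quotient before normalizing.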
For a comparison with outputs of compiler-generated float conversions, see:
https://godbolt.org/z/bK88h75oh
Next: Part 12
Floating-point further reading - Recommended reading.