Floating-point Multiplication
Floating-point values are fractions. Multiplying floating-point numbers is multiplying fractions.
This is part 10 of the onboarding floating point series. This series is intended to be used for onboarding of programmers new to the team to review a basic understanding of fixed-point and floating-point number formats, or for programmers who would like to remove some of the mystery from formats they may use everyday.
We’ll continue using the s10e5 floating-point format from floating point as fractions.
We can express s10e5 as a fraction:
Except where exp15 is zero:
TERMS
-----------------------------------------------------------------------------
| sign | x15 | m |
|-------|---------------------|---------------------------------------------|
| 1 bit | 5 bits | 10 bits |
-----------------------------------------------------------------------------
| sign | 1 bit value exactly as stored (1=negative, 0=positive) |
| x15 | 5 bit value exactly as stored as offset-15 |
| m | 10 bit value exactly as stored as unsigned integer |
| x | exponent as signed integer |
| s | sign as signed integer (-1=negative, 1=positive) |
| magnitude | unsigned value as f(x15,m) |
Multiplication
If we rearrange the fractions, we can define s10e5 floating-point as:
Except for where exp-15 is zero:
Multiplying two values (c = a*b):
And if we replace:
We can re-write the result as:
We can extract the mantissa as with addition. And like addition, at this point the fraction may need to be normalized if significand_c is too large or too small for mantissa to fit in 10 bits unsigned.
Which gives us the result in the correct form:
Except in the case where x15 is zero:
Then normalize.
uint32_t
mul(uint16_t a, uint16_t b)
{
//
// Extract components of a,b where:
// s = negative=-1, positive=0
// m = 10 bit unsigned mantissa
// x15 = 5 bit exponent in offset-15
// x15nz = (x15 != 0)?-1:0
// x = if (x15 == 0) -14
// else x15 as two's-complement
//
uint16_t a_s = (int16_t)a >> 15;
uint16_t a_m = a & 0x03ff;
uint16_t a_x15 = (a & 0x7c00) >> 10;
uint16_t a_x15nz = (int16_t)(-a_x15)>>15;
int16_t a_x = (a_x15nz & (a_x15-15)) | ((~a_x15nz) & -14);
uint16_t b_s = (int16_t)b >> 15;
uint16_t b_m = b & 0x03ff;
uint16_t b_x15 = (b & 0x7c00) >> 10;
uint16_t b_x15nz = (int16_t)(-b_x15)>>15;
int16_t b_x = (b_x15nz & (b_x15-15)) | ((~b_x15nz) & -14);
//
// Create significand of a,b where:
// if (x15 != 0) significand = 1024 + m
// else significand = m
//
uint16_t a_significand = a_m + (a_x15nz&1024);
uint16_t b_significand = b_m + (b_x15nz&1024);
//
// Multiply
// - 10 bit fixed-point multiply significands as (a*b)>>10
// - sign(a) * sign(b)
// - half round up
// - defer shift-right 10 bits until normalization
// in case left-shift is needed, so we don't lose
// the values.
//
uint16_t c_s = a_s ^ b_s;
uint32_t c_significand = a_significand * b_significand;
int16_t c_x = a_x + b_x;
//
// Normalize
//
// clz(c_significand) = (0,32)
uint32_t signz = (int32_t)(-c_significand) >> 31;
int16_t sigclz31 = __builtin_clz( c_significand );
int16_t sigclz = (signz & sigclz31)|((~signz)&32);
// reduce by exponent until 10 bits of mantissa
int16_t hi_sa = sigclz-11;
int16_t norm_x = c_x - hi_sa;
uint16_t norm_x_underflow = (norm_x+14) >> 15;
uint16_t norm_x15 = ((~norm_x_underflow) & (norm_x+15));
int16_t norm_sa = ((~norm_x_underflow) & (hi_sa-10))
| (norm_x_underflow & (c_x+4));
// if norm_sa < 0, then right shift, else left shift.
uint16_t norm_sa_sign = norm_sa >> 15;
int16_t norm_rsa = norm_sa_sign & (-norm_sa);
int16_t norm_lsa = (~norm_sa_sign) & norm_sa;
uint16_t norm_significand = (c_significand << norm_lsa) >> norm_rsa;
//
// Store
//
uint16_t c_m = norm_significand & 0x3ff;
uint16_t c = (c_s << 15) | (norm_x15 << 10) | c_m;
return c;
}
Exercise 10-1: Create a version of mul() for s23e8 floating-point. Compare results to using float instructions.
REFERENCE: See Berkely SoftFloat [github.com/ucb-bar] for implementations of common operations for different floating-point formats.
Normalization as compression
The multiplication of significands is a 10 bit fixed-point multiply. Consider the result as a 20 bit value. As covered in floating-point as compression, you can also think of the normalization process as a pattern to compress that 20 bit result into a 10 bit mantissa and change to the exponent.
| significand (20 bit) | m | change to x |
|-----------------------|------------|-------------|
| 1AAAAAAAAAAxxxxxxxxx | AAAAAAAAAA | x+9 |
| 01AAAAAAAAAAxxxxxxxx | AAAAAAAAAA | x+8 |
| 001AAAAAAAAAAxxxxxxx | AAAAAAAAAA | x+7 |
| 0001AAAAAAAAAAxxxxxx | AAAAAAAAAA | x+6 |
| 00001AAAAAAAAAAxxxxx | AAAAAAAAAA | x+5 |
| 000001AAAAAAAAAAxxxx | AAAAAAAAAA | x+4 |
| 0000001AAAAAAAAAAxxx | AAAAAAAAAA | x+3 |
| 00000001AAAAAAAAAAxx | AAAAAAAAAA | x+2 |
| 000000001AAAAAAAAAAx | AAAAAAAAAA | x+1 |
| 0000000001AAAAAAAAAA | AAAAAAAAAA | x |
| 00000000001AAAAAAAAA | AAAAAAAAA0 | x-1 |
| 000000000001AAAAAAAA | AAAAAAAA00 | x-2 |
| 0000000000001AAAAAAA | AAAAAAA000 | x-3 |
| 00000000000001AAAAAA | AAAAAA0000 | x-4 |
| 000000000000001AAAAA | AAAAA00000 | x-5 |
| 0000000000000001AAAA | AAAA000000 | x-6 |
| 00000000000000001AAA | AAA0000000 | x-7 |
| 000000000000000001AA | AA00000000 | x-8 |
| 0000000000000000001A | A000000000 | x-9 |
Conversion to float
For a comparison with outputs of compiler-generated float conversions, see:
https://godbolt.org/z/E4aderGfG
Next: Part 11
Floating-point Division - Floating-point values are fractions. Dividing floating-point numbers is dividing fractions.