#### Discover more from AltDevArts

# Floating-point Addition and Subtraction

### Floating-point values are fractions. Addition and subtracting floating-point numbers is adding and subtracting fractions.

*This is part 9 of the onboarding floating point series. This series is intended to be used for onboarding of programmers new to the team to review a basic understanding of fixed-point and floating-point number formats, or for programmers who would like to remove some of the mystery from formats they may use everyday.*

We’ll continue using the s10e5 floating-point format from floating point as fractions.

We can express s10e5 as a fraction:

Except where exp15 is zero:

```
TERMS
-----------------------------------------------------------------------------
| sign | x15 | m |
|-------|---------------------|---------------------------------------------|
| 1 bit | 5 bits | 10 bits |
-----------------------------------------------------------------------------
| sign | 1 bit value exactly as stored (1=negative, 0=positive) |
| x15 | 5 bit value exactly as stored as offset-15 |
| m | 10 bit value exactly as stored as unsigned integer |
| x | exponent as signed integer |
| s | sign as signed integer (-1=negative, 1=positive) |
| magnitude | unsigned value as f(x15,m) |
```

## Addition

If we rearrange the fraction, we can express s10e5 as:

Except for where x15 is zero:

In order to add two fractions (c = a+b), we need a common denominator.

And if we replace:

Which tells us that (assuming x_a is larger than x_b), significand_b just needs to be right-shifted by the difference between the exponents and added to significand_a as signed integers. With the result being in terms of the exponent x_a.

And we need to find the mantissa to store:

However, m_c is expected to be stored as an unsigned integer of 10 bits. So if singificand_c is too large, we need to first carry to the exponents. If it is too small, we need to first borrow from the exponent. That step is called *normalizing* the fraction.

```
uint16_t
adda( uint16_t u, uint16_t v )
{
//
// assign a,b where abs(a) > abs(b)
//
uint16_t u_magnitude = u & 0x7fff;
uint16_t v_magnitude = v & 0x7fff;
uint16_t u_gt = (int16_t)(v_magnitude-u_magnitude) >> 15;
uint16_t a = (u_gt & u) | ((~u_gt) & v);
uint16_t b = (u_gt & v) | ((~u_gt) & u);
//
// Extract components of a,b where:
// s = negative=-1, positive=0
// m = 10 bit unsigned mantissa
// x15 = 5 bit exponent in offset-15
// x15nz = (x15 != 0)?-1:0
// x = if (x15 == 0) -14
// else x15 as two's-complement
//
uint16_t a_s = (int16_t)a >> 15;
uint16_t a_m = a & 0x03ff;
uint16_t a_x15 = (a & 0x7c00) >> 10;
uint16_t a_x15nz = (int16_t)(-a_x15)>>15;
int16_t a_x = (a_x15nz & (a_x15-15)) | ((~a_x15nz) & -14);
uint16_t b_s = (int16_t)b >> 15;
uint16_t b_m = b & 0x03ff;
uint16_t b_x15 = (b & 0x7c00) >> 10;
uint16_t b_x15nz = (int16_t)(-b_x15)>>15;
int16_t b_x = (b_x15nz & (b_x15-15)) | ((~b_x15nz) & -14);
//
// Create significand of a,b where:
// if (x15 != 0) significand = (s?-1:1) * (1024 + m)
// else significand = (s?-1:1) * m
//
int16_t a_significand = ((a_m + (a_x15nz&1024)) + a_s)^a_s;
int16_t b_significand = ((b_m + (b_x15nz&1024)) + b_s)^b_s;
// Sum
// - shift right b_significand by -(b_x-a_x)
// - half round up
int16_t b_significand_sa = a_x - b_x;
int16_t b_significand_round = 1 << (b_significand_sa-1);
int16_t c_significand = a_significand + ((b_significand + b_significand_round) >> b_significand_sa);
//
// Extract sign, magnitude from c_significand where
// s = negative=-1, positive=0
// usignificand = abs(significand)
//
uint16_t c_s = (int16_t)c_significand >> 15;
uint16_t c_usignificand = (c_significand ^ c_s) - c_s;
//
// Check for carry or borrow
// 1. Check top bit of c_unisignificand
// - 0x400 = 1024. c_x = c_a
// - Each bit higher than 0x400, carry to c_x.
// - Each bit lower than 0x400, borrow from c_x.
// 2. c_m = c_significand - 1024
//
int16_t norm_sa = __builtin_clz( c_usignificand )-21;
uint16_t c_x = a_x - norm_sa;
uint16_t norm_sa_s = (int16_t)norm_sa >> 15;
int16_t norm_rsa = norm_sa_s & (-norm_sa);
int16_t norm_lsa = (~norm_sa_s) & norm_sa;
uint16_t c_m = ((c_usignificand << norm_lsa) >> norm_rsa)&0x03ff;
//
// Store
//
uint16_t c_x15 = c_x + 15;
uint16_t c = (c_s << 15) | ( c_x15 << 10 ) | c_m;
return c;
}
```

## Subtraction

Subtraction can be implemented as addition:

```
uint16_t
sub( uint16_t a, uint16_t b )
{
return add(a,b^0x8000);
}
```

## Using available float instructions

Since half is a s10e5 floating-point format, instructions which work with half can be used directly.

For instance on x64 to convert from half (s10e5 floating-point) to float (s23e8 floating-point):

```
uint16_t float_to_half( float x )
{
return _cvtss_sh(x, 0);
}
float half_to_float( uint16_t d )
{
return _cvtsh_ss(d);
}
```

**REFERENCE**: Details About Intrinsics for Half Floats [intel.com]

So alternatively, you can use float instructions available in hardware.

```
uint16_t
add( uint16_t a, uint16_t b )
{
float fa = _cvtsh_ss(a);
float fb = _cvtsh_ss(b);
float fc = fa + fb;
uint16_t c = _cvtss_sh(fc, 0);
}
```

**REFERENCE:** See half.h [github.com/AcademySoftwareFoundation] from OpenEXR for an example of using float for half operations.

Also, for C, in GCC (v13+) and Clang (v15+), _Float16 is a built-in type and equivalent to half and does operations in float automatically when available. So the above is also equivalent to:

```
_Float16
add( _Float16 a, _Float16 b)
{
return a+b;
}
```

**REFERENCE:** For more details on GCC support for half, see:6.13 Half-Precision Floating Point [gcc.gnu.org]

For a comparison of the compiler outputs of these variations, see: https://godbolt.org/z/9raj6nfos

**Exercise 9-1:**` Create a version of abs() for s23e8 floating-point.`

**Exercise 9-2**: Create a version of add() for s23e8 floating-point. Compare results to using float instructions.

## Next: Part 10

Floating-point Multiplication - Floating-point values are fractions. Multiplying floating-point numbers is multiplying fractions.