Adding Floating Point Numbers
Introduction
We'll do addition using the one byte
floating point representation discussed in the other class notes
(one sign bit, four exponent bits, three fraction bits).
IEEE 754 single precision has so many bits to work with that
it's easier to explain how floating point addition works
using a small float representation.
Addition is simple. Suppose you want to add two floating
point numbers, X and Y.
For the sake of argument, assume the exponent of Y is less
than or equal to the exponent of X. Let the exponent
of Y be y and let the exponent of X be x.
Here's how to add floating point numbers.
- First, convert the two representations to scientific notation.
Thus, we explicitly represent the hidden 1.
- In order to add, we need the exponents of the two numbers to
be the same. We do this by rewriting Y. This will
result in Y no longer being normalized, but its value is equal to
that of the normalized Y.
Add x - y to Y's exponent. Shift the radix
point of Y's mantissa (significand) left by x -
y to compensate for the change in exponent.
- Add the two mantissas of X and the adjusted Y
together.
- If the sum in the previous step does not have exactly one bit,
of value 1, left of the radix point, then adjust the radix
point and exponent until it does.
- Convert back to the one byte floating point representation.
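The steps above can be sketched in Python for two non-negative, normalized numbers. This is only an illustration, not how hardware does it: the function name `add_positive` and the (exponent field, fraction field) pair layout are made up for this sketch, extra bits are truncated rather than rounded, and overflow is ignored.

```python
FRAC_BITS = 3  # fraction bits in the one byte format

def add_positive(x, y):
    """Add two non-negative, normalized one byte floats.

    Each operand is a hypothetical (exponent_field, fraction_field) pair.
    Extra bits are truncated; overflow of the exponent is not handled.
    """
    (ex, fx), (ey, fy) = x, y
    if ey > ex:  # make X the operand with the larger exponent
        (ex, fx), (ey, fy) = (ey, fy), (ex, fx)

    # Step 1: make the hidden 1 explicit (1.fff stored as the integer 1fff).
    mx = (1 << FRAC_BITS) | fx
    my = (1 << FRAC_BITS) | fy

    # Step 2: align. Giving X `shift` extra zero fraction bits and leaving
    # Y's bits alone is the same as moving Y's radix point left by `shift`.
    shift = ex - ey
    mx <<= shift
    frac_bits = FRAC_BITS + shift  # both now have this many fraction bits

    # Step 3: add the significands; the exponent is that of X.
    total = mx + my
    exponent = ex

    # Step 4: renormalize if there are two bits left of the radix point.
    if total >> (frac_bits + 1):
        frac_bits += 1
        exponent += 1

    # Step 5: keep the top FRAC_BITS fraction bits, truncating the rest.
    fraction = (total >> (frac_bits - FRAC_BITS)) & ((1 << FRAC_BITS) - 1)
    return exponent, fraction

# 1.110 x 2^2 plus 1.000 x 2^0, using the stored exponent fields 1001 and 0111
result = add_positive((0b1001, 0b110), (0b0111, 0b000))
print(format(result[0], "04b"), format(result[1], "03b"))  # 1010 000
```

The significands are kept as plain integers so that the radix point is tracked only by `frac_bits`, which makes the shifts in steps 2 and 4 easy to follow.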
Example 1
Let's add the following two numbers:
| Variable | sign | exponent | fraction |
| X | 0 | 1001 | 110 |
| Y | 0 | 0111 | 000 |
Here are the steps again:
- First, convert the two representations to
scientific notation. Thus, we explicitly represent the hidden
1.
In normalized scientific notation, X is 1.110 x
2^2, and Y is 1.000 x 2^0.
- In order to add, we need the exponents of the
two numbers to be the same. We do this by rewriting Y.
This will result in Y no longer being normalized, but its value is
equal to that of the normalized Y.
Add x - y to Y's exponent.
Shift the radix point of
Y's mantissa (significand) left by x - y to compensate
for the change in exponent.
The difference of the exponents is 2. So, add 2 to Y's
exponent, and shift the radix point left by 2. This results in
0.0100 x 2^2. This is still equivalent to
the old value of Y. Call this readjusted value Y'.
- Add the two mantissas of X and the
adjusted Y' together.
We add 1.110_two to 0.010_two.
The sum is 10.000_two. The exponent is still the
exponent of X, which is 2.
- If the sum in the previous step does not
have exactly one bit, of value 1, left of the radix point, then
adjust the radix point and exponent until it does.
In this case, the sum, 10.000_two, has two bits
left of the radix point. We need to move the radix point left
by 1, and increase the exponent by 1 to compensate.
This results in 1.000 x 2^3.
- Convert back to the one byte floating
point representation.
| Sum | sign | exponent | fraction |
| X + Y | 0 | 1010 | 000 |
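As a sanity check on Example 1, the bit patterns can be decoded to ordinary numbers. The sketch below assumes a bias of 7, which is inferred from the example itself (exponent field 1001 stands for 2^2); the helper name `decode` is made up for illustration and only handles normalized values.

```python
BIAS = 7  # assumed: inferred from the example, where field 1001 means exponent 2

def decode(sign, exponent_field, fraction_field):
    """Value of a normalized one byte float (hypothetical helper)."""
    significand = 1 + fraction_field / 8          # hidden 1 plus three fraction bits
    value = significand * 2.0 ** (exponent_field - BIAS)
    return -value if sign else value

x = decode(0, 0b1001, 0b110)      # 1.110 x 2^2
y = decode(0, 0b0111, 0b000)      # 1.000 x 2^0
total = decode(0, 0b1010, 0b000)  # 1.000 x 2^3

print(x, y, total)  # 7.0 1.0 8.0
```

Since 7 + 1 = 8 exactly, no bits are lost in this example.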
Example 2
Let's add the following two numbers:
| Variable | sign | exponent | fraction |
| X | 0 | 1001 | 110 |
| Y | 0 | 0110 | 110 |
Here are the steps again:
- First, convert the two representations to
scientific notation. Thus, we explicitly represent the hidden
1.
In normalized scientific notation, X is 1.110 x
2^2, and Y is 1.110 x 2^-1.
- In order to add, we need the exponents of the
two numbers to be the same. We do this by rewriting Y.
This will result in Y no longer being normalized, but its value is
equal to that of the normalized Y.
Add x - y to Y's exponent.
Shift the radix point of
Y's mantissa (significand) left by x - y to compensate
for the change in exponent.
The difference of the exponents is 3. So, add 3 to Y's
exponent, and shift the radix point of Y left by 3. This
results in 0.00111 x 2^2. This is still equivalent to
the old value of Y. Call this readjusted value Y'.
- Add the two mantissas of X and the
adjusted Y' together.
We add 1.110_two to 0.00111_two.
The sum is 1.11111_two. The exponent is still the
exponent of X, which is 2.
- If the sum in the previous step does not
have exactly one bit, of value 1, left of the radix point, then
adjust the radix point and exponent until it does.
In this case, the sum, 1.11111_two, has a single
1 left of the radix point. So, the sum is normalized. We do not
need to adjust anything yet.
So the result keeps the exponent of X: 1.11111 x 2^2.
- Convert back to the one byte floating
point representation.
We only have 3 bits to represent the fraction. However, there
are 5 fraction bits in our answer. We should
round, and real floating point hardware would do rounding.
However, for simplicity, we're going to truncate the additional
two bits. After truncating, we get 1.111 x 2^2.
We convert this back to floating point.
| Sum | sign | exponent | fraction |
| X + Y | 0 | 1001 | 111 |
This example illustrates what happens when the exponents are
far apart: the low-order bits of the smaller number are shifted
out of range and lost. In fact, if the exponents differ by 4 or
more, then (with truncation) you are effectively adding 0 to the
larger of the two numbers.
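The claim about a difference of 4 can be checked by brute force. The sketch below is an illustration only: `add_truncating` is a made-up helper that works on stored exponent and fraction fields of non-negative, normalized numbers, truncates extra bits, and requires the first exponent to be at least as large as the second. The loop skips exponent fields 0000 and 1111 on the assumption that they may be reserved for special values.

```python
FRAC_BITS = 3  # fraction bits in the one byte format

def add_truncating(ex, fx, ey, fy):
    """Add 1.fx * 2^ex and 1.fy * 2^ey (ex >= ey), truncating extra bits."""
    shift = ex - ey
    # X gets `shift` extra zero fraction bits, which lines it up with Y.
    total = (((1 << FRAC_BITS) | fx) << shift) + ((1 << FRAC_BITS) | fy)
    frac_bits = FRAC_BITS + shift
    if total >> (frac_bits + 1):  # two bits left of the radix point
        frac_bits += 1
        ex += 1
    fraction = (total >> (frac_bits - FRAC_BITS)) & ((1 << FRAC_BITS) - 1)
    return ex, fraction

# With exponents 4 apart, every X comes back unchanged, whatever Y's fraction is.
absorbed = all(
    add_truncating(ex, fx, ex - 4, fy) == (ex, fx)
    for ex in range(5, 15)
    for fx in range(8)
    for fy in range(8)
)
print(absorbed)  # True

# With exponents only 3 apart, Y can still change the result:
# 1.111 x 2^2 plus 1.000 x 2^-1 carries all the way up to 1.000 x 2^3.
print(add_truncating(0b1001, 0b111, 0b0110, 0b000))  # (10, 0)
```

With rounding instead of truncation, the cutoff would be different, since a bit just past the last kept position can still round the result up.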
Negative Values
So far, we've only considered adding two non-negative numbers.
What happens with negative values?
If you're doing it on paper, then you proceed with the sum as
usual. Just do normal addition or subtraction.
If it's in hardware, you would probably convert the mantissas to
two's complement, and perform the addition, while keeping track of the
radix point (read about fixed point
representation).
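Here is a rough sketch of addition with signs. It uses Python's signed integers as a stand-in for the two's complement significands mentioned above, so it shows the idea rather than any real hardware design. The function name `add_signed` and the (sign, exponent field, fraction field) tuples are made up for illustration; results are truncated toward zero, an exact zero sum returns `None` because the encoding of zero is not covered here, and exponent overflow or underflow is not handled.

```python
FRAC_BITS = 3  # fraction bits in the one byte format

def add_signed(x, y):
    """Add two normalized one byte floats given as (sign, exp_field, frac_field)."""
    (sx, ex, fx), (sy, ey, fy) = x, y
    if ey > ex:  # make X the operand with the larger exponent
        (sx, ex, fx), (sy, ey, fy) = (sy, ey, fy), (sx, ex, fx)

    # Align as in the unsigned case, then attach the signs.
    shift = ex - ey
    mx = ((1 << FRAC_BITS) | fx) << shift
    my = (1 << FRAC_BITS) | fy
    total = (-mx if sx else mx) + (-my if sy else my)
    if total == 0:
        return None  # zero needs its own encoding, not covered in this sketch

    sign = 1 if total < 0 else 0
    magnitude = abs(total)
    frac_bits = FRAC_BITS + shift  # fraction bits in `magnitude`

    # Normalize: the leading 1 may sit left OR right of where it should be,
    # because subtracting can cancel high-order bits.
    lead = magnitude.bit_length() - 1
    exponent = ex + (lead - frac_bits)

    # Keep FRAC_BITS bits after the leading 1, truncating anything beyond.
    mask = (1 << FRAC_BITS) - 1
    if lead >= FRAC_BITS:
        fraction = (magnitude >> (lead - FRAC_BITS)) & mask
    else:
        fraction = (magnitude << (FRAC_BITS - lead)) & mask
    return sign, exponent, fraction

# 1.110 x 2^2 plus -1.000 x 2^0 gives 1.100 x 2^2
print(add_signed((0, 0b1001, 0b110), (1, 0b0111, 0b000)))  # (0, 9, 4)
# 1.000 x 2^2 plus -1.110 x 2^2 gives -1.100 x 2^1
print(add_signed((0, 0b1001, 0b000), (1, 0b1001, 0b110)))  # (1, 8, 4)
```

The second call shows the new wrinkle that signs introduce: after cancellation the radix point may have to move right, lowering the exponent, rather than left.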
Bias
Does the bias representation help us in floating point addition?
The main difficulty lies in computing the difference of the exponents.
Still, that's not so bad: both exponents carry the same bias, so it
cancels, and we can just do unsigned subtraction on the stored fields.
For the most part, the bias doesn't pose too many problems.
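A small sketch of why unsigned subtraction is enough. The bias of 7 is an assumption inferred from the examples (exponent field 1001 stands for 2^2), and `alignment_shift` is a made-up name for illustration.

```python
BIAS = 7  # assumed: inferred from the examples, where field 1001 means exponent 2

def alignment_shift(field_x, field_y):
    """Places to shift the smaller operand, computed from the stored fields only."""
    # Subtract the smaller field from the larger so the result is never negative.
    return field_x - field_y if field_x >= field_y else field_y - field_x

# The bias never has to be removed: (x + BIAS) - (y + BIAS) equals x - y.
same_as_true_exponents = all(
    alignment_shift(a, b) == abs((a - BIAS) - (b - BIAS))
    for a in range(16)
    for b in range(16)
)
print(same_as_true_exponents)            # True
print(alignment_shift(0b1001, 0b0110))   # 3
```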
Overflow/Underflow
It's possible for a result to overflow (a result that's too large
to be represented) or underflow (smaller in magnitude than the
smallest denormal, but not zero). Real hardware has rules to handle
this. We won't worry about it much, except to acknowledge that it
can happen.
Summary
Adding two floating point values isn't so difficult. It basically
consists of raising the exponent of the number with the smaller
exponent (call this Y) to that of the larger (call it X), and shifting the
radix point of the mantissa of Y left to compensate.
Once the addition is done, we may have to renormalize and
truncate bits if there are too many bits to be represented.
If the difference in the exponents is too great, then adding
X + Y effectively results in X.
Real floating point hardware uses more sophisticated means to round
the summed result. We take the simplification of truncating bits if
there are more bits than can be represented.