Ints are also contiguous. That is, between the minimum and maximum int, there are no missing values.
To summarize: integers have agreed-upon representations (UB for unsigned, 2C for signed), and every value between the minimum and maximum int is representable.
What issues come up when trying to devise a data representation for floating point numbers? It turns out these issues are more complicated than those for representing integers. While most people agree that UB and 2C are the ways to represent unsigned and signed integers, representing real numbers has traditionally been more problematic.
In particular, depending on which company manufactured the hardware, there were different ways to represent real numbers, and to manipulate them. There's no particularly obvious choice of how to represent real numbers. In the mid 1980's, the need for uniform treatment of real numbers (called floating point numbers) led to the IEEE 754 standard.
Standards are often developed to give consistent behavior. For example, when C was first being developed, what was considered a valid C program very much depended on the compiler. A program that compiled on one C compiler might not compile on another. Effectively, many different flavors of C were being created, and the need to have a standard definition of a language was seen as important.
Similarly, for purposes of agreeing on the results of operations performed on floating point numbers, there was a desire to standardize the way floating point numbers were represented.
Before we get to such issues, let's think about what restrictions will have to be imposed on floating point numbers.
Definition: Precision refers to the number of significant digits it takes to represent a number. Roughly speaking, it determines how finely you can distinguish between two measurements. A number can be precise without being accurate. For example, if you say someone's height is 2.0002 meters, that is precise (because it is precise to about 1/10000 meters). However, it may be inaccurate, because the person's actual height may be significantly different. Definition: Accuracy is how close a measurement is to its correct value.
In science, precision is usually defined by the number of significant digits. This is a different kind of precision than you're probably used to. For example, if you have a scale, you might be lucky to have precision to one pound. That is, the error is +/- 1/2 pound. Most people think of precision as the smallest measurement you can make.
In science, it's different. It's about the number of significant digits. For example, 1.23 X 10^10 has the same precision as 1.23 X 10^-10, even though the second quantity is much, much smaller than the first. It may be unusual, but that's how we'll define precision.
When we choose to represent a number, it's easier to handle precision than accuracy. Accuracy refers to how close a measured value is to its actual value. There's not much a computer can do directly to determine accuracy (I suppose, with sufficient data, one could use statistical methods to estimate it).
We can't do much about the accuracy of the recorded value (without additional information). However, the hardware does perform mathematical operations reasonably accurately. The reason the computations aren't perfectly accurate is that they would require infinitely precise math, and that would require an infinite number of bits, which a computer doesn't have.
Since floating point numbers cannot be infinitely precise, there's always a possibility of error when performing computations; floating point values are often only approximations of the actual numbers.
In the field of computer science, numerical analysis is concerned with ways of performing scientific computations accurately on a computer. In particular, there are ways to minimize the effect of "round-off errors", errors that are due to the approximate nature of floating point representation.
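To see a round-off error in action, here's a tiny C sketch (any compiler with IEEE 754 doubles will show the same behavior) that adds 0.1 and 0.2, neither of which has an exact binary representation:

```c
#include <stdio.h>

int main(void) {
    /* Neither 0.1 nor 0.2 has an exact binary representation,
       so the sum picks up a small round-off error. */
    double a = 0.1, b = 0.2;
    printf("0.1 + 0.2        = %.17f\n", a + b);   /* 0.30000000000000004 */
    printf("0.1 + 0.2 == 0.3 ? %s\n", (a + b == 0.3) ? "yes" : "no");
    return 0;
}
```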
+/- S X 10^exp
where S is the significand or the mantissa, and exp is the exponent. "10" is the base.
In scientific notation, there can be more than one way of writing the same value. For example, 6.02 X 10^23 is the same as 60.2 X 10^22, which is the same as 602 X 10^21.
For any number represented in this way, there are an infinite number of other ways to represent it (by moving the decimal point and adjusting the exponent).
It might be nice to have a single, consistent way of doing this, i.e., a canonical or standard form. And there is such a way.
+/- D.FFFF X 10^exp
You can write the significand as D.FFF... where
1 <= D <= 9, and FFF... represents the
digits after the decimal point. If you follow this restriction
on D, then there's only one way to write a number
in scientific notation. The obvious exception to this rule
is representing 0, which is a special case.
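As a rough illustration (not part of any standard; the function name to_canonical is made up here), this C sketch normalizes a positive value into canonical decimal scientific notation by choosing the exponent and dividing it out:

```c
#include <stdio.h>
#include <math.h>

/* Normalize a positive value into canonical decimal scientific form:
   D.FFF... X 10^exp with 1 <= D <= 9.  (0 is the special case and is skipped.) */
static void to_canonical(double x) {
    int exp = (int)floor(log10(x));          /* choose the exponent */
    double significand = x / pow(10.0, exp);  /* what's left is D.FFF... */
    printf("%g = %.4f X 10^%d\n", x, significand, exp);
}

int main(void) {
    to_canonical(602.0e21);   /* prints 6.0200 X 10^23 */
    to_canonical(0.00025);    /* prints 2.5000 X 10^-4 */
    return 0;
}
```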
You can generalize this formula for other bases than base 10. For
base K (where K > 1), you write the canonical scientific
notation form as:
+/- D.FFFF X K^exp
where 1 <= D <= K - 1. The F's that appear in the
fraction must follow the rule: 0 <= F <= K - 1.
The number of significant digits, which is also the precision, is the number of digits in the significand: D plus the digits after the radix point. We call it the radix point rather than the decimal point because decimal implies base 10, and we could be talking about any other base.
+/- D.FFFF X 2^exp
This creates an interesting constraint on D. In particular, 1 <= D <= 1, which means D is forced to be 1. We'll use this fact later on.
There is a standard for single precision (32 bits) and double precision (64 bits). We'll primarily focus on single precision.
| Sign | Exponent | Fraction |
| 1 | 8 bits | 23 bits |
| b31 | b30-23 | b22-0 |
| X | XXXX XXXX | XXXX XXXX XXXX XXXX XXXX XXX |
An IEEE 754 single precision number is divided into three parts. The three parts match the three parts of a number written in canonical binary scientific notation.
This is b31. If this value is 1, the number is negative. Otherwise, it's non-negative.
The exponent is excess/bias 127. Normally, one would expect the excess/bias to be half the number of representations. In this case, the number of representations is 256, and half of that is 128. Nevertheless, the excess is 127. Thus, the range of possible exponents is -127 <= exp <= 128.
Normally, you would represent the significand (also called the mantissa). This would mean representing D.FFFF.... However, recall that for base 2, D = 1. Since D is always 1, there's no need to represent the 1. You only need to represent the bits after the radix point.
Thus, the "1" left of the radix point is NOT explicitly represented. We call this the hidden one.
There is something of a fallacy when we say that an IEEE single precision number has 24 bits of precision. It's very much like saying that a calculator with 15 digits has 15 digits of precision. It's true that it may represent all numbers with 15 digits, but the question is whether the value is really that precise.
For example, suppose a measured number has only 3 digits of precision. There's no way to indicate this on a calculator. The calculator is prepared to show 15 digits of precision, even though that's more precision than the measurement actually has.
The same can be said about representing floating point numbers. It has 24 bits of precision, but that may not accurately represent the true number of significant bits. Unfortunately, that's the best computers can do. One could store additional information to determine exactly how many bits really are significant, but this is not usually done.
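Here's one way to see the 24-bit limit concretely (a minimal C sketch, assuming IEEE 754 single precision): 2^24 + 1 needs a 25th significant bit, so it cannot survive being stored in a float.

```c
#include <stdio.h>

int main(void) {
    /* 2^24 = 16777216 is the last point where every integer is exactly
       representable in a float; 16777217 needs a 25th significant bit. */
    float a = 16777216.0f;   /* 2^24 */
    float b = 16777217.0f;   /* 2^24 + 1, rounds back to 2^24 */
    printf("a = %.1f\n", a);
    printf("b = %.1f\n", b);
    printf("a == b ? %s\n", (a == b) ? "yes" : "no");   /* yes */
    return 0;
}
```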
You might wonder why they do this. Here's one reason. Given the representation, as is, there would be no way to represent 0. If all the bits were 0, this would be the number 1.0 X 2^-127. Although this is a small number, it's not 0.
Thus, we designate the bitstring containing all 0's to be zero.
The following is a list of categories of floating point numbers in IEEE 754.
| Category | Sign Bit | Exponent | Fraction |
| Zero | Anything | 0^8 | 0^23 |
| Infinity | Anything | 1^8 | 0^23 |
| NaN | Anything | 1^8 | Not 0^23 |
| Denormalized numbers | Anything | 0^8 | Not 0^23 |
| Normalized numbers | Anything | Not 0^8, nor 1^8 | Anything |
Again, we write 0^8 to mean 0, repeated 8 times, i.e. 0000 0000.
Notice that there is a positive and negative 0.
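A quick C sketch (again assuming IEEE 754 floats) shows that +0.0 and -0.0 have different bit patterns but compare equal:

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <math.h>

int main(void) {
    float pos = 0.0f, neg = -0.0f;
    uint32_t pbits, nbits;
    memcpy(&pbits, &pos, sizeof pbits);
    memcpy(&nbits, &neg, sizeof nbits);

    printf("+0.0 bits = 0x%08X\n", pbits);   /* 0x00000000 */
    printf("-0.0 bits = 0x%08X\n", nbits);   /* 0x80000000: only the sign bit differs */
    printf("+0.0 == -0.0 ? %s\n", (pos == neg) ? "yes" : "no");   /* yes */
    printf("signbit(-0.0) = %d\n", signbit(neg) ? 1 : 0);          /* 1 */
    return 0;
}
```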
A bitstring with 32 zeros would be 1.0 X 2^-127 (if we interpreted it as a normalized number). That's a pretty small value. However, we could represent even smaller numbers if we do the following when the exponent is 0^8.
Recall that the exponent is written with a bias of 127. So, you would expect that if the exponent is 0^8, this bitstring would represent the exponent -127, and not -126. However, there's a good reason why it's -126. We'll explain why in a moment. For now, let's accept the fact that the exponent is -126 whenever the exponent bitstring is 0^8.
S Exp Fraction
- --------- ----------------------------
0 0000 0000 1111 1111 1111 1111 1111 111
This bitstring maps to the number 0.1^23 X 2^-126 (that is, 23 ones after the radix point). This number has 23 bits of precision, since there are 23 1's after the radix point.
S Exp Fraction
- --------- ----------------------------
0 0000 0000 0000 0000 0000 0000 0000 001
This bitstring pattern maps to the number 0.0^22 1 X 2^-126 (22 zeroes, then a 1, after the radix point), which is 1.0 X 2^-149. This number has 1 bit of precision. The 22 zeroes are merely placeholders and do not affect the number of bits of precision.
You may not believe that this number has only 1 bit of precision, but it does. Consider the decimal number 123. This number has 3 digits of precision. Consider 00123. This also has 3 digits of precision. The leading 0's do not affect the number of digits of precision. Similarly, if you have 0.000123, the zeroes are merely to place the 123 correctly, but are not significant digits. However, 0.01230 has 4 significant digits, because the rightmost 0 actually adds to the precision.
Thus, for our example, we have 22 zeroes followed by a 1 after the radix point, and the 22 zeroes have nothing to do with the number of significant bits.
By using denormalized numbers, we were able to make the smallest positive float to be 1.0 X 2^-149, instead of 1.0 X 2^-127, which we would have had if the number had been normalized.
Thus, we were able to go 22 orders of magnitude smaller, by sacrificing bits of precision.
To answer this question, we need to look at the smallest
positive normalized number. This occurs with the following bitstring
pattern
S Exp Fraction
- --------- ----------------------------
0 0000 0001 0000 0000 0000 0000 0000 000
This bitstring maps to 1.0 X 2^-126.
Let's look at the two choices for the largest positive denormalized number.
Also notice that the number with the -126 as exponent is larger than the number that has -127 as exponent (both have the same mantissa/significand, and -126 is larger than -127).
Thus, by picking -126 instead of -127, the gap between the largest denormalized number and the smallest normalized number is smaller. Is this a necessary feature? Is it really necessary to make that gap small? Maybe not, but at least there's some rationale behind the decision.
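If you want to see these boundary values on a real machine, here's a C sketch (the helper from_bits is made up for illustration; it assumes IEEE 754 single precision) that builds the smallest denormalized, largest denormalized, and smallest normalized numbers directly from their bit patterns:

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit pattern as an IEEE 754 single precision float. */
static float from_bits(uint32_t bits) {
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

int main(void) {
    /* exponent 0^8, fraction 0^22 1 : smallest positive denormalized, 1.0 X 2^-149 */
    float smallest_denorm = from_bits(0x00000001);
    /* exponent 0^8, fraction 1^23   : largest denormalized, just under 2^-126 */
    float largest_denorm  = from_bits(0x007FFFFF);
    /* exponent 0000 0001, fraction 0^23 : smallest normalized, 1.0 X 2^-126 */
    float smallest_norm   = from_bits(0x00800000);

    printf("smallest denormalized = %.9g\n", smallest_denorm);  /* about 1.4e-45  */
    printf("largest  denormalized = %.9g\n", largest_denorm);
    printf("smallest normalized   = %.9g\n", smallest_norm);    /* about 1.18e-38 */
    return 0;
}
```

Notice how small the gap is between the largest denormalized number and the smallest normalized number, which is exactly the point of choosing -126.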
Thus, 10 in base 10 is 1010 in base 2.
Thus, .25 in base 10 is .01 in base 2.
This results in 1010 + 0.01, which is 1010.01.
This is 1010.01 X 2^0, which is 1.01001 X 2^3.
S Exp Fraction
- --------- ----------------------------
0 1000 0010 0100 1000 0000 0000 0000 000
Notice that the hidden "1" is not represented in the fraction.
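You can check this hand conversion on a machine (assuming it uses IEEE 754 single precision): the bits of 10.25f, printed in hex, should be 0x41240000, which is exactly the pattern above.

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    /* Expected pattern from above: sign 0, exponent 1000 0010, fraction 0100 1000 ... */
    float f = 10.25f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    printf("10.25f = 0x%08X\n", bits);   /* should print 0x41240000 */
    return 0;
}
```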
First, we take advantage of the following fact: 1000 0000 maps to exponent +1 in excess 127. If this were excess 128, it would map to 0. It would be nice, in fact, if it were excess 128, because then we would write out the positive number in unsigned binary, then flip the most significant bit from 0 to 1, and we'd be done. (Verify this for yourself with an example or two).
However, excess 127 and excess 128 are only off-by-one, so it's not too hard to adjust the algorithm appropriately. Here's what you do to convert positive exponents to excess 127.
Before you memorize this algorithm, you should really try to understand where it comes from.
This is where it comes from. Consider a positive exponent, x, represented in base 10. To convert it to excess 127, we add 127. Thus, we have x + 127. We can rewrite this as: (x - 1) + 128. This is simple algebra.
128 is 1000 0000 in binary. And we have x - 1, which is where the subtraction of 1 occurs. As long as x - 1 is smaller than 128 (and it will be, since the maximum value of x is 128), then it's easy to add this binary number to 1000 0000.
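Here's the trick written as a tiny C helper (the name to_excess127 is made up; it only applies to positive exponents, where x - 1 is below 128):

```c
#include <stdio.h>

/* Convert a positive exponent x to its excess-127 bit pattern using
   (x - 1) + 128: subtract 1, then set the most significant bit. */
static unsigned to_excess127(int x) {
    return (unsigned)(x - 1) | 0x80;   /* 0x80 is 1000 0000 */
}

int main(void) {
    /* Exponent +3 (as in 1.01001 X 2^3) should give 1000 0010, i.e. 130. */
    printf("exp +3 -> %u\n", to_excess127(3));   /* 130 */
    printf("exp +1 -> %u\n", to_excess127(1));   /* 128, i.e. 1000 0000 */
    return 0;
}
```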
Remember, memorization is a poor second to understanding. It's better to understand why something works than to memorize an answer. However, it's even better to understand why something works and remember it too.
You'd get stuck trying to convert the exponent, because you'd discover the number is negative, and the number has to be non-negative when converting from base 10 (after adding the bias) to UB.
You can save yourself this hassle if you recall that the smallest, positive normalized number has an exponent of -126, and that the exponent we have is -128, which is less than -126. If you've written the number in binary scientific notation (in canonical form), and the exponent is less than -126, then you have a denormalized number.
Since -128 < -126, the number we're trying to represent is a denormalized number.
The rules for representing denormalized numbers are different from the rules for normalized numbers.
To represent a denormalized number, you need to shift the radix point so that the exponent is -126. In this case, the exponent must be increased by 2 from -128 to -126, so the radix point must shift left by 2.
This results in: 0.011 X 2^-126
At this point, it's easy to convert. The exponent bitstring is 0^8. You copy the bits after the radix point into the fraction. The sign bit is 0.
S Exp Fraction
- --------- ----------------------------
0 0000 0000 0110 0000 0000 0000 0000 000
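As a sanity check (assuming IEEE 754 single precision), the bit pattern above is 0x00300000, and reinterpreting it as a float should give 0.011 X 2^-126 = 0.375 X 2^-126, roughly 4.4 X 10^-39:

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    /* Bit pattern from above: sign 0, exponent 0^8, fraction 0110 0000 ... */
    uint32_t bits = 0x00300000;
    float f;
    memcpy(&f, &bits, sizeof f);
    /* 0.011 X 2^-126 = 0.375 X 2^-126, roughly 4.4e-39 */
    printf("0x%08X = %g\n", bits, f);
    return 0;
}
```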
We could add one more bit to the fraction. At least, that would cause the least amount of disruption. Would that one additional bit help us in any meaningful way? On the one hand, it allows us to represent twice as many floating point numbers. On the other, it does so by adding only a single bit of precision.
Perhaps through this kind of reasoning, the developers of the IEEE 754 standard felt that having an unsigned float did not make sense, and thus there is no unsigned float in IEEE 754 floating point.
Why do it in that order?
Here's a plausible explanation. Suppose you want to compare two dates. The date includes month, day, and year. You use two digits for the month, two for the day, and four for the year. Suppose you want to store the date as a string, and want to use string comparison to compare dates.
Which order should you pick?
You should pick the year, the month, and the day. Why? Because when you are doing string comparison, you compare left to right, and you want the most significant quantity to the left. That's the year.
When you look at a floating point number, the exponent is the most important, so it's to the left of the fraction.
You can also do comparisons because the exponent is written in bias notation (you could use two's complement, as well, although it would make the comparison only a little more complicated).
So why is the sign bit to the far left? Perhaps the answer is because that's where it appears in signed int representation. It may be unusual to have the sign in any other position.
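Here's a small C sketch illustrating the payoff of this ordering (assuming IEEE 754 floats; the helper bits_of is made up): for non-negative floats, comparing the raw bit patterns as unsigned integers gives the same answer as comparing the floats themselves, precisely because the biased exponent sits in the more significant bits.

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Return a float's raw bit pattern. */
static uint32_t bits_of(float f) {
    uint32_t b;
    memcpy(&b, &f, sizeof b);
    return b;
}

int main(void) {
    /* For non-negative floats, unsigned comparison of the bit patterns
       orders the values the same way as floating point comparison. */
    float a = 2.5f, b = 10.25f;
    printf("a < b            : %s\n", (a < b) ? "yes" : "no");
    printf("bits(a) < bits(b): %s\n", (bits_of(a) < bits_of(b)) ? "yes" : "no");
    return 0;
}
```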