Thursday, September 03, 2009 1:18 AM,
Since it is clear that many of you need more explanation with respect to floating point numbers, I am writing this. Hopefully it will help. Please try to read it before Friday's discussion section so you can ask the TA any questions on points that are still confusing.

First, the reason we need the floating point data type: Suppose we have 32 bits to represent a number. With 2's complement, we can represent numbers as large as 2^31 - 1. What if we want to represent much larger numbers, but we do not care about doing so with the same degree of precision? That is, what if we are willing to sacrifice 8 of the bits previously used for precision in order to be able to express much, much larger numbers? While we are at it, if x is a number greater than 1, then 1/x is less than 1. Wouldn't it also be great to express numbers much less than 1?

So, we define the 32-bit data type as follows:

Bit 31: the sign bit. 0 = +, 1 = minus.
Bits [30:23]: the exponent bits.
Bits [22:0]: the significance bits, often called the fraction bits.

AND, we can express values as +/- 1.bbbbbbb...b * 2^(some exponent)

With bits [30:23] we can represent 256 different values. Let them be the values from 0 (00000000) to 255 (11111111), interpreted as unsigned numbers. They can each represent a different exponent. That is, we can represent 256 different exponents with these 8 bits. We are free to pick any 256 exponents that suit us. For example, we could let 10101010 represent the exponent 2,345,545. But that would be silly. A better choice for the 256 representations is a set of 256 contiguous integer exponents. In fact, what we have done is remove the two representations 00000000 and 11111111, leaving 254 representations, and assign them to the 254 exponents from -126 to +127.
We make the assignment as follows:

00000001 represents the real exponent -126
00000010 represents the real exponent -125
00000011 represents the real exponent -124
00000100 represents the real exponent -123
00000101 represents the real exponent -122
...
01111111 represents the real exponent 0
10000000 represents the real exponent +1
10000001 represents the real exponent +2
10000010 represents the real exponent +3
...
11111110 represents the real exponent +127

The code that allows us to do that we call an excess-127 code because the representation is formed by adding the excess (127) to the real exponent. For example, if I wanted to represent 1 1/2 in floating point, the binary form would be 1.1 * 2^0. Bits [30:23] would contain 01111111, because the real exponent is 0 and if I add 127 to 0, I get 127, which is 01111111.

We could have chosen the excess differently. Someone in class suggested an excess of 128, the midpoint between 0 and 255. We could have chosen 128. But there are good numerical reasons for the choice of 127. It is not worth getting into those reasons in EE 306. The point here is to understand that we can represent numbers in many different ways, and floating point is one of those ways. The exponent in a floating point data type uses this excess-127 code for determining the representation.

Now the fraction part. We said 1 1/2 was 1.1 * 2^0. Since we have 23 bits to represent the fraction, we fill this out with a lot of 0s. If we agree that we always (when possible) represent numbers in normalized form, we do not have to worry about where the binary point is: it is always after the first digit. Furthermore, we do not need to represent that first digit since it is always a 1. So, we do not store the 1 to the left of the binary point. We only store the remaining digits.

The result: 1 1/2, which is 1.1 * 2^0, is represented as:

0 01111111 10000000000000000000000

OK?
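The encoding steps above (add the excess to the real exponent, drop the leading 1, pad the fraction with 0s) can be tried out in a few lines of Python. This is only a sketch for normalized numbers: the names pack_fields and show are made up for illustration, and the fraction is passed in as a string of the digits to the right of the binary point.

```python
import struct

BIAS = 127  # the "excess" in the excess-127 code


def pack_fields(sign, real_exponent, fraction_bits):
    """Build the 32-bit pattern for a normalized number.

    sign           0 for +, 1 for minus
    real_exponent  the exponent of 2, from -126 to +127
    fraction_bits  digits to the right of the binary point, e.g. "1" for 1.1
    (Hypothetical helper, normalized numbers only.)
    """
    if sign not in (0, 1):
        raise ValueError("sign must be 0 or 1")
    if not -126 <= real_exponent <= 127:
        raise ValueError("normalized exponents run from -126 to +127")
    if len(fraction_bits) > 23 or set(fraction_bits) - {"0", "1"}:
        raise ValueError("fraction must be at most 23 binary digits")
    exponent_field = real_exponent + BIAS               # bits [30:23]
    fraction_field = int(fraction_bits.ljust(23, "0"), 2)  # bits [22:0]
    return (sign << 31) | (exponent_field << 23) | fraction_field


def show(pattern):
    """Format a 32-bit pattern as 'sign exponent fraction'."""
    bits = format(pattern, "032b")
    return f"{bits[0]} {bits[1:9]} {bits[9:]}"


one_and_a_half = pack_fields(0, 0, "1")  # +1.1 * 2^0
print(show(one_and_a_half))  # 0 01111111 10000000000000000000000

# Cross-check: reinterpret the same 32 bits as a single-precision float.
assert struct.unpack(">f", struct.pack(">I", one_and_a_half))[0] == 1.5
```

The last line asks Python's struct module to read the same 32 bits as a single-precision float, which gives back 1.5, so the hand-built pattern agrees with what the machine uses.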
If we want to recover the value that is represented, we first note that the exponent field contains a value larger than 00000000 and smaller than 11111111. Thus we have a value represented in normalized form. We see bit 31 is 0. That means we have a positive value. We see bits [30:23] contain 01111111, or 127. We subtract the excess of 127, giving us a real exponent of 0. We see bits [22:0] contain 10000000000000000000000. We know we need to put in the 1 to the left of the binary point that we did not bother to store, giving us: 1.10000000000000000000000 * 2^0, which is 1 1/2.

Next item: NORMALIZING: Suppose we want to determine the representation of a number not in normalized form. How do we normalize it? Say our number is 0.0001111 * 2^1. Normalizing simply means moving the binary point such that there is exactly one nonzero digit to the left of the binary point and then correcting the exponent accordingly. In this case, we need to move the binary point four places to the right, which effectively multiplies the number by 2^4. To compensate, we need to divide the number by 2^4, that is, subtract 4 from the exponent. The result: 1.111 * 2^(-3). Do you see why?

Some numbers are so small that when I adjust the exponent to normalize them, the resulting exponent is less than -126. These numbers cannot be represented in normalized form. They correspond to numbers having magnitude less than 2^(-126) and follow a different representation scheme. In the case of these very small numbers the exponent field is 00000000. In fact, the value zero is one such number. Its representation is

0 00000000 00000000000000000000000

The exponent 11111111 is used for other purposes. For our purposes, two useful concepts we like to represent are +infinity and -infinity. We use

0 11111111 00000000000000000000000 for +infinity, and
1 11111111 00000000000000000000000 for -infinity.

The other 254 exponents are used for normalized numbers, as described above.

Hope the above helps. Next, I will show you how to turn a decimal fraction into normalized form.
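For those who like to check things by running them, here is a small Python sketch of the two procedures just described: normalizing a binary number, and recovering the value from a 32-bit pattern. The names normalize and decode are made up for illustration. The sketch handles only the cases covered in this email (normalized numbers, zero, and the two infinities) and returns None for the bit patterns whose meaning was not explained here.

```python
import math


def normalize(binary_digits, exponent):
    """Normalize binary_digits * 2^exponent, e.g. ("0.0001111", 1).

    Returns (digits, exponent) with exactly one 1 left of the binary point.
    (Hypothetical helper working on digit strings.)
    """
    whole, _, frac = binary_digits.partition(".")
    digits = whole + frac
    point = len(whole)             # the binary point sits after this many digits
    first_one = digits.find("1")
    if first_one == -1:
        raise ValueError("zero has no normalized form")
    # Moving the point right lowers the exponent; moving it left raises it.
    shift = point - (first_one + 1)
    rest = digits[first_one + 1:].rstrip("0")
    return "1." + (rest or "0"), exponent + shift


def decode(pattern):
    """Recover (kind, value) from a 32-bit pattern."""
    sign = pattern >> 31                      # bit 31
    exponent_field = (pattern >> 23) & 0xFF   # bits [30:23]
    fraction_field = pattern & 0x7FFFFF       # bits [22:0]

    if exponent_field == 0b11111111:
        if fraction_field == 0:
            return "infinity", -math.inf if sign else math.inf
        return "not covered here", None       # other uses of 11111111
    if exponent_field == 0b00000000:
        if fraction_field == 0:
            return "zero", -0.0 if sign else 0.0
        return "not covered here", None       # the very small numbers

    # Normalized: restore the unstored leading 1 and subtract the excess.
    significand = 1 + fraction_field / 2**23
    value = significand * 2.0 ** (exponent_field - 127)
    return "normalized", -value if sign else value


print(normalize("0.0001111", 1))  # ('1.111', -3)
print(decode(0x3FC00000))         # ('normalized', 1.5)
```

The pattern 0x3FC00000 is just 0 01111111 10000000000000000000000 written in hexadecimal, so the second line reproduces the 1 1/2 example.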
But since this email is already long enough, I will save the decimal fraction conversion for the next one. Good luck. Yale Patt