Thursday, September 03, 2009 1:18 AM,



Since it is clear that many of you need more explanation with respect to 
floating point numbers, I am writing this.  Hopefully it will help.  Please 
try to read it before Friday's discussion section so you can ask the TA any 
questions on points that are still confusing.

First, the reason we need the floating point data type: Suppose we have
32 bits to represent a number.  With 2's complement, we can represent numbers 
as large as 2^31 - 1.  What if we want to represent much larger numbers, and we 
do not care about doing so with the same degree of precision?

That is, what if we are willing to sacrifice 8 of the bits previously used for 
precision in order to be able to express much, much larger numbers?

While we are at it, if x is a number greater than 1, then 1/x is less than 1.
Wouldn't it also be great to express numbers much less than 1?

So, we define the 32-bit data type as follows:

Bit 31: the sign bit.  0 = plus, 1 = minus.
Bits [30:23]: the exponent bits.
Bits [22:0]: the significand bits, often called the fraction bits.

AND, we can express values as +/- 1.bbbbbbb...b  * 2^(some exponent)
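
To see the three fields on a real machine, here is a minimal Python sketch.  The 
function name float_fields and the sample value -6.5 are my own choices for 
illustration; the only assumption is that the standard struct module's "f" 
format stores the value as a 32-bit floating point number.

```python
import struct

def float_fields(x):
    """Return (sign, exponent field, fraction field) of x stored as a 32-bit float."""
    # Pack x as a big-endian 32-bit float, then reread the same 4 bytes as an unsigned integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # bit 31
    exponent = (bits >> 23) & 0xFF   # bits [30:23]
    fraction = bits & 0x7FFFFF       # bits [22:0]
    return sign, exponent, fraction

# -6.5 is -1.101 * 2^2 in binary (illustrative value).
sign, exponent, fraction = float_fields(-6.5)
print(sign, format(exponent, "08b"), format(fraction, "023b"))
# prints: 1 10000001 10100000000000000000000
```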

With bits [30:23] we can represent 256 different values.  Let them be the 
values from 0 (00000000) to 255 (11111111) - unsigned numbers.  They can each 
represent a different exponent.  That is, we can represent 256 different 
exponents with these 8 bits.  We are free to pick any 256 exponents that suit 
us.  For example, we could let 10101010 represent the exponent 2,345,545.
But that would be silly.

A better choice for the 256 representations is a set of 256 contiguous integer 
exponents.  In fact, what we have done is remove the two representations 
00000000 and 11111111, leaving 254 representations, and assign them to the 
254 exponents from -126 to +127.  We make the assignment as follows:

00000001 represents the real exponent -126
00000010 represents the real exponent -125
00000011 represents the real exponent -124
00000100 represents the real exponent -123
00000101 represents the real exponent -122
...
01111111 represents the real exponent 0
10000000 represents the real exponent +1
10000001 represents the real exponent +2
10000010 represents the real exponent +3
...
11111110 represents the real exponent +127

We call the code that allows us to do this an excess127 code, because the 
representation is formed by adding the excess (127) to the real exponent.
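
The assignment can be sketched in a few lines of Python.  The names EXCESS, 
encode_exponent and decode_exponent are hypothetical, chosen only for this 
illustration.

```python
EXCESS = 127

def encode_exponent(real_exponent):
    """Return the 8-bit exponent field for a real exponent from -126 to +127."""
    if not -126 <= real_exponent <= 127:
        raise ValueError("no normalized representation for this exponent")
    # The representation is the real exponent plus the excess, written as 8 bits.
    return format(real_exponent + EXCESS, "08b")

def decode_exponent(field):
    """Return the real exponent for an 8-bit exponent field given as a string."""
    return int(field, 2) - EXCESS

print(encode_exponent(-126))        # 00000001
print(encode_exponent(0))           # 01111111
print(encode_exponent(127))         # 11111110
print(decode_exponent("10000000"))  # 1
```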

For example, if I wanted to represent 1 1/2 in floating point, the binary form 
would be 1.1 * 2^0.  Bits [30:23] would contain 01111111, because the real 
exponent is 0 and if I add 127 to 0, I get 127 which is 01111111.

We could have chosen the excess differently. Someone in class suggested an 
excess of 128, the midpoint between 0 and 255.  We could have chosen 128.  But 
there are good numerical reasons for the choice of 127.  It is not worth 
getting into the reasons for doing so in EE 306.  The point here is to 
understand that we can represent numbers in many different ways, and floating 
point is one of those ways.  The exponent in a floating point data type uses 
this excess127 code for determining the representation.

Now the fraction part.

We said 1 1/2 was 1.1 * 2^0.  Since we have 23 bits to represent the fraction, 
we fill this out with a lot of 0s.  If we agree that we always (when possible) 
represent numbers in normalized form, we do not have to worry about where the 
binary point is - it is always after the first digit.  Furthermore, we do not 
need to represent that first digit since it is always a 1.  So, we do not 
store the 1 to the left of the binary point.  We only store the remaining 
digits.

The result.

1 1/2 is 1.1 * 2^0 is represented as: 0 01111111 10000000000000000000000

OK?
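
If you want to check this yourself, the following Python sketch assembles the 
three fields for 1 1/2 by hand and compares the result with what the machine 
stores.  The variable names are mine, and it assumes the struct module's "f" 
format uses this same 32-bit layout.

```python
import struct

sign = 0                     # positive
exponent_field = 0 + 127     # real exponent 0 plus the excess
fraction_field = 0b1 << 22   # the ".1" after the binary point, then 22 zeros

# Place each field at its bit position within the 32-bit word.
bits = (sign << 31) | (exponent_field << 23) | fraction_field
pattern = format(bits, "032b")
print(pattern[0], pattern[1:9], pattern[9:])
# prints: 0 01111111 10000000000000000000000

# Compare with the bits the machine stores for 1.5 as a 32-bit float.
(machine_bits,) = struct.unpack(">I", struct.pack(">f", 1.5))
print(machine_bits == bits)  # True
```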

If we want to recover the value that is represented, we first note that the 
exponent field contains an exponent larger than 00000000 and smaller than 
11111111.  Thus we have a value represented in normalized form.

We see bit 31 is 0. That means we have a positive value.
We see bits [30:23] are 01111111, or 127.  We subtract the excess, 127, giving 
us a real exponent of 0.
We see bits [22:0] are 10000000000000000000000. We know we need to put in the
1 to the left of the binary point that we did not bother to store, giving us:

1.10000000000000000000000 * 2^0, which is 1 1/2.
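
The recovery steps above can be sketched in Python as follows.  The function 
name decode_normalized is hypothetical, and the sketch handles only normalized 
values; it refuses the exponent fields 00000000 and 11111111.

```python
def decode_normalized(pattern):
    """Recover the value of a 32-bit pattern such as '0 01111111 1000...0'."""
    s = pattern.replace(" ", "")
    if len(s) != 32:
        raise ValueError("expected 32 bits")
    sign = int(s[0])                   # bit 31
    exponent_field = int(s[1:9], 2)    # bits [30:23]
    fraction_field = int(s[9:], 2)     # bits [22:0]
    if exponent_field in (0, 255):
        raise ValueError("not a normalized value")
    # Put back the 1 to the left of the binary point that was not stored.
    significand = 1 + fraction_field / 2**23
    # Subtract the excess to get the real exponent.
    value = significand * 2.0 ** (exponent_field - 127)
    return -value if sign else value

print(decode_normalized("0 01111111 10000000000000000000000"))  # 1.5
```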


Next item: NORMALIZING:

Suppose we want to determine the representation of a number not in normalized 
form.  How do we normalize it?  Say our number is 0.0001111 * 2^1.

Normalizing simply means moving the binary point such that there is exactly 
one digit to the left of the binary point and then correcting the exponent 
accordingly.  In this case, we need to move the binary point four places to 
the right, which effectively multiplies the number by 2^4.  To compensate, we 
need to divide the number by 2^4, that is, subtract 4 from the exponent: 
1 - 4 = -3.  The result: 1.111 * 2^(-3).  Do you see why?
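
Here is a minimal Python sketch of the same bookkeeping, working on the digits 
as a string.  The function name normalize and its string-based interface are 
assumptions made for illustration only.

```python
def normalize(binary, exponent):
    """Normalize binary * 2^exponent, e.g. ('0.0001111', 1) -> ('1.111', -3)."""
    whole, _, frac = binary.partition(".")
    digits = whole + frac
    point = len(whole)                # the binary point sits after this many digits
    first_one = digits.find("1")
    if first_one == -1:
        raise ValueError("zero has no normalized form")
    # The point must end up just after the leading 1.
    # shift < 0: point moved right (exponent decreases); shift > 0: point moved left.
    shift = point - (first_one + 1)
    rest = digits[first_one + 1:].rstrip("0")
    return "1." + (rest or "0"), exponent + shift

print(normalize("0.0001111", 1))  # ('1.111', -3)
```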

Some numbers are so small that when I divide by the power of two to normalize 
them, the resulting exponent is less than -126.  These numbers cannot be 
represented in normalized form.  They correspond to numbers having magnitude 
less than 2^(-126) and obey a very different representation system.  In the 
case of these very small numbers the exponent field is 00000000.  In fact, the 
value zero is one such number.  Its representation is:


		0 00000000 00000000000000000000000
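
You can see where the exponent field drops to 00000000 with a short Python 
sketch.  The helper names float_bits and exponent_field are mine, and the 
sketch assumes the struct module's "f" format stores values in this 32-bit 
layout.

```python
import struct

def float_bits(x):
    """Return the 32 bits of x, stored as a 32-bit float, as an unsigned integer."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits

def exponent_field(x):
    """Return bits [30:23] of x stored as a 32-bit float."""
    return (float_bits(x) >> 23) & 0xFF

print(exponent_field(2.0 ** -126))    # 1: the smallest normalized magnitude
print(exponent_field(2.0 ** -127))    # 0: too small to normalize
print(format(float_bits(0.0), "032b"))  # all 32 bits are 0
```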

The exponent field 11111111 is used for other purposes.  For our purposes, two 
useful concepts we like to represent are +infinity and -infinity.  We use



		0 11111111 00000000000000000000000 for + infinity, and 
		1 11111111 00000000000000000000000 for - infinity. 
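
As a quick check, this Python sketch feeds the two patterns to the machine and 
asks what values they stand for.  The function name pattern_to_float is 
hypothetical; the struct and math modules are from the standard library.

```python
import math
import struct

def pattern_to_float(pattern):
    """Interpret a 32-bit pattern (spaces allowed) as a 32-bit float."""
    bits = int(pattern.replace(" ", ""), 2)
    # Store the 32 bits as an unsigned integer, then reread them as a float.
    (value,) = struct.unpack(">f", struct.pack(">I", bits))
    return value

plus_inf = pattern_to_float("0 11111111 00000000000000000000000")
minus_inf = pattern_to_float("1 11111111 00000000000000000000000")
print(plus_inf, minus_inf)  # inf -inf
```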


The other 254 exponents are used for normalized numbers, as described above.

Hope the above helps.

I still need to show you how to turn a decimal fraction into normalized form.
But since this email is long enough, I will do that in the next one.

Good luck.
Yale Patt