Fri, 13 Feb 2009, 02:30

WARNING: This email is long.  AND, you have a lab assignment to get done
between now and Sunday night.  So, feel free to file this away until you 
have some time.  However, these questions are perceptive, so I decided
to share this with you.

A student writes:

	Dear Prof. Patt,
	I am a graduate student who is taking your EE360N course this semester. 
	I have a few questions related to the LC-3b uArch that I would like to ask you.
	1). I have noticed there are three units in the datapath that 
	do addition: the main ALU, the little one that does +2 for the 
	PC, and the one that calculates the offset. I am wondering why 
	we cannot merge these three adders into the main ALU, adding 
	MUX(es) to select the right signals to feed into the ALU.

You probably could.

	uArch is about dealing with tradeoffs,

Indeed!  uArch is ALL about dealing with tradeoffs.

	 so the pros are: i). less hardware (for speed, a fast adder 
	implementation is needed; the knowledge from the VLSI course 
	tells me that parallel adders, such as the Kogge-Stone adder, 
	use a lot of hardware compared to a MUX).  ii). Smaller chip area.
	iii). Faster cycle speed (smaller area, shorter interconnect wiring,
	smaller interconnect delay). iv). Probably less power consumption.
	And the major con is adding hardware for the MUXes, which I think is 
	less expensive than implementing separate adders.

	I believe that the current datapath design for the LC-3b is the most 
	optimized solution, so there must be something incorrect or missing 
	from what I am thinking right now.

Actually, most of your thinking is not far off track.  The truth is I designed
the data path to make it as simple as possible and still able to do the job,
so I wasn't terribly concerned that I put in three adders.  BUT, if I had been
concerned with optimizing, I probably still would have ended up with this data
path.  Let's look at the pieces one-by-one.

Your optimized-for-speed Kogge-Stone adder.  Reminds me of my analogy in class
the other day.  If the boy runs to the shoemaker and comes back in ten minutes
rather than 20 minutes, when can I leave the shopping center?  Recall that
one?  Answer: 30 minutes - after the other boy comes out of the supermarket.
What's the point?  The Kogge-Stone adder only makes sense if the ALU is on
the critical path.

One ALU vs. three.  To do all of the above with one ALU would require a lot more
wiring (which consumes area and power) and more sophisticated muxes, which 
consume logic, area, power, and time, since the paths are now longer.  My
PC+2 adder takes very little logic, and even my address adder is small 
compared to an ALU, so my guess: when all is said and done, my simple design
with the extra two adders probably ends up saving area, power, and time.

Reducing wiring and getting rid of extra muxes usually is a win.

	2). My second question is about the use of R6 for storing the SSP and 
	USP. I actually asked one of the TAs about this, but I still 
	don't quite get it. The question is: why is it necessary to use 
	R6 as a reserved register for storing the stack pointers? Why can 
	we not use the SSP and USP directly?
	I understand that when there is an interrupt or exception, the current 
	value in R6 will be stored into the USP, and the value in the SSP will be 
	loaded into R6. Then R6 is treated as the supervisor stack pointer: 
	the value in R6 will be decremented by 2 whenever we want to push 
	new stuff onto the stack. Finally, when there is a return after 
	the ISR (via RTI), the current value in R6 will be stored into the SSP 
	register and the one in the USP will be loaded back into R6.
	Now, what I am thinking is: since the SSP and USP registers are 
	physically there, we can use them directly and free up R6. Whenever 
	we want to push new stuff onto the stack, we decrement the value 
	in the SSP by 2 in the previous cycle and load the value in the SSP 
	register into the MAR. The SSP value needs to be incremented by 2 when 
	we want to pop stuff off the stack. The uArch implementation of 
	this can be achieved the way PC=PC+2 is done: use a MUX to 
	select whether we want SSP-2, SSP+2, or just SSP, and we just need 
	some more control signals in the control store.

	The major pro of doing this is that R6 is now free to be used for 
	other general purposes and is no longer reserved for the SSP and USP.
	And a similar argument can be applied to R7 for storing the 
	return PC.  

I guess the question is whether you want to allow information to be pushed
onto and popped off the stack under program control.  That is, do you want the
program or operating system to be able to push/pop the stack?  We can do that

		LDW R0,R6,#0	; pop: read the value at the top of the stack
		ADD R6,R6,#2	; adjust the stack pointer

for example.  If we did it your way, how would you push/pop the stack?  You 
could add PUSH and POP opcodes, but you know how jealously I guard the number
of allowable opcodes.  OR, you make USP and SSP visible to the ISA, and
simply replace R6 in the code above with either USP or SSP.  But now I have
ten registers instead of eight.  If I wanted to have ten registers, wouldn't
it make more sense to add R8 and R9, and leave R6 for handling the USP and SSP?
I think this would be a more efficient use of the registers.
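
(For completeness: the push counterpart of that pop sequence, assuming the usual
LC-3b convention that the stack grows toward lower addresses, would be

		ADD R6,R6,#-2	; make room at the new top of the stack
		STW R0,R6,#0	; store R0 there

so both push and pop are ordinary instructions once the stack pointer lives
in a general purpose register.)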

Finally, your similar argument about R7 for storing the return PC.  What
if you have nested calls/returns?  How would you save the return PC before
clobbering it with another JSR if it were not in a general purpose register?
Similar response as for the USP/SSP.
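
To make that concrete, a routine that itself makes a nested call can save and
restore the return linkage on the stack, precisely because R7 is a general
purpose register (SUBR here is just a placeholder label):

		ADD R6,R6,#-2	; push the current return linkage
		STW R7,R6,#0
		JSR SUBR	; JSR writes a new return PC into R7
		LDW R7,R6,#0	; pop the saved linkage back into R7
		ADD R6,R6,#2
		RET		; return via R7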

	3). I have also noticed that on the LC-3b state diagram, 
	states 10 and 11 lead nowhere. So how will an access to these 
	states be handled? Last time you told us that a good 
	compiler should know everything about the ISA, so a good compiler 
	will give us warnings (or probably errors) whenever the program 
	results in an access to states 10 and 11 (or, in general, 
	to any empty states), but what about at run time?

Whoa!  A compiler takes as input a program written in a high level language,
like JAVA, C, C++, Fortran, COBOL, SNOBOL, PL-1, APL, and on and on.  High
level languages do not have opcodes like 10 and 11.  In fact, the job of the
compiler is to take the high level language program and translate it into a
program in the ISA of the LC-3b.  THEREFORE, a good compiler, knowing that
1010 and 1011 are unused opcodes, would never produce code with those two
opcodes in it.

Right now, if a programmer writes a program in the ISA of the LC-3b, and uses
either of those two opcodes, the LC-3b microarchitecture would be upset, since
they do not correspond to real opcodes.  What will the microarchitecture do?
Stay tuned to the lecture on interrupts and exceptions, which we will have 
BEFORE I ask you to augment the state machine to deal with this so those arcs
are not hanging out in space.  That will be part of the subject of lab 4.
	There are a couple of options that I can think of right now: 
	i). produce an error and halt the running program (not a good solution);
	ii). trap into 10 and 11 and keep looping until a timeout interrupt 
	happens, and then jump to the special ISR for handling this kind of 
	event (probably the better solution).
	I am wondering what a truly good solution would be.

	Those are all the questions that I have so far 
	and thank you very much in advance for your answers.

My pleasure.  I hope your lab 2 is done.  If not, good luck getting it in
before midnight on Sunday.

Yale Patt

	Best Regards,
	<<name withheld to protect the student who is thinking in overdrive>>