Thurs, 17 Oct 2014, 03:57pm

I just got email from a student asking about a point I did not have time
to discuss in class on Wednesday.  And since you do have a problem on this
on the problem set, I decided to respond to all of you.

A student writes:

> Dear Sir,

> I noticed that in problem set 3, we have a question wherein we write the
> vector processing code. The array length however, is more than the number
> of registers in the vector register. Hence, we end up writing the vector
> code twice.
> Considering that vector processors are used for huge data sets, seems like
> this loop-unrolled kind of assembly would lead to huge code length.
> I was wondering if this is how it is done in actual processors?
> Shouldn't the hardware take care of this? I mean, assembly should be
> allowed to have a large value in VLN, and the hardware figures out if it
> needs to execute the loop several times. Doesn't seem too difficult to
> implement in hardware.
> Thanks,
> <<name withheld to protect the student who wants to do it in hardware>>

Thanks for the question.  It gives me the opportunity to teach you another
term that we did not have time to talk about in class on Wednesday.

You are asking what happens if the vector registers have 64 components,
but the loop is to be executed more than 64 times.  That is, for example,
"for i = 1,1000" ...

We could certainly control this in hardware.  It would not be "too"
complicated, but it would likely be complicated enough to increase the
cycle time, and the basic message is to not increase cycle time
unnecessarily, and to try to make the vector computer as streamlined as
possible.

Another mantra is that if it is just as easy to do in software, it is
probably better to do it in software.  This is a case where it is just as
easy to do it in software.  We set VLEN to 64, the stride to whatever
we need it to be, put the starting address in Rx, and do the VLD: VLD V1,Rx.
Then we advance Rx past the 64 elements just loaded (add 64, if the stride
is 1 and addresses count elements) and do the VLD again.  We keep track of
the number of iterations of this loop (in this case, 15), and then do one
last iteration with VLEN set to 40, since 15 x 64 + 40 = 1000.
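
To make that concrete, here is a rough sketch in C of what the loop looks
like for an element-wise add.  This is not real vector code.  The names MVL,
vector_add_strip, and stripmine_add are made up for illustration, the add is
just an example loop body, and the inner scalar loop only stands in for one
VLD/VADD/VST sequence executed with the vector length set to vlen.  It also
assumes a stride of 1.

```c
#include <stddef.h>

/* Assumed maximum vector length: components per vector register. */
#define MVL 64

/* Hypothetical stand-in for one vector instruction sequence
   (VLD, VLD, VADD, VST) operating on vlen components.
   A scalar loop emulates what the vector unit would do. */
static void vector_add_strip(const int *a, const int *b, int *c, size_t vlen)
{
    for (size_t i = 0; i < vlen; i++)
        c[i] = a[i] + b[i];
}

/* Stripmined version of: for i = 0..n-1, c[i] = a[i] + b[i].
   Returns the number of strips executed. */
size_t stripmine_add(const int *a, const int *b, int *c, size_t n)
{
    size_t full_strips = n / MVL;   /* strips run with the length set to MVL */
    size_t remainder   = n % MVL;   /* components left for the last strip    */
    size_t strips = 0;

    for (size_t s = 0; s < full_strips; s++) {
        vector_add_strip(a, b, c, MVL);
        /* Advance the base addresses past the components just processed. */
        a += MVL;
        b += MVL;
        c += MVL;
        strips++;
    }

    if (remainder != 0) {
        /* One last strip with the length set to the leftover count. */
        vector_add_strip(a, b, c, remainder);
        strips++;
    }

    return strips;
}
```

With n = 1000, this runs 15 full strips of 64 and one final strip of 40, so
stripmine_add returns 16.  The vector code itself appears only once; the rest
is a handful of scalar control instructions.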

So, you don't really get the code bloat you would get by replicating the
inner loop, although you certainly do get some control instructions before
you get to the inner loop.

And, the term I want to teach you: STRIPMINING.  You recall the way they do
coal mining in West Virginia, successively stripping off the coal one layer
at a time.  You see the metaphor?  We are stripping off 64 iterations at a
time.

Hope this helps.

Yale Patt